Model Self-Deployment

Model Self-Deployment allows you to deploy mainstream large models provided by the platform as dedicated instances to a specified region, giving you an exclusive inference service. Compared to shared APIs, self-deployed instances offer predictable concurrency and more stable response latency, making them ideal for production scenarios that require guaranteed throughput, low latency, or data isolation.

Features

  • Elastic replica configuration: Set the maximum number of replicas. The platform automatically scales on demand — each additional replica increases the maximum concurrency accordingly.
  • Flexible resource specs: Configure CPU Cores, Memory, and TPU Count on demand to match the inference resource requirements of different models.
  • Multi-region deployment: Choose from different deployment regions to meet data locality or proximity requirements.
  • OpenAI-compatible API: Self-deployed instances provide an OpenAI-compatible API endpoint (/v1/chat/completions), enabling drop-in replacement for existing integrations with no code changes.
  • One-click API credentials: After successful deployment, view Base URL, Model ID, and API Key directly from the management page, with ready-to-use example code in CURL, Python, and JavaScript.

Pricing

Model Self-Deployment is billed based on the actual resource specs configured. The cost is displayed in real time when creating a deployment (Configuration Cost: ¥XX/Hour). Billing starts after the instance is successfully created and stops when the instance is released. Refer to the compute marketplace for the latest pricing.
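Because billing runs hourly from creation to release, the total cost of a deployment is simply the hourly configuration cost multiplied by the hours the instance was alive. A quick sketch with a hypothetical rate (the actual rate is shown in the deployment dialog):

```python
# Hypothetical hourly rate from the deployment dialog (¥/hour).
hourly_cost = 3.50

# Instance alive from creation to release, e.g. 3 full days.
hours_alive = 3 * 24

# Billing stops the moment the instance is released.
total_cost = hourly_cost * hours_alive
print(f"Estimated cost: ¥{total_cost:.2f}")  # Estimated cost: ¥252.00
```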

Using Model Self-Deployment

Create a Deployment

Navigate to the Model Deployment page in the console. The page displays all currently available models.

Filtering and Search

  • Use the Category tags at the top to filter by model type. Multiple selections are supported. Click the ✕ on a selected tag to deselect it, or click the clear icon to reset all filters.
  • Use the Provider tags to filter by model provider.
  • Enter a model name in the search box for an exact-match lookup.

Deploy a Model

  1. Click Deploy Model on a model card to open the deployment configuration dialog.

  2. Fill in the following fields:

    | Field | Description |
    | --- | --- |
    | Name | Project name, up to 32 characters. Required |
    | Model Name | Auto-filled, read-only |
    | Select Region | Choose a deployment region. Required |
    | Max Replicas | Maximum number of replicas; determines maximum concurrency. See the range hint in parentheses |
    | Concurrency Limit | Maximum concurrent requests per replica, determined by model specs. Read-only |
    | CPU Cores | Number of CPU cores. See the range hint in parentheses |
    | Memory (GB) | Memory size. See the range hint in parentheses |
    | TPU Count | Number of TPUs. See the range hint in parentheses |
  3. The Configuration Cost is shown in real time at the bottom of the dialog. After confirming, click Deploy Now to create the deployment.
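Since the Concurrency Limit is a per-replica ceiling and each additional replica raises the maximum concurrency accordingly, the overall ceiling works out to the per-replica limit times Max Replicas. A minimal sketch with hypothetical dialog values:

```python
# Hypothetical values from the deployment configuration dialog.
max_replicas = 4        # Max Replicas (user-configured, within the range hint)
per_replica_limit = 8   # Concurrency Limit (read-only, set by model specs)

# Each additional replica raises the ceiling by per_replica_limit,
# so the maximum concurrency at full scale-out is the product.
max_concurrency = max_replicas * per_replica_limit
print(max_concurrency)  # 32
```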

If the message "Insufficient resources. Please adjust the configuration and region, or wait for resources to be released" appears, try switching to a different region or reducing the replica/resource configuration.


Manage Deployments

Navigate to the Self Deploy Control page in the console. All deployment records are displayed as a card list.

Instance Status

| Status | Meaning |
| --- | --- |
| Creating | Instance is being created |
| Deploying | Deployment is in progress |
| Deploy Success | Deployment succeeded; the instance is ready for API calls |
| Deploy Failed | Deployment failed |
| Deleting | Instance is being released |

Card Information

Each card displays: Project Name, Create Time, Project ID, model name and ID, Current Workers, Concurrency Limit, Region, and hourly Config Cost.


Get API Credentials

Once the instance status is Deploy Success, click the Get API button on the right side of the card. A side drawer opens with the following information:

| Field | Description |
| --- | --- |
| Base URL | The base address for API requests |
| API Endpoints | Fixed at /v1/chat/completions |
| Model ID | The model identifier to pass when making API calls |
| API Key | Authentication key |

All fields have a copy icon for quick copying. At the bottom of the drawer, CURL, Python, and JavaScript example code is provided, ready to copy and use.
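The drawer's examples are ready to copy as-is. For reference, a minimal Python sketch of such a call using only the standard library; the Base URL, API Key, and Model ID below are placeholders for the values shown in the drawer:

```python
import json
import urllib.request

BASE_URL = "https://example-region.example.com"  # placeholder Base URL
API_KEY = "sk-your-api-key"                      # placeholder API Key
MODEL_ID = "your-model-id"                       # placeholder Model ID

# Build an OpenAI-compatible chat-completions request against the fixed endpoint.
req = urllib.request.Request(
    BASE_URL + "/v1/chat/completions",
    data=json.dumps({
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Say hello."}],
    }).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment to send the request against a live instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape is the standard OpenAI chat-completions schema, existing OpenAI client integrations can be pointed at the instance by swapping only the base URL, key, and model ID.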


Release an Instance

If a deployment is no longer needed, click the Release link on the right side of the card, then click Confirm in the confirmation dialog.

Note: Instance content cannot be recovered after release. Make sure you have saved any required configuration before proceeding.