Model Self-Deployment
Model Self-Deployment allows you to deploy the mainstream large models provided by the platform as dedicated instances in a specified region, giving you an exclusive inference service. Compared with shared APIs, self-deployed instances offer predictable concurrency and more stable response latency, making them ideal for production scenarios that require guaranteed throughput, low latency, or data isolation.
Features
- Elastic replica configuration: Set the maximum number of replicas. The platform automatically scales on demand — each additional replica increases the maximum concurrency accordingly.
- Flexible resource specs: Configure CPU Cores, Memory, and TPU Count on demand to match the inference resource requirements of different models.
- Multi-region deployment: Choose from different deployment regions to meet data locality or proximity requirements.
- OpenAI-compatible API: Self-deployed instances provide an OpenAI-compatible API endpoint (/v1/chat/completions), enabling drop-in replacement for existing integrations with no code changes.
- One-click API credentials: After successful deployment, view the Base URL, Model ID, and API Key directly on the management page, with ready-to-use example code in CURL, Python, and JavaScript.
Pricing
Model Self-Deployment is billed based on the actual resource specs configured. The cost is displayed in real time when creating a deployment (Configuration Cost ¥XX /Hour). Billing starts after the instance is successfully created and stops when the instance is released. Refer to the compute marketplace for the latest pricing.
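Since billing runs from successful creation until release at the hourly Configuration Cost, a lifetime cost estimate is simple arithmetic. The sketch below illustrates this; the hourly rate used is a made-up placeholder, not actual platform pricing.

```python
# Rough cost sketch: billing starts when the instance is successfully
# created and stops when it is released, at the hourly Configuration
# Cost shown in the deployment dialog.
from datetime import datetime

def estimate_cost(hourly_rate_cny: float,
                  created_at: datetime,
                  released_at: datetime) -> float:
    """Estimate total cost in CNY for an instance's lifetime."""
    hours = (released_at - created_at).total_seconds() / 3600
    return round(hourly_rate_cny * hours, 2)

# Placeholder rate of ¥2.5/hour for a 12.5-hour lifetime.
cost = estimate_cost(
    2.5,
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 21, 30),
)
```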
Using Model Self-Deployment
Create a Deployment
Navigate to the Model Deployment page in the console. The page displays all currently available models.
Filtering and Search
- Use the Category tags at the top to filter by model type. Multiple selections are supported. Click the ✕ on a selected tag to deselect it, or click the clear icon to reset all filters.
- Use the Provider tags to filter by model provider.
- Enter a model name in the search box for exact lookup.
Deploy a Model
Click Deploy Model on a model card to open the deployment configuration dialog.
Fill in the following fields:
Fill in the following fields:

| Field | Description |
|---|---|
| Name | Project name, up to 32 characters, required |
| Model Name | Auto-filled, read-only |
| Select Region | Choose a deployment region, required |
| Max Replicas | Maximum number of replicas; determines maximum concurrency. See the range hint in parentheses |
| Concurrency Limit | Maximum concurrent requests per replica; determined by model specs, read-only |
| CPU Cores | Number of CPU cores. See the range hint in parentheses |
| Memory (GB) | Memory size. See the range hint in parentheses |
| TPU Count | Number of TPUs. See the range hint in parentheses |

The Configuration Cost is shown in real time at the bottom of the dialog. After confirming the configuration, click Deploy Now to create the deployment.
If the message "Insufficient resources. Please adjust the configuration and region, or wait for resources to be released" appears, try switching to a different region or reducing the replica/resource configuration.
Manage Deployments
Navigate to the Self Deploy Control page in the console. All deployment records are displayed as a card list.
Instance Status
| Status | Meaning |
|---|---|
| Creating | Instance is being created |
| Deploying | Deployment is in progress |
| Deploy Success | Deployment succeeded, ready for API calls |
| Deploy Failed | Deployment failed |
| Deleting | Instance is being released |
Card Information
Each card displays: Project Name, Create Time, Project ID, model name and ID, Current Workers, Concurrency Limit, Region, and hourly Config Cost.
Get API Credentials
Once the instance status is Deploy Success, click the Get API button on the right side of the card. A side drawer opens with the following information:
| Field | Description |
|---|---|
| Base URL | The base address for API requests |
| API Endpoints | Fixed at /v1/chat/completions |
| Model ID | The model identifier to pass when making API calls |
| API Key | Authentication key |
All fields have a copy icon for quick copying. At the bottom of the drawer, CURL, Python, and JavaScript example code is provided, ready to copy and use.
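Once you have copied these values, a call to the instance looks like any other OpenAI-compatible chat completion request. The sketch below uses only the standard library; the base URL, model ID, and key are placeholders to replace with the values from the drawer.

```python
# Minimal sketch of calling a self-deployed instance through its
# OpenAI-compatible endpoint. BASE_URL, MODEL_ID, and API_KEY are
# placeholders -- copy the real values from the Get API drawer.
import json
import urllib.request

BASE_URL = "https://your-region.example.com"  # placeholder Base URL
MODEL_ID = "your-model-id"                    # placeholder Model ID
API_KEY = "sk-..."                            # placeholder API Key

def build_request(messages: list) -> urllib.request.Request:
    """Build the POST request for the fixed /v1/chat/completions path."""
    body = json.dumps({"model": MODEL_ID, "messages": messages}).encode()
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

def chat(messages: list) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(messages)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example usage (requires a live instance):
# print(chat([{"role": "user", "content": "Hello"}]))
```

Because the endpoint is OpenAI-compatible, any existing OpenAI SDK client should also work by pointing its base URL at the instance.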
Release an Instance
If a deployment is no longer needed, click the Release link on the right side of the card, then click Confirm in the confirmation dialog.
Note: Instance content cannot be recovered after release. Make sure you have saved any required configuration before proceeding.