SGLang¶
This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using SGLang and dstack.
Apply a configuration¶
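Below are two service configurations that use SGLang. The first deploys DeepSeek-R1-Distill-Llama 8B on a 24GB GPU; the second deploys the 70B model on an AMD MI300X using a ROCm build of the SGLang image.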
type: service
name: deepseek-r1

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B

commands:
  - python3 -m sglang.launch_server
      --model-path $MODEL_ID
      --port 8000
      --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

resources:
  gpu: 24GB

type: service
name: deepseek-r1

image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

commands:
  - python3 -m sglang.launch_server
      --model-path $MODEL_ID
      --port 8000
      --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: MI300x
  disk: 300GB
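The GPU sizes in the two configurations can be sanity-checked with a rough weight-memory estimate. The sketch below assumes 16-bit weights (2 bytes per parameter) and nominal parameter counts of 8 and 70 billion; it ignores the KV cache, activations, and runtime overhead, so it gives a lower bound rather than a sizing rule. The function names are illustrative and not part of dstack or SGLang.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


def weights_fit(params_billion: float, gpu_memory_gb: float,
                bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit; the KV cache still needs extra room."""
    return weight_memory_gb(params_billion, bytes_per_param) < gpu_memory_gb


if __name__ == "__main__":
    # (label, nominal parameters in billions, GPU memory in GB)
    for label, params, gpu_gb in [("8B", 8, 24), ("70B", 70, 192)]:
        needed = weight_memory_gb(params)
        print(f"{label}: ~{needed} GB of weights, GPU has {gpu_gb} GB, "
              f"fits: {weights_fit(params, gpu_gb)}")
```

Under these assumptions the 8B model needs about 16 GB for weights, which leaves some headroom on a 24GB GPU, while the 70B model needs about 140 GB, which is why it targets a single 192GB MI300X.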
To run a configuration, use the dstack apply command.
$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml

 #  BACKEND  REGION   RESOURCES                        SPOT  PRICE
 1  runpod   EU-RO-1  24xCPU, 283GB, 1xMI300X (192GB)  no    $2.49

Submit the run deepseek-r1? [y/n]: y

Provisioning...
---> 100%

If no gateway is created, the service endpoint will be available at <dstack server URL>/proxy/services/<project name>/<run name>/.
curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'

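Because the request sets "stream": true, the reply arrives incrementally. The sketch below shows one way to consume it from Python using only the standard library. It assumes the endpoint emits OpenAI-style server-sent events (lines of the form `data: {...}` ending with `data: [DONE]`); the helper names `iter_content` and `stream_chat` are hypothetical and not part of dstack or SGLang.

```python
import json
import urllib.request


def iter_content(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(url, token, payload):
    """POST the JSON payload and yield text fragments as they arrive."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Iterating over the response yields one bytes line at a time.
        yield from iter_content(response)
```

Calling `stream_chat` with the URL, token, and JSON body from the curl command above, and printing each fragment as it is yielded, should reproduce the streamed reply. Keeping the parsing in `iter_content` makes it possible to test against recorded lines without a running service.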
If a gateway is configured (for example, to enable auto-scaling, HTTPS, or rate limits), the service endpoint will be available at https://deepseek-r1.<gateway domain>/.
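The two endpoint forms above can be captured in a small helper, which is handy when a client has to work both with and without a gateway. This is a minimal sketch: the function names are hypothetical, `example.com` stands in for a real gateway domain, and it assumes the chat completions path is the same `/v1/chat/completions` suffix in both cases.

```python
def proxy_endpoint(server_url: str, project: str, run_name: str) -> str:
    """Service endpoint when no gateway is created."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"


def gateway_endpoint(run_name: str, gateway_domain: str) -> str:
    """Service endpoint when a gateway is configured."""
    return f"https://{run_name}.{gateway_domain}/"


def chat_completions_url(endpoint: str) -> str:
    """Append the chat completions path to a service endpoint."""
    return endpoint.rstrip("/") + "/v1/chat/completions"


if __name__ == "__main__":
    local = proxy_endpoint("http://127.0.0.1:3000", "main", "deepseek-r1")
    print(chat_completions_url(local))
    # example.com is a placeholder gateway domain
    print(chat_completions_url(gateway_endpoint("deepseek-r1", "example.com")))
```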
Configuration options¶
PD disaggregation¶
To run SGLang with prefill-decode (PD) disaggregation, use replica groups: one for a router (for example, SGLang Model Gateway), one for prefill workers, and one for decode workers.
type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    # For now replica group with router must have count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

Currently, auto-scaling only supports rps (requests per second) as the metric. Support for TTFT (time to first token) and ITL (inter-token latency) metrics is coming soon.
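To build intuition for how a count range such as 1..4 interacts with an rps target, the sketch below computes a replica count that keeps per-replica load at or below the target and clamps it to the range. This is an illustrative model only, not dstack's actual autoscaling algorithm; the function name and the rounding rule are assumptions.

```python
import math


def desired_replicas(rps: float, target: float,
                     min_count: int, max_count: int) -> int:
    """Replicas needed so each handles at most `target` requests per second,
    clamped to the configured count range (hypothetical model)."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(rps / target)
    return max(min_count, min(max_count, wanted))


if __name__ == "__main__":
    # Prefill group from the configuration above: count 1..4, target 3
    print(desired_replicas(rps=7, target=3, min_count=1, max_count=4))
    # Decode group from the configuration above: count 1..8, target 2
    print(desired_replicas(rps=7, target=2, min_count=1, max_count=8))
```

Under this model, the lower bound of the range keeps at least one replica running when traffic is idle, and the upper bound caps the group size regardless of load.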
Cluster
PD disaggregation requires the service to run in a fleet with placement set to cluster, because the replicas require an interconnect between instances.
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
Source code¶
The source code for these examples can be found in examples/llms/deepseek/sglang and examples/inference/sglang.
What's next?¶
- Read about services and gateways
- Browse the SGLang DeepSeek Usage guide and Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X