Skip to content

SGLang

This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using SGLang and dstack.

Apply a configuration

Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.

type: service
name: deepseek-r1

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B

commands:
  - python3 -m sglang.launch_server
     --model-path $MODEL_ID
     --port 8000
     --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

resources:
   gpu: 24GB

type: service
name: deepseek-r1

image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

commands:
  - python3 -m sglang.launch_server
     --model-path $MODEL_ID
     --port 8000
     --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: MI300x
  disk: 300GB

To run a configuration, use the dstack apply command.

$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml

 #  BACKEND  REGION     RESOURCES                         SPOT  PRICE
 1  runpod   EU-RO-1   24xCPU, 283GB, 1xMI300X (192GB)    no    $2.49

Submit the run deepseek-r1? [y/n]: y

Provisioning...
---> 100%

If no gateway is created, the service endpoint will be available at <dstack server URL>/proxy/services/<project name>/<run name>/.

curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer &lt;dstack token&gt;' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'

If a gateway is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at https://deepseek-r1.<gateway domain>/.

Configuration options

PD disaggregation

To run SGLang with PD disaggregation, use replicas groups: one for a router (for example, SGLang Model Gateway), one for prefill workers, and one for decode workers.

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    # For now replica group with router must have count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

Currently, auto-scaling only supports rps as the metric. TTFT and ITL metrics are coming soon.

Cluster

PD disaggregation requires the service to run in a fleet with placement set to cluster, because the replicas require an interconnect between instances.

While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.

Source code

The source-code of these examples can be found in examples/llms/deepseek/sglang and examples/inference/sglang.

What's next?

  1. Read about services and gateways
  2. Browse the SgLang DeepSeek Usage, Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X