Commit fad6f6c

docs: add docs for DGDR usage -- golden path (#7304)
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Parent: a3ca551

File tree

8 files changed: +474 / -123 lines changed


components/src/dynamo/profiler/deploy/profile_sla_aic_dgdr.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

components/src/dynamo/profiler/deploy/profile_sla_dgdr.yaml

Lines changed: 0 additions & 13 deletions
This file was deleted.

components/src/dynamo/profiler/deploy/profile_sla_moe_dgdr.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.

docs/components/profiler/profiler-examples.md

Lines changed: 127 additions & 57 deletions
````diff
@@ -8,60 +8,45 @@ Complete examples for profiling with DGDRs.
 
 ## DGDR Examples
 
-### Dense Model: AIPerf on Real Engines
+### Dense Model: Rapid
 
-Standard online profiling with real GPU measurements:
+Fast profiling (~30 seconds):
 
 ```yaml
 apiVersion: nvidia.com/v1beta1
 kind: DynamoGraphDeploymentRequest
 metadata:
-  name: vllm-dense-online
+  name: qwen-0-6b
 spec:
   model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-
-  workload:
-    isl: 3000
-    osl: 150
-
-  sla:
-    ttft: 200.0
-    itl: 20.0
-
-  autoApply: true
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
 ```
 
-### Dense Model: AI Configurator Simulation
+### Dense Model: Thorough
 
-Fast offline profiling (~30 seconds, TensorRT-LLM only):
+Profiling with real GPU measurements:
 
 ```yaml
 apiVersion: nvidia.com/v1beta1
 kind: DynamoGraphDeploymentRequest
 metadata:
-  name: trtllm-aic-offline
+  name: vllm-dense-online
 spec:
-  model: "Qwen/Qwen3-32B"
-  backend: trtllm
-  image: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
-
-  workload:
-    isl: 4000
-    osl: 500
-
-  sla:
-    ttft: 300.0
-    itl: 10.0
-
-  autoApply: true
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+  searchStrategy: thorough
 ```
 
 ### MoE Model
 
 Multi-node MoE profiling with SGLang:
 
+> [!IMPORTANT]
+> The PVC referenced by `modelCache.pvcName` must already exist in the same namespace and contain
+> the model weights at the specified `pvcModelPath`. The DGDR controller does not create or
+> populate the PVC — it only mounts it into the profiling job and deployed workers.
+
 ```yaml
 apiVersion: nvidia.com/v1beta1
 kind: DynamoGraphDeploymentRequest
@@ -70,53 +55,138 @@ metadata:
 spec:
   model: "deepseek-ai/DeepSeek-R1"
   backend: sglang
-  image: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
-
-  workload:
-    isl: 2048
-    osl: 512
-
-  sla:
-    ttft: 300.0
-    itl: 25.0
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
 
   hardware:
     numGpusPerNode: 8
 
-  autoApply: true
+  modelCache:
+    pvcName: "model-cache"
+    pvcModelPath: "deepseek-r1"  # path within the PVC
 ```
 
-### Using Existing DGD Config (ConfigMap)
+### Private Model
 
-Reference a custom DGD configuration via ConfigMap:
+For gated or private HuggingFace models, pass your token via an environment variable injected
+into the profiling job. Create the secret first:
 
 ```bash
-# Create ConfigMap from your DGD config file
-kubectl create configmap deepseek-r1-config \
-  --from-file=/path/to/your/disagg.yaml \
-  --namespace $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="${HF_TOKEN}" \
+  -n ${NAMESPACE}
 ```
 
+Then reference it in your DGDR:
+
 ```yaml
 apiVersion: nvidia.com/v1beta1
 kind: DynamoGraphDeploymentRequest
 metadata:
-  name: deepseek-r1
+  name: llama-private
 spec:
-  model: deepseek-ai/DeepSeek-R1
-  backend: sglang
-  image: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+  model: "meta-llama/Llama-3.1-8B-Instruct"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+
+  overrides:
+    profilingJob:
+      template:
+        spec:
+          containers: []  # required placeholder; leave empty to inherit defaults
+          initContainers:
+            - name: profiler
+              env:
+                - name: HF_TOKEN
+                  valueFrom:
+                    secretKeyRef:
+                      name: hf-token-secret
+                      key: HF_TOKEN
+```
+
+### Custom SLA Targets
+
+Control how the profiler optimizes your deployment by specifying latency targets and workload
+characteristics.
+
+**Explicit TTFT + ITL targets** (default mode):
+
+```yaml
+apiVersion: nvidia.com/v1beta1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: low-latency-dense
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+
+  sla:
+    ttft: 500  # Time To First Token target in milliseconds
+    itl: 20    # Inter-Token Latency target in milliseconds
 
   workload:
-    isl: 4000
-    osl: 500
+    isl: 2000  # expected input sequence length (tokens)
+    osl: 500   # expected output sequence length (tokens)
+```
 
+**End-to-end latency target** (alternative to ttft+itl):
+
+```yaml
+spec:
+  ...
+  sla:
+    e2eLatency: 10000  # total request latency budget in milliseconds
+```
+
+**Optimization objective without explicit targets** (maximize throughput or minimize latency):
+
+```yaml
+spec:
+  ...
   sla:
-    ttft: 300
-    itl: 10
+    optimizationType: throughput  # or: latency
+```
+
+### Overrides
+
+Use `overrides` to customize the profiling job pod spec — for example to add tolerations for
+GPU node taints or inject environment variables.
+
+**GPU node toleration** (common on GKE and shared clusters):
 
-  autoApply: true
+```yaml
+apiVersion: nvidia.com/v1beta1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: dense-with-tolerations
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
+
+  overrides:
+    profilingJob:
+      template:
+        spec:
+          containers: []  # required placeholder; leave empty to inherit defaults
+          tolerations:
+            - key: nvidia.com/gpu
+              operator: Exists
+              effect: NoSchedule
+```
+
+**Override the generated DynamoGraphDeployment** (e.g., to use a custom worker image):
+
+```yaml
+spec:
+  ...
+  overrides:
+    dgd:
+      apiVersion: nvidia.com/v1alpha1
+      kind: DynamoGraphDeployment
+      spec:
+        services:
+          VllmWorker:
+            extraEnvs:
+              - name: CUSTOM_ENV
+                value: "my-value"
 ```
 
 ## SGLang Runtime Profiling
````
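The DGDR manifests added in this file all share a small common core (`apiVersion`, `kind`, `metadata.name`, `spec.model`, `spec.image`), with optional `sla`, `workload`, and `overrides` sections layered on top. As an offline sanity check of that shape, here is a minimal Python sketch; the `make_dgdr` helper is a hypothetical convenience, not part of Dynamo:

```python
import json

def make_dgdr(name: str, model: str, image: str, **spec_extras) -> dict:
    """Assemble a minimal DynamoGraphDeploymentRequest manifest as a dict.

    Field names mirror the YAML examples above; this helper itself is an
    illustration, not a Dynamo API.
    """
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {"model": model, "image": image, **spec_extras},
    }

dgdr = make_dgdr(
    "qwen-0-6b",
    "Qwen/Qwen3-0.6B",
    "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0",
    sla={"ttft": 500, "itl": 20},  # optional latency targets, as in the SLA example
)
# JSON is a subset of YAML, so this dump can be fed to kubectl apply as-is.
print(json.dumps(dgdr, indent=2))
```

The same helper covers every dense example in this page by swapping `spec_extras` (e.g. `searchStrategy="thorough"` or an `overrides` dict).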

docs/index.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -47,6 +47,8 @@ navigation:
     contents:
       - page: Detailed Installation Guide
         path: kubernetes/installation-guide.md
+      - page: Deploying Your First Model
+        path: kubernetes/dgdr.md
       - page: Dynamo Operator
         path: kubernetes/dynamo-operator.md
       - page: Service Discovery
```

docs/kubernetes/README.md

Lines changed: 5 additions & 19 deletions
````diff
@@ -82,26 +82,12 @@ Each backend has deployment examples and configuration options:
 
 ## 3. Deploy Your First Model
 
-```bash
-export NAMESPACE=dynamo-system
-kubectl create namespace ${NAMESPACE}
-
-# to pull model from HF
-export HF_TOKEN=<Token-Here>
-kubectl create secret generic hf-token-secret \
-  --from-literal=HF_TOKEN="$HF_TOKEN" \
-  -n ${NAMESPACE};
+Follow the **[Deploying Your First Model](dgdr.md)** guide for a complete end-to-end
+walkthrough using `DynamoGraphDeploymentRequest` (DGDR) — Dynamo's recommended path that
+handles profiling and configuration automatically.
 
-# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
-kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
-
-# Check status
-kubectl get dynamoGraphDeployment -n ${NAMESPACE}
-
-# Test it
-kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
-curl http://localhost:8000/v1/models
-```
+The tutorial deploys `Qwen/Qwen3-0.6B` with vLLM and walks you through every step: creating
+the DGDR, watching the profiling lifecycle, and sending your first inference request.
 
 For SLA-based autoscaling, see [SLA Planner Guide](../components/planner/planner-guide.md).
````
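The golden path the README now points to can be sketched end to end as a short script. The manifest mirrors the rapid-profiling example from the profiler docs; the namespace and the `kubectl` invocations (commented out, since they require a live cluster with the Dynamo operator installed) are assumptions:

```shell
# Write the minimal rapid-profiling DGDR from the docs to a local file.
cat > dgdr.yaml <<'EOF'
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
EOF

# On a cluster you would then run (namespace is an assumption):
#   kubectl apply -f dgdr.yaml -n dynamo-system
#   kubectl get dynamographdeploymentrequest -n dynamo-system -w  # watch lifecycle

# Local sanity check: exactly one resource of the expected kind.
grep -c 'kind: DynamoGraphDeploymentRequest' dgdr.yaml  # prints: 1
```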
