Commit a4000b0

docs: cherry-pick snapshot checkpointing guide updates (#7245)
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
1 parent d3b92d3 commit a4000b0

File tree

7 files changed (+573 −784 lines changed)

Lines changed: 74 additions & 128 deletions
@@ -1,177 +1,123 @@
 # Dynamo Snapshot Helm Chart
 
-> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.
+> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.
 
-This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including:
-- Persistent Volume Claim (PVC) for checkpoint storage
-- DaemonSet running the CRIU checkpoint agent
-- RBAC resources (ServiceAccount, Role, RoleBinding)
-- Seccomp profile for blocking io_uring syscalls
+This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:
 
-**Note:**
-- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
-- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
+- `snapshot-agent` DaemonSet on GPU nodes
+- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
+- namespace-scoped RBAC
+- the seccomp profile required by CRIU
 
-## Prerequisites
+Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.
 
-⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.
+## Prerequisites
 
 - Kubernetes 1.21+
-- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
-- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
-- NVIDIA driver 580.xx or newer on the target GPU nodes
-- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
-- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
-- RWX (ReadWriteMany) storage class for multi-node deployments
-- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
-
-## Installation
+- x86_64 GPU nodes
+- NVIDIA driver 580.xx or newer
+- containerd runtime
+- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
+- Dynamo Platform already installed, with operator checkpointing enabled
 
-> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.
+The platform/operator configuration must point at the same checkpoint storage that this chart installs:
 
-### Building from Source
-
-```bash
-# Set environment
-export NAMESPACE=my-team # Your target namespace
-export DOCKER_SERVER=your-registry.com/ # Your container registry
-export IMAGE_TAG=latest
-
-# Build Dynamo Snapshot agent image (amd64 only)
-cd deploy/snapshot
-docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
-docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
-cd -
-
-# Install Dynamo Snapshot chart with custom image
-helm install snapshot ./deploy/helm/charts/snapshot/ \
-  --namespace ${NAMESPACE} \
-  --create-namespace \
-  --set daemonset.image.repository=${DOCKER_SERVER}/snapshot-agent \
-  --set daemonset.image.tag=${IMAGE_TAG} \
-  --set daemonset.imagePullSecrets[0].name=your-registry-secret
+```yaml
+dynamo-operator:
+  checkpoint:
+    enabled: true
+    storage:
+      type: pvc
+      pvc:
+        pvcName: snapshot-pvc
+        basePath: /checkpoints
 ```
 
-## Configuration
-
-See `values.yaml` for all configuration options.
+Cross-node restore requires a shared `ReadWriteMany` storage class. The chart defaults to `storage.pvc.accessMode=ReadWriteMany`.
 
-### Key Configuration Options
+For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.
 
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| `storage.type` | Storage type: `pvc` (only supported), `s3` and `oci` planned | `pvc` |
-| `storage.pvc.create` | Create a new PVC | `true` |
-| `storage.pvc.name` | PVC name (must match operator config) | `snapshot-pvc` |
-| `storage.pvc.size` | PVC size | `100Gi` |
-| `storage.pvc.storageClass` | Storage class name | `""` (default) |
-| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/snapshot-agent` |
-| `daemonset.snapshotLogLevel` | Snapshot agent and nsrestore log level (`trace`, `debug`, `info`, `warn`, `error`) | `info` |
-| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
-| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512MB) |
-| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
-| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |
+## Minimal Install
 
-## Usage
-
-After installing this chart, enable checkpointing in your DynamoGraphDeployment:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-model
-  namespace: my-team
-spec:
-  services:
-    worker:
-      checkpoint:
-        enabled: true
-        mode: auto
-        identity:
-          model: Qwen/Qwen3-0.6B
-          backendFramework: vllm
-```
-
-## Multi-Namespace Deployment
-
-To enable checkpointing in multiple namespaces, install this chart in each namespace:
+This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:
 
 ```bash
-# Namespace A
-helm install snapshot nvidia/snapshot -n team-a
-
-# Namespace B
-helm install snapshot nvidia/snapshot -n team-b
+helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set storage.pvc.create=true
 ```
 
-Each namespace will have its own isolated checkpoint storage.
+If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.
 
-## Verification
+Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.
 
-```bash
-# Check PVC
-kubectl get pvc snapshot-pvc -n my-team
+If you already have a PVC, keep the chart in "use existing PVC" mode:
 
-# Check DaemonSet
-kubectl get daemonset -n my-team
+Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC.
 
-# Check DaemonSet pods are running
-kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot
+```bash
+helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set storage.pvc.create=false \
+  --set storage.pvc.name=my-snapshot-pvc
 ```
 
-## Uninstallation
+## Verify
 
 ```bash
-helm uninstall snapshot -n my-team
+kubectl get pvc snapshot-pvc -n ${NAMESPACE}
+kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
+kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
 ```
 
-**Note:** This will NOT delete the PVC by default. To delete the PVC:
+## Important Values
 
-```bash
-kubectl delete pvc snapshot-pvc -n my-team
-```
+| Parameter | Meaning | Default |
+|-----------|---------|---------|
+| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
+| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
+| `storage.pvc.size` | Requested PVC size | `1Ti` |
+| `storage.pvc.storageClass` | Storage class name | `""` |
+| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
+| `storage.pvc.basePath` | Checkpoint root inside the PVC | `/checkpoints` |
+| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
+| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
+| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |
 
-## Troubleshooting
+See [values.yaml](./values.yaml) for the complete configuration surface.
 
-### DaemonSet pods not starting
+## End To End
 
-Check if GPU nodes have the correct labels and runtime class:
+Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:
 
-```bash
-kubectl get nodes -l nvidia.com/gpu.present=true
-kubectl describe node <node-name> | grep -A 5 "Runtime Class"
-```
+- [Snapshot](../../../../docs/kubernetes/snapshot.md)
 
-If nodes don't have the `nvidia.com/gpu.present` label, you can add it:
+## Uninstall
 
 ```bash
-kubectl label node <node-name> nvidia.com/gpu.present=true
+helm uninstall snapshot -n ${NAMESPACE}
 ```
 
-### Checkpoint job fails
-
-Check DaemonSet logs:
+The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:
 
 ```bash
-kubectl logs -n my-team -l app.kubernetes.io/name=snapshot
+kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
 ```
 
-### PVC not mounting
+## Troubleshooting
 
-Check PVC status and events:
+If `snapshot-agent` does not schedule:
 
 ```bash
-kubectl describe pvc snapshot-pvc -n my-team
+kubectl get nodes -l nvidia.com/gpu.present=true
+kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
+kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
 ```
 
-Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments.
-
-## Related Documentation
-
-- [Dynamo Snapshot Overview](../../../../docs/kubernetes/snapshot/README.md) - Dynamo Snapshot architecture and use cases
-- [Dynamo Snapshot with Dynamo Platform](../../../../docs/kubernetes/snapshot/dynamo.md) - Integration guide
-
-## License
+If checkpoint creation never becomes ready, verify all three pieces line up:
 
-Apache License 2.0
+- the operator has `dynamo-operator.checkpoint.enabled=true`
+- the operator PVC name and base path match the snapshot chart values
+- the workload uses a snapshot-capable worker image and command
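The three-way alignment in the checklist above can be sanity-checked offline before installing either release. A minimal sketch, assuming the operator values and the chart values are loaded into dicts shaped like the YAML snippets in this diff (the key names come from those snippets; the helper function itself is hypothetical, not part of the chart):

```python
# Hypothetical helper: cross-check the operator's checkpoint config
# against the snapshot chart values. Key paths mirror the YAML
# snippets shown in this diff.

def check_alignment(operator_values: dict, chart_values: dict) -> list:
    """Return a list of mismatch descriptions (empty list means aligned)."""
    problems = []
    ckpt = operator_values.get("dynamo-operator", {}).get("checkpoint", {})
    if not ckpt.get("enabled"):
        problems.append("operator: checkpoint.enabled is not true")
    op_pvc = ckpt.get("storage", {}).get("pvc", {})
    chart_pvc = chart_values.get("storage", {}).get("pvc", {})
    if op_pvc.get("pvcName") != chart_pvc.get("name"):
        problems.append("PVC name mismatch: operator=%r chart=%r"
                        % (op_pvc.get("pvcName"), chart_pvc.get("name")))
    if op_pvc.get("basePath") != chart_pvc.get("basePath"):
        problems.append("basePath mismatch: operator=%r chart=%r"
                        % (op_pvc.get("basePath"), chart_pvc.get("basePath")))
    return problems

# Example using the defaults shown in this diff:
operator = {"dynamo-operator": {"checkpoint": {
    "enabled": True,
    "storage": {"type": "pvc",
                "pvc": {"pvcName": "snapshot-pvc",
                        "basePath": "/checkpoints"}}}}}
chart = {"storage": {"pvc": {"name": "snapshot-pvc",
                             "basePath": "/checkpoints"}}}
print(check_alignment(operator, chart))  # → []
```

The same check flags a drifted install, e.g. a chart deployed with `storage.pvc.name=my-snapshot-pvc` while the operator still points at `snapshot-pvc`.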

deploy/helm/charts/snapshot/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ storage:
     # PVC name - must match operator configuration
     name: snapshot-pvc
     # PVC size
-    size: 100Gi
+    size: 1Ti
     # Storage class (leave empty for default)
     storageClass: ""
     # Access mode - ReadWriteMany required for multi-pod access
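The bumped `1Ti` default above can still be overridden per namespace at install time. A values-override sketch, assuming the `storage.pvc.*` keys shown in this hunk (the file name and the `fast-rwx` storage class are illustrative, not real chart defaults):

```yaml
# my-snapshot-values.yaml (hypothetical override file)
storage:
  pvc:
    create: true
    name: snapshot-pvc
    size: 2Ti                # larger than the 1Ti chart default
    storageClass: fast-rwx   # illustrative ReadWriteMany-capable class
    accessMode: ReadWriteMany
```

Pass it with `-f my-snapshot-values.yaml` on the `helm upgrade --install` command shown in the README diff.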

docs/index.yml

Lines changed: 2 additions & 5 deletions
@@ -61,11 +61,8 @@ navigation:
         path: kubernetes/autoscaling.md
       - page: Inference Gateway (GAIE)
         path: kubernetes/inference-gateway.md
-      - section: Checkpointing
-        path: kubernetes/snapshot/README.md
-        contents:
-          - page: Integration with Dynamo
-            path: kubernetes/snapshot/dynamo.md
+      - page: Snapshot
+        path: kubernetes/snapshot.md
       - section: Observability (K8s)
         contents:
           - page: Metrics

docs/kubernetes/README.md

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ Key customization points include:
 - **[Operator Documentation](dynamo-operator.md)** - How the platform works
 - **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
 - **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
-- **[Checkpointing](snapshot/README.md)** - Fast pod startup with checkpoint/restore
+- **[Snapshot](snapshot.md)** - Fast pod startup with checkpoint/restore
 - **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
 - **[Logging](observability/logging.md)** - For logging setup
 - **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
