# Dynamo Snapshot Helm Chart

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.

This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:

- `snapshot-agent` DaemonSet on GPU nodes
- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
- namespace-scoped RBAC
- the seccomp profile required by CRIU

Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.

## Prerequisites

- Kubernetes 1.21+
- x86_64 GPU nodes
- NVIDIA driver 580.xx or newer
- containerd runtime
- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
- Dynamo Platform already installed, with operator checkpointing enabled

The platform/operator configuration must point at the same checkpoint storage that this chart installs:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

Cross-node restore requires a shared `ReadWriteMany` storage class; the chart defaults to `storage.pvc.accessMode=ReadWriteMany`. A fast `ReadWriteMany` StorageClass also shortens restore times for the checkpoint PVC.

## Minimal Install

This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.

Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on every eligible node, so a `ReadWriteOnce` claim only works when the agent runs on a single node.

If you already have a PVC, keep the chart in "use existing PVC" mode. Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=false \
  --set storage.pvc.name=my-snapshot-pvc
```

## Verify

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
```

## Important Values

| Parameter | Meaning | Default |
|-----------|---------|---------|
| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
| `storage.pvc.size` | Requested PVC size | `1Ti` |
| `storage.pvc.storageClass` | Storage class name | `""` |
| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
| `storage.pvc.basePath` | Checkpoint root inside the PVC | `/checkpoints` |
| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

See [values.yaml](./values.yaml) for the complete configuration surface.

## End to End

Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:

- [Snapshot](../../../../docs/kubernetes/snapshot.md)

## Uninstall

```bash
helm uninstall snapshot -n ${NAMESPACE}
```

The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:

```bash
kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
```

## Troubleshooting

If `snapshot-agent` does not schedule:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
```

If checkpoint creation never becomes ready, verify all three pieces line up:

- the operator has `dynamo-operator.checkpoint.enabled=true`
- the operator PVC name and base path match the snapshot chart values
- the workload uses a snapshot-capable worker image and command
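The name and path correspondence can be checked directly in the two values sets. The snippet below simply restates this README's defaults side by side; it is a reference sketch, not something to install as-is:

```yaml
# Operator side (Dynamo Platform values)
dynamo-operator:
  checkpoint:
    storage:
      pvc:
        pvcName: snapshot-pvc   # must equal storage.pvc.name in this chart
        basePath: /checkpoints  # must equal storage.pvc.basePath in this chart

# Chart side (this chart's values)
storage:
  pvc:
    name: snapshot-pvc
    basePath: /checkpoints
```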