6. **CUDA Device Mismatch**: Make sure the number of devices in ``--cuda_visible_devices`` equals ``--num_gpus``.
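For example, a consistent pairing passes four visible devices together with ``--num_gpus 4``; the script name below is only a placeholder for the deployment script used earlier in this guide:

```bash
# Four visible CUDA devices for four requested GPUs.
# "deploy_automodel_ray.py" is a placeholder; substitute the deployment
# script from this guide.
python deploy_automodel_ray.py \
  --cuda_visible_devices "0,1,2,3" \
  --num_gpus 4
```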
For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
### Multi-node on SLURM using ray.sub
Use `scripts/deploy/utils/ray.sub` to bring up a Ray cluster across multiple SLURM nodes and run your AutoModel deployment automatically. This script starts a Ray head and workers, manages ports, and launches a driver command when the cluster is ready.
- **Upstream reference**: See the NeMo RL cluster setup doc for background on this pattern: [NVIDIA-NeMo RL cluster guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/cluster.md)
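A typical submission, following the upstream NeMo RL pattern, might look like the sketch below; the exact `sbatch` flags (account, partition, time, GPU request) depend on your cluster and are placeholders here:

```bash
# Illustrative 2-node submission; adapt flags and values to your cluster.
# CONTAINER, MOUNTS, and GPUS_PER_NODE (see the quick start below) are read
# by ray.sub at submission time.
sbatch \
  --nodes=2 \
  --gres=gpu:8 \
  --account=<your-account> \
  --partition=<your-partition> \
  --time=01:00:00 \
  scripts/deploy/utils/ray.sub
```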
#### Prerequisites
- SLURM with container support for `srun --container-image` and `--container-mounts`.
- A container image that includes Export-Deploy at `/opt/Export-Deploy`.
- Any model access/auth if required (e.g., `huggingface-cli login` or `HF_TOKEN`).
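For gated Hugging Face models, authentication can be set up roughly as follows (a minimal sketch; how the token is propagated into the container depends on your cluster setup):

```bash
# Interactive login on a node with network access:
huggingface-cli login

# Or export a token non-interactively before submitting the job
# (the token value is a placeholder):
export HF_TOKEN=<your-hf-token>
```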
#### Quick start (2 nodes, 16 GPUs total)
1) Set environment variables used by `ray.sub`:
```bash
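# Container image used on all nodes; it must include Export-Deploy at
# /opt/Export-Deploy (see the prerequisites above).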
export CONTAINER=nvcr.io/nvidia/nemo:vr
export MOUNTS="${PWD}/:/opt/checkpoints/"
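# GPUS_PER_NODE should match the GPU count of each allocated node
# (2 nodes x 8 GPUs = 16 GPUs total in this quick start).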
export GPUS_PER_NODE=8
# Driver command to run after the cluster is ready (multi-node AutoModel deployment)
```

**File:** `docs/llm/nemo_models/in-framework-ray.md`
This section demonstrates how to deploy NeMo LLM models using Ray Serve (referred to as 'Ray for NeMo Models'). Ray deployment support provides scalable and flexible deployment for NeMo models, offering features such as automatic scaling, load balancing, and multi-replica deployment with support for advanced parallelism strategies.
**Note:** Single-node examples are shown below. For multi-node clusters managed by SLURM, you can deploy across nodes using the `ray.sub` helper described in the section "Multi-node on SLURM using ray.sub".
## Quick Example
Follow these steps to deploy your NeMo model on Ray Serve:
Available Parameters:
- ``--nemo_checkpoint``: Path to the NeMo checkpoint file (required).
- ``--num_gpus``: Number of GPUs to use per node. Default is 1.
- ``--tensor_model_parallel_size``: Size of the tensor model parallelism. Default is 1.
- ``--pipeline_model_parallel_size``: Size of the pipeline model parallelism. Default is 1.
- ``--expert_model_parallel_size``: Size of the expert model parallelism. Default is 1.
- ``--context_parallel_size``: Size of the context parallelism. Default is 1.
- ``--model_id``: Identifier for the model in the API responses. Default is ``nemo-model``.
- ``--host``: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- ``--port``: Port number to use for the Ray Serve server. Default is 1024.
- ``--num_cpus``: Number of CPUs to allocate for the Ray cluster. If None, will use all available CPUs.
- ``--num_replicas``: Number of replicas for the deployment. Default is 1.
- ``--legacy_ckpt``: Whether to use legacy checkpoint format.
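As an illustration of how the parallelism flags combine, the sketch below splits 8 GPUs into a tensor-parallel size of 4 and a pipeline-parallel size of 2; the script path and checkpoint path are placeholders for the deployment script and checkpoint used in the Quick Example:

```bash
# Placeholder script and checkpoint paths. The product of the tensor- and
# pipeline-parallel sizes is expected to match --num_gpus (4 x 2 = 8 here).
python scripts/deploy/deploy_ray_inframework.py \
  --nemo_checkpoint /opt/checkpoints/<your-nemo-checkpoint> \
  --num_gpus 8 \
  --tensor_model_parallel_size 4 \
  --pipeline_model_parallel_size 2 \
  --model_id nemo-model \
  --host 0.0.0.0 \
  --port 1024
```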
3. To use a different model, modify the ``--nemo_checkpoint`` parameter with the path to your NeMo checkpoint file.
### Configure Model Parallelism
Use the ``query_ray_deployment.py`` script to test your deployed NeMo model:
3. Available parameters for testing:
- ``--host``: Host address of the Ray Serve server. Default is 0.0.0.0.
- ``--port``: Port number of the Ray Serve server. Default is 1024.
- ``--model_id``: Identifier for the model in the API responses. Default is ``nemo-model``.
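Once the server is up, you can also query it directly over HTTP against its OpenAI-style completions endpoint; the JSON field names below follow the OpenAI completions convention and are an assumption here, so adjust them to the deployed API:

```bash
# Direct HTTP query against the deployment started above.
curl -X POST http://localhost:1024/v1/completions/ \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemo-model",
        "prompt": "What is machine learning?",
        "max_tokens": 64
      }'
```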
### Configure Advanced Deployments
**Note:** Only NeMo 2.0 checkpoints are supported by default. For older checkpoints, use the ``--legacy_ckpt`` flag.
For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
### Multi-node on SLURM using ray.sub
Use `scripts/deploy/utils/ray.sub` to bring up a Ray cluster across multiple SLURM nodes and run your in-framework NeMo deployment automatically. This script configures the Ray head and workers, handles ports, and can optionally run a driver command once the cluster is online.
- **Upstream reference**: See the NeMo RL cluster setup doc for background on this pattern: [NVIDIA-NeMo RL cluster guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/cluster.md)
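Once the cluster is online (after the quick start below), a simple sanity check, offered here as a suggestion rather than a documented step, is to run `ray status` from the head node's container and confirm that all nodes and GPUs have registered:

```bash
# Illustrative check: for a 2-node x 8-GPU allocation, the Resources
# section of the output should report 16 GPUs in total.
ray status
```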
#### Prerequisites
- A SLURM cluster with container support for `srun --container-image` and `--container-mounts`.
314
+
- A container image that includes Export-Deploy at `/opt/Export-Deploy` and the needed dependencies.
315
+
- A `.nemo` checkpoint accessible on the cluster filesystem.
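For instance, with the `MOUNTS` value used in the quick start below, a checkpoint kept in your submission directory becomes visible inside every container under `/opt/checkpoints/` (the checkpoint name here is a placeholder):

```bash
# Host side: checkpoint stored in the submission directory, e.g.
#   ${PWD}/my-model-checkpoint/
export MOUNTS="${PWD}/:/opt/checkpoints/"
# Container side: reference the same checkpoint as
#   --nemo_checkpoint /opt/checkpoints/my-model-checkpoint/
```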
#### Quick start (2 nodes, 16 GPUs total)
1) Set environment variables to parameterize `ray.sub` (these are read by the script at submission time):
```bash
export CONTAINER=nvcr.io/nvidia/nemo:vr
export MOUNTS="${PWD}/:/opt/checkpoints/"
# Optional tuning
export GPUS_PER_NODE=8 # default 8; set to your node GPU count
# Driver command to run after the cluster is ready (multi-node NeMo deployment)