Commit 0c0d237

Adding sbatch script for multi-node deployment support (#336)
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
1 parent e5b51b1 commit 0c0d237

File tree

6 files changed, +505 -17 lines changed


docs/llm/automodel/automodel-ray.md

Lines changed: 59 additions & 2 deletions
@@ -165,7 +165,7 @@ Use the ``query_ray_deployment.py`` script to test your deployed model:
 3. Available parameters for testing:
 - ``--host``: Host address of the Ray Serve server. Default is 0.0.0.0.
 - ``--port``: Port number of the Ray Serve server. Default is 1024.
-- ``--model_id``: Identifier for the model in the API responses. Default is "nemo-model".
+- ``--model_id``: Identifier for the model in the API responses. Default is ``nemo-model``.
 
 ### Configure Advanced Deployments
 
@@ -230,4 +230,61 @@ curl -X POST http://localhost:1024/v1/completions/ \
 5. **GPU Configuration Errors**: Ensure ``--num_gpus`` = ``--num_replicas`` × ``--num_gpus_per_replica``.
 6. **CUDA Device Mismatch**: Make sure the number of devices in ``--cuda_visible_devices`` equals ``--num_gpus``.
 
-For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
+For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
+
+### Multi-node on SLURM using ray.sub
+
+Use `scripts/deploy/utils/ray.sub` to bring up a Ray cluster across multiple SLURM nodes and run your AutoModel deployment automatically. This script starts a Ray head and workers, manages ports, and launches a driver command when the cluster is ready.
+
+- **Script location**: `scripts/deploy/utils/ray.sub`
+- **Upstream reference**: See the NeMo RL cluster setup doc for background on this pattern: [NVIDIA-NeMo RL cluster guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/cluster.md)
+
+#### Prerequisites
+
+- SLURM with container support for `srun --container-image` and `--container-mounts`.
+- A container image that includes Export-Deploy at `/opt/Export-Deploy`.
+- Any model access/auth if required (e.g., `huggingface-cli login` or `HF_TOKEN`).
+
+#### Quick start (2 nodes, 16 GPUs total)
+
+1) Set environment variables used by `ray.sub`:
+
+```bash
+export CONTAINER=nvcr.io/nvidia/nemo:vr
+export MOUNTS="${PWD}/:/opt/checkpoints/"
+export GPUS_PER_NODE=8
+
+# Driver command to run after the cluster is ready (multi-node AutoModel deployment)
+export COMMAND="python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py --model_path meta-llama/Llama-3.2-1B --model_id llama --num_replicas 16 --num_gpus 16 --num_gpus_per_replica 1"
+```
+
+2) Submit the job:
+
+```bash
+sbatch --nodes=2 --account <ACCOUNT> --partition <PARTITION> \
+    --job-name automodel-ray --time 01:00:00 \
+    /opt/Export-Deploy/scripts/deploy/utils/ray.sub
+```
+
+The script will:
+- Start a Ray head on node 0 and one Ray worker per remaining node
+- Wait until all nodes register their resources
+- Launch the `COMMAND` on the head node (driver) once the cluster is healthy
+
+3) Attaching and monitoring:
+- Logs: `$SLURM_SUBMIT_DIR/<jobid>-logs/` contains `ray-head.log` and `ray-worker-<n>.log`.
+- Interactive shell: the job creates `<jobid>-attach.sh`. For head: `bash <jobid>-attach.sh`. For worker i: `bash <jobid>-attach.sh i`.
+- Ray status: once attached to the head container, run `ray status`.
+
+4) Query the deployment (from within the head container):
+
+```bash
+python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
+    --model_id llama --host 0.0.0.0 --port 1024
+```
+
+#### Notes
+
+- Set `--num_gpus` in the deploy command to the total GPUs across all nodes; ensure `--num_gpus = --num_replicas × --num_gpus_per_replica`.
+- If your cluster uses GRES, `ray.sub` auto-detects and sets `--gres=gpu:<GPUS_PER_NODE>`; ensure `GPUS_PER_NODE` matches the node GPU count.
+- You usually do not need to set `--cuda_visible_devices` for multi-node; Ray workers handle per-node visibility.
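The GPU sizing rule in the Notes added above is easy to get wrong when changing node counts. The following is a minimal Python sketch (not part of this commit; the helper name is illustrative) that checks the `--num_gpus` = `--num_replicas` × `--num_gpus_per_replica` constraint before composing the `COMMAND` string for `ray.sub`:

```python
# Not part of this commit: a minimal check of the GPU math described in the Notes
# above. The flag names come from the documented deploy_ray_hf.py options; the
# helper itself is only illustrative.

def validate_gpu_layout(num_gpus: int, num_replicas: int, num_gpus_per_replica: int) -> None:
    """Raise if the flags violate --num_gpus = --num_replicas * --num_gpus_per_replica."""
    expected = num_replicas * num_gpus_per_replica
    if num_gpus != expected:
        raise ValueError(
            f"--num_gpus={num_gpus} but --num_replicas * --num_gpus_per_replica = {expected}"
        )

# Example: the 2-node, 16-GPU quick start above (16 replicas x 1 GPU each).
validate_gpu_layout(num_gpus=16, num_replicas=16, num_gpus_per_replica=1)
```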

docs/llm/nemo_models/in-framework-ray.md

Lines changed: 65 additions & 6 deletions
@@ -2,7 +2,7 @@
 
 This section demonstrates how to deploy NeMo LLM models using Ray Serve (referred to as 'Ray for NeMo Models'). Ray deployment support provides scalable and flexible deployment for NeMo models, offering features such as automatic scaling, load balancing, and multi-replica deployment with support for advanced parallelism strategies.
 
-**Note:** Currently, only single-node deployment is supported.
+**Note:** Single-node examples are shown below. For multi-node clusters managed by SLURM, you can deploy across nodes using the `ray.sub` helper described in the section "Multi-node on SLURM using ray.sub".
 
 ## Quick Example
 
@@ -73,13 +73,13 @@ Follow these steps to deploy your NeMo model on Ray Serve:
 
 Available Parameters:
 
-- ``--nemo_checkpoint``: Path to the .nemo checkpoint file (required).
+- ``--nemo_checkpoint``: Path to the NeMo checkpoint file (required).
 - ``--num_gpus``: Number of GPUs to use per node. Default is 1.
 - ``--tensor_model_parallel_size``: Size of the tensor model parallelism. Default is 1.
 - ``--pipeline_model_parallel_size``: Size of the pipeline model parallelism. Default is 1.
 - ``--expert_model_parallel_size``: Size of the expert model parallelism. Default is 1.
 - ``--context_parallel_size``: Size of the context parallelism. Default is 1.
-- ``--model_id``: Identifier for the model in the API responses. Default is "nemo-model".
+- ``--model_id``: Identifier for the model in the API responses. Default is ``nemo-model``.
 - ``--host``: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
 - ``--port``: Port number to use for the Ray Serve server. Default is 1024.
 - ``--num_cpus``: Number of CPUs to allocate for the Ray cluster. If None, will use all available CPUs.
@@ -91,7 +91,7 @@ Follow these steps to deploy your NeMo model on Ray Serve:
 - ``--num_replicas``: Number of replicas for the deployment. Default is 1.
 - ``--legacy_ckpt``: Whether to use legacy checkpoint format.
 
-3. To use a different model, modify the ``--nemo_checkpoint`` parameter with the path to your .nemo checkpoint file.
+3. To use a different model, modify the ``--nemo_checkpoint`` parameter with the path to your NeMo checkpoint file.
 
 
 ### Configure Model Parallelism
@@ -232,7 +232,7 @@ Use the ``query_ray_deployment.py`` script to test your deployed NeMo model:
 3. Available parameters for testing:
 - ``--host``: Host address of the Ray Serve server. Default is 0.0.0.0.
 - ``--port``: Port number of the Ray Serve server. Default is 1024.
-- ``--model_id``: Identifier for the model in the API responses. Default is "nemo-model".
+- ``--model_id``: Identifier for the model in the API responses. Default is ``nemo-model``.
 
 ### Configure Advanced Deployments
 
@@ -299,4 +299,63 @@ curl -X POST http://localhost:1024/v1/completions/ \
 
 **Note:** Only NeMo 2.0 checkpoints are supported by default. For older checkpoints, use the ``--legacy_ckpt`` flag.
 
-For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
+For more information on Ray Serve, visit the [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html).
+
+### Multi-node on SLURM using ray.sub
+
+Use `scripts/deploy/utils/ray.sub` to bring up a Ray cluster across multiple SLURM nodes and run your in-framework NeMo deployment automatically. This script configures the Ray head and workers, handles ports, and can optionally run a driver command once the cluster is online.
+
+- **Script location**: `scripts/deploy/utils/ray.sub`
+- **Upstream reference**: See the NeMo RL cluster setup doc for background on this pattern: [NVIDIA-NeMo RL cluster guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/cluster.md)
+
+#### Prerequisites
+
+- A SLURM cluster with container support for `srun --container-image` and `--container-mounts`.
+- A container image that includes Export-Deploy at `/opt/Export-Deploy` and the needed dependencies.
+- A `.nemo` checkpoint accessible on the cluster filesystem.
+
+#### Quick start (2 nodes, 16 GPUs total)
+
+1) Set environment variables to parameterize `ray.sub` (these are read by the script at submission time):
+
+```bash
+export CONTAINER=nvcr.io/nvidia/nemo:vr
+export MOUNTS="${PWD}/:/opt/checkpoints/"
+
+# Optional tuning
+export GPUS_PER_NODE=8  # default 8; set to your node GPU count
+
+# Driver command to run after the cluster is ready (multi-node NeMo deployment)
+export COMMAND="python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py --nemo_checkpoint /opt/checkpoints/model.nemo --model_id llama --num_replicas 16 --num_gpus 16"
+```
+
+2) Submit the job (you can override SBATCH directives on the command line):
+
+```bash
+sbatch --nodes=2 --account <ACCOUNT> --partition <PARTITION> \
+    --job-name nemo-ray --time 01:00:00 \
+    /opt/Export-Deploy/scripts/deploy/utils/ray.sub
+```
+
+The script will:
+- Start a Ray head on node 0 and one Ray worker per remaining node
+- Wait until all nodes register their resources
+- Launch the `COMMAND` on the head node (driver) once the cluster is healthy
+
+3) Attaching and monitoring:
+- Logs: `$SLURM_SUBMIT_DIR/<jobid>-logs/` contains `ray-head.log`, `ray-worker-<n>.log`, and (if set) synced Ray logs.
+- Interactive shell: the job creates `<jobid>-attach.sh`. For head: `bash <jobid>-attach.sh`. For worker i: `bash <jobid>-attach.sh i`.
+- Ray status: once attached to the head container, run `ray status`.
+
+4) Query the deployment (from within the head container):
+
+```bash
+python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
+    --model_id llama --host 0.0.0.0 --port 1024
+```
+
+#### Notes
+
+- Set `--num_gpus` in the deploy command to the total GPUs across all nodes; adjust `--num_replicas` and model parallel sizes per your topology.
+- If your cluster uses GRES, `ray.sub` auto-detects and sets `--gres=gpu:<GPUS_PER_NODE>`; ensure `GPUS_PER_NODE` matches the node’s GPU count.
+- You can leave `--cuda_visible_devices` unset for multi-node runs; per-node visibility is managed by Ray workers.
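As a hedged alternative to the `query_ray_deployment.py` call above, a self-contained Python snippet (not part of this commit) can POST directly to the `/v1/completions/` route referenced in the hunk header. The payload fields (`model`, `prompt`, `max_tokens`) are assumptions based on a typical OpenAI-style completions schema rather than the script's exact request format:

```python
# Not part of this commit: query the deployment from the head node without the
# helper script, using the /v1/completions/ route shown in the docs above.
# Payload fields are assumed OpenAI-style; adjust to the server's actual schema.
import json
import urllib.request

URL = "http://localhost:1024/v1/completions/"  # port 1024 is the documented default

payload = {
    "model": "llama",  # matches --model_id in the deploy COMMAND above
    "prompt": "What is Ray Serve?",
    "max_tokens": 64,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))
```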

nemo_deploy/nlp/megatronllm_deployable_ray.py

Lines changed: 2 additions & 1 deletion
@@ -15,6 +15,7 @@
 
 import logging
 import os
+import random
 import time
 from typing import Any, Dict, Optional
 
@@ -173,7 +174,7 @@ def __init__(
 
         # Pre-allocate master port to avoid race conditions between workers
         # Use replica-specific port to avoid conflicts between replicas
-        base_port = 29500 + (replica_id % 100) * 100
+        base_port = random.randint(29500, 29999) + (replica_id % 100) * 100
         master_port = str(find_available_port(base_port, ray._private.services.get_node_ip_address()))
        LOGGER.info(f"Replica {replica_id} - Pre-allocated master port: {master_port}")
 
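The one-line change above replaces a fixed base port with a randomized one, so replicas starting concurrently are less likely to race for the same master port. A standalone sketch of the pattern (not part of this commit; `find_first_free_port` is a simplified, hypothetical stand-in for the repository's `find_available_port`, which also takes the node IP):

```python
# Not part of this commit: a self-contained illustration of the port
# pre-allocation pattern in the hunk above.
import random
import socket

def find_first_free_port(base_port: int, host: str = "127.0.0.1", max_tries: int = 100) -> int:
    """Probe ports starting at base_port and return the first one that binds."""
    for port in range(base_port, base_port + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind((host, port))
                return port
            except OSError:
                continue
    raise RuntimeError(f"No free port found in [{base_port}, {base_port + max_tries})")

replica_id = 3
# Randomized base plus a replica-specific offset, mirroring the updated logic.
base_port = random.randint(29500, 29999) + (replica_id % 100) * 100
master_port = str(find_first_free_port(base_port))
print(f"Replica {replica_id} - pre-allocated master port: {master_port}")
```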

scripts/deploy/nlp/deploy_ray_inframework.py

Lines changed: 8 additions & 6 deletions
@@ -103,7 +103,7 @@ def parse_args():
     parser.add_argument(
         "--cuda_visible_devices",
         type=str,
-        default="0,1",
+        default=None,
         help="Comma-separated list of CUDA visible devices",
     )
     parser.add_argument(
@@ -170,17 +170,19 @@ def main():
     """Main function to deploy a Megatron model using Ray."""
     args = parse_args()
     # Initialize Ray deployment with updated DeployRay class
+    runtime_env = {}
+    if args.cuda_visible_devices is not None:
+        runtime_env["env_vars"] = {
+            "CUDA_VISIBLE_DEVICES": args.cuda_visible_devices,
+        }
+
     ray_deployer = DeployRay(
         num_cpus=args.num_cpus,
         num_gpus=args.num_gpus,
         include_dashboard=args.include_dashboard,
         host=args.host,
         port=args.port,
-        runtime_env={
-            "env_vars": {
-                "CUDA_VISIBLE_DEVICES": args.cuda_visible_devices,
-            }
-        },
+        runtime_env=runtime_env,
     )
 
     # Deploy the inframework model using the updated API
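This refactor makes the `CUDA_VISIBLE_DEVICES` pinning opt-in: the variable is only injected into the Ray `runtime_env` when `--cuda_visible_devices` is passed, which matches the multi-node guidance in the docs above. A minimal sketch of the resulting behavior (not part of this commit; `build_runtime_env` is a hypothetical helper):

```python
# Not part of this commit: a tiny model of the new main() behavior. The helper
# name is hypothetical; the point is that CUDA_VISIBLE_DEVICES is only pinned
# when the flag is given, so multi-node Ray workers keep their own GPU visibility.
from typing import Dict, Optional

def build_runtime_env(cuda_visible_devices: Optional[str]) -> Dict:
    """Return a Ray runtime_env dict, empty unless --cuda_visible_devices was passed."""
    if cuda_visible_devices is None:
        return {}
    return {"env_vars": {"CUDA_VISIBLE_DEVICES": cuda_visible_devices}}

print(build_runtime_env(None))       # {} -> defaults decide visibility
print(build_runtime_env("0,1,2,3"))  # {'env_vars': {'CUDA_VISIBLE_DEVICES': '0,1,2,3'}}
```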
