Commit 8fb070f

fix: updated stale cluster.md (#30)
Signed-off-by: Terry Kong <terryk@nvidia.com>
1 parent 0524b71 commit 8fb070f

2 files changed: 11 additions and 6 deletions

docs/cluster.md

Lines changed: 10 additions & 5 deletions
@@ -23,13 +23,13 @@ export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
 # Run from the root of NeMo-Reinforcer repo
 NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
 
-COMMAND="bash -c 'uv pip install -e .; uv run ./examples/run_grpo.py'" \
+COMMAND="uv pip install -e .; uv run ./examples/run_grpo_math.py" \
 RAY_DEDUP_LOGS=0 \
 UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
 CONTAINER=YOUR_CONTAINER \
 MOUNTS="$PWD:$PWD" \
 sbatch \
---nodes=$((NUM_ACTOR_NODES + 1)) \
+--nodes=${NUM_ACTOR_NODES} \
 --account=YOUR_ACCOUNT \
 --job-name=YOUR_JOBNAME \
 --partition=YOUR_PARTITION \
@@ -52,6 +52,11 @@ tail -f 1980204-logs/ray-driver.log
 ```
 
 ### Interactive Launching
+
+:::{tip}
+A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the SLURM job queue. This means during debugging sessions, you can avoid submitting a new `sbatch` command each time and instead debug and re-submit your Reinforcer job directly from the interactive session.
+:::
+
 To run interactively, launch the same command as the [Batched Job Submission](#batched-job-submission) except omit the `COMMAND` line:
 ```sh
 # Run from the root of NeMo-Reinforcer repo
@@ -62,7 +67,7 @@ UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
 CONTAINER=YOUR_CONTAINER \
 MOUNTS="$PWD:$PWD" \
 sbatch \
---nodes=$((NUM_ACTOR_NODES + 1)) \
+--nodes=${NUM_ACTOR_NODES} \
 --account=YOUR_ACCOUNT \
 --job-name=YOUR_JOBNAME \
 --partition=YOUR_PARTITION \
@@ -81,9 +86,9 @@ bash 1980204-attach.sh
 ```
 Now that you are on the head node, you can launch the command like so:
 ```sh
-uv venv -p python3.12.9 .venv
+uv venv .venv
 uv pip install -e .
-uv run ./examples/run_grpo.py
+uv run ./examples/run_grpo_math.py
 ```
 
 ## Kubernetes
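
Both `--nodes` hunks in `docs/cluster.md` make the same change: the request drops from `NUM_ACTOR_NODES + 1` to `NUM_ACTOR_NODES`. That matches the unchanged comment saying the head is colocated on `ray-worker-0`. The sketch below only illustrates the before/after arithmetic for the documented value `NUM_ACTOR_NODES=1`. The names `OLD_NODES` and `NEW_NODES` are hypothetical and do not appear in the repo.

```sh
# Illustrative only: node count requested before and after this commit.
NUM_ACTOR_NODES=1

# Before: the docs requested one node on top of the actor nodes.
OLD_NODES=$((NUM_ACTOR_NODES + 1))

# After: the request equals NUM_ACTOR_NODES, since the head shares ray-worker-0.
NEW_NODES=${NUM_ACTOR_NODES}

echo "old --nodes=${OLD_NODES}"
echo "new --nodes=${NEW_NODES}"
```

With the documented default, a reader following the old docs would have asked SLURM for two nodes where one is enough.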

ray.sub

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 #SBATCH --gres=gpu:8
 
 
-set -eou pipefail
+set -eoux pipefail
 
 ########################################################
 # User defined variables
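
The only change to `ray.sub` adds `x` to the `set` flags. In bash, `-e` exits on a failing command, `-u` treats unset variables as errors, `-o pipefail` makes a pipeline fail if any stage fails, and `-x` (xtrace) prints each command to stderr before running it. A minimal sketch of what the extra flag does, run in a child `bash` so the options do not affect the calling shell. The variable names `GREETING` and `TRACE_OUTPUT` are hypothetical and not part of `ray.sub`.

```sh
# Illustrative only: capture what xtrace writes to stderr.
# stdout of the child shell is discarded; stderr is captured.
TRACE_OUTPUT=$(bash -c 'set -eoux pipefail; GREETING=hello; echo "$GREETING"' 2>&1 >/dev/null)

# Each traced command appears on its own line, prefixed with PS4
# (which defaults to "+ " in bash).
echo "$TRACE_OUTPUT"
```

In a batch job this trace presumably lands in the SLURM job's stderr log, which would make it easier to see which step of the launcher script failed. Where that log is written depends on the `#SBATCH` output settings, which are not shown in this diff.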
