Conversation

@ayushdg (Contributor) commented Oct 3, 2025

Description

Adds initial example Slurm scripts for single-node and multi-node runs.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • [N/A] New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

@ayushdg (Contributor, Author) commented Oct 3, 2025

@lbliii Could you point me to where I should also update this in the docs?

@ayushdg (Contributor, Author) commented Oct 6, 2025

@sarahyurick Just FYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the suite takes 12 minutes. Maybe there's something about one of the tests that hangs on these nodes.

@sarahyurick (Contributor) replied:

> @sarahyurick Just FYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the suite takes 12 minutes. Maybe there's something about one of the tests that hangs on these nodes.

Yes I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker.

########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"

Reviewer (Contributor):

Suggested change:
- : "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
+ : "${IMAGE:=nvcr.io/nvidia/nemo-curator}"

Should work? We could also add a comment saying that this script is for 25.09 and above.

@ayushdg (Contributor, Author) replied:

I've updated this to latest. After chatting with @thomasdhc, we might start adding a latest tag in addition to the explicit version, which will help here.
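
A minimal sketch of the override pattern being discussed, assuming the script keeps the : "${VAR:=default}" idiom and that a latest tag is published for the image (that tag is the assumption here):

# Default to a floating tag, but let callers pin an explicit release,
# e.g. IMAGE=nvcr.io/nvidia/nemo-curator:25.09 sbatch <script>.sh
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:latest}"
echo "Using container image: ${IMAGE}"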

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"

Reviewer (Contributor):

I think we should change this to

export RAY_ADDRESS=$RAY_GCS_ADDRESS

@ayushdg (Contributor, Author) replied:

Unfortunately that doesn't work in the current Slurm setup, because a different container is started up on the head node and it fails because it can't find a file in the /tmp directory that Ray usually creates. Similar to this: #1174 (comment).

The higher-level question: connecting to a Ray cluster via the client server port is also a valid way to reach a remote cluster. Any ideas why that isn't working here?
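
For context, a minimal sketch of what "connecting via the client server port" looks like, reusing the RAY_ADDRESS form exported by the script (my_pipeline.py is a placeholder for the actual workload):

# Ray Client: the driver runs locally and talks to the remote cluster
# over the ray:// client server port.
RAY_ADDRESS="ray://${HEAD_NODE_IP}:${CLIENT_PORT}" python my_pipeline.py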

@ayushdg (Contributor, Author) replied:

Figured out a solution that uses the Ray job submission API instead, which seems to work.
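
A minimal sketch of that approach, assuming the head node's Ray dashboard/Jobs server is reachable at $HEAD_NODE_IP:$DASH_PORT (variable names follow the script; my_job.py is a placeholder entrypoint):

# Submit the workload through the Ray Jobs API on the head node's
# dashboard port instead of connecting a driver via Ray Client.
ray job submit \
  --address "http://${HEAD_NODE_IP}:${DASH_PORT}" \
  --submission-id "${JOB_ID}" \
  -- python my_job.py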

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

@greptile-apps (bot) left a comment

3 files reviewed, 3 comments

Comment on lines +108 to +112 of ray-sbatch-job.sh:
NODES=${NODES:-$(scontrol show hostnames $(sacct -j ${JOB_ID} -X --json | jq -r .jobs[0].nodes))}
NODES=(${NODES})

HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")

Hard dependency on jq/sacct

Node discovery uses sacct ... --json | jq ... (ray-sbatch-job.sh:108) but neither jq nor sacct is guaranteed to exist on SLURM clusters (and sacct often requires accounting enabled). When either command is missing/fails, NODES becomes empty, HEAD_NODE_NAME is empty (ray-sbatch-job.sh:111), and subsequent srun -w calls will fail in confusing ways. Consider switching to scontrol show hostnames "$SLURM_NODELIST" (available during allocations) or failing fast with a clear message when node discovery fails.
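
A minimal sketch of the suggested fallback, using SLURM_JOB_NODELIST (the current name for the SLURM_NODELIST variable mentioned above, set inside the allocation) and failing fast when discovery yields nothing; variable names match the script, the error text is illustrative:

# Discover allocated nodes without sacct/jq, and fail fast if discovery
# produces an empty node list instead of continuing with empty names.
NODES=${NODES:-$(scontrol show hostnames "${SLURM_JOB_NODELIST}")}
NODES=(${NODES})
if [[ ${#NODES[@]} -eq 0 ]]; then
  echo "ERROR: could not determine allocated nodes (is SLURM_JOB_NODELIST set?)" >&2
  exit 1
fi
HEAD_NODE_NAME=${NODES[0]}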

Comment on lines +111 to +113 of ray-sbatch-job.sh:
HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")

Head IP selection can break

HEAD_NODE_IP is derived from hostname --ip-address (ray-sbatch-job.sh:112), which can return multiple addresses (space-separated) or an address on the wrong interface; Ray expects a single reachable IP. In those cases --node-ip-address ${HEAD_NODE_IP} will be invalid and workers will fail to connect. It’s safer to select a single address (e.g., via hostname -I | awk '{print $1}' or site-specific interface selection) or document that clusters must have hostname --ip-address returning exactly one usable IP.
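
A minimal sketch of selecting a single address, per the suggestion above; taking the first address from hostname -I is only a simple default, and interface selection is site-specific:

# Pick exactly one IP for the head node; hostname -I may return several
# space-separated addresses, so take the first and verify it is non-empty.
HEAD_NODE_IP=$(srun --jobid "${JOB_ID}" --nodes=1 --ntasks=1 -w "${HEAD_NODE_NAME}" \
  bash -c "hostname -I | awk '{print \$1}'")
if [[ -z "${HEAD_NODE_IP}" ]]; then
  echo "ERROR: could not determine an IP address for ${HEAD_NODE_NAME}" >&2
  exit 1
fi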

Comment on lines +205 to +211 of ray-sbatch-job.sh:
srun \
--nodes=1 \
--overlap \
-w ${HEAD_NODE_NAME} \
--container-image=$IMAGE \
--container-mounts=$CONTAINER_MOUNTS \
bash -c "ray job submit --address $RAY_DASHBOARD_ADDRESS --submission-id=$JOB_ID -- $RUN_COMMAND"

Job submission URL mismatch

RAY_DASHBOARD_ADDRESS is exported as http://$HEAD_NODE_IP:$DASH_PORT (ray-sbatch-job.sh:120), but ray job submit --address expects the Ray Jobs server address and typically uses the dashboard host/port without the http:// scheme in CLI examples. Passing a URL with scheme can cause ray job submit to error depending on Ray version. Consider exporting RAY_DASHBOARD_ADDRESS as $HEAD_NODE_IP:$DASH_PORT (and maybe a separate RAY_DASHBOARD_URL for humans), or ensure the script uses the exact format Ray CLI accepts.
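
A minimal sketch of splitting the two forms as suggested, reusing the script's variable names (RAY_DASHBOARD_URL is a hypothetical extra variable; the CLI invocation uses the http URL form from Ray's job submission documentation):

# Keep the bare host:port for tooling and a full URL for humans/browsers.
export RAY_DASHBOARD_ADDRESS="${HEAD_NODE_IP}:${DASH_PORT}"
export RAY_DASHBOARD_URL="http://${RAY_DASHBOARD_ADDRESS}"

ray job submit \
  --address "${RAY_DASHBOARD_URL}" \
  --submission-id "${JOB_ID}" \
  -- ${RUN_COMMAND}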
