Conversation

@ayushdg (Contributor) commented Oct 3, 2025

Description

Adds initial example Slurm scripts for single-node and multi-node runs.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • [N/A] New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

@ayushdg (Contributor, Author) commented Oct 3, 2025

@lbliii Could you point me to where I should also update this in the docs?

@ayushdg (Contributor, Author) commented Oct 6, 2025

@sarahyurick Just FYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the suite takes 12 minutes. Maybe there's something about one of the tests that hangs on these nodes.

@sarahyurick (Contributor) replied:

> @sarahyurick Just FYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the suite takes 12 minutes. Maybe there's something about one of the tests that hangs on these nodes.

Yes I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker.

########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"

Reviewer (Contributor):

Suggested change:
- : "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
+ : "${IMAGE:=nvcr.io/nvidia/nemo-curator}"

Should work? We could also add a comment saying that this script is for 25.09 and above.

@ayushdg (Contributor, Author) replied:

I've updated this to latest. After chatting with @thomasdhc, we might start adding a latest tag in addition to the explicit version, which will help here.
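
A minimal sketch of the override pattern being discussed, assuming the script keeps the : "${VAR:=default}" idiom and that a latest tag is published for the image (that tag is the assumption here):

# Default to a floating tag, but let callers pin an explicit release,
# e.g. IMAGE=nvcr.io/nvidia/nemo-curator:25.09 sbatch <script>.sh
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:latest}"
echo "Using container image: ${IMAGE}"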

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"

Reviewer (Contributor):

I think we should change this to

export RAY_ADDRESS=$RAY_GCS_ADDRESS

@ayushdg (Contributor, Author) replied:

Unfortunately that doesn't work in the current Slurm setup, because a different container is started up on the head node and it fails because it can't find a file in the /tmp directory that Ray usually creates. Similar to this: #1174 (comment).

The higher-level question: connecting to a Ray cluster via the client server port is also a valid way to reach a remote cluster. Any ideas why that isn't working here?
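
For context, a minimal sketch of what "connecting via the client server port" looks like, reusing the RAY_ADDRESS form exported by the script (my_pipeline.py is a placeholder for the actual workload):

# Ray Client: the driver runs locally and talks to the remote cluster
# over the ray:// client server port.
RAY_ADDRESS="ray://${HEAD_NODE_IP}:${CLIENT_PORT}" python my_pipeline.py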

@ayushdg (Contributor, Author) replied:

Figured out a solution that uses the Ray job submission API instead, which seems to work.
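
A minimal sketch of that approach, assuming the head node's Ray dashboard/Jobs server is reachable at $HEAD_NODE_IP:$DASH_PORT (variable names follow the script; my_job.py is a placeholder entrypoint):

# Submit the workload through the Ray Jobs API on the head node's
# dashboard port instead of connecting a driver via Ray Client.
ray job submit \
  --address "http://${HEAD_NODE_IP}:${DASH_PORT}" \
  --submission-id "${JOB_ID}" \
  -- python my_job.py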

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

@greptile-apps (bot) left a comment

3 files reviewed, 3 comments

Comment on lines +108 to +112 of ray-sbatch-job.sh:
NODES=${NODES:-$(scontrol show hostnames $(sacct -j ${JOB_ID} -X --json | jq -r .jobs[0].nodes))}
NODES=(${NODES})

HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")

Hard dependency on jq/sacct

Node discovery uses sacct ... --json | jq ... (ray-sbatch-job.sh:108) but neither jq nor sacct is guaranteed to exist on SLURM clusters (and sacct often requires accounting enabled). When either command is missing/fails, NODES becomes empty, HEAD_NODE_NAME is empty (ray-sbatch-job.sh:111), and subsequent srun -w calls will fail in confusing ways. Consider switching to scontrol show hostnames "$SLURM_NODELIST" (available during allocations) or failing fast with a clear message when node discovery fails.
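
A minimal sketch of the suggested fallback, using SLURM_JOB_NODELIST (the current name for the SLURM_NODELIST variable mentioned above, set inside the allocation) and failing fast when discovery yields nothing; variable names match the script, the error text is illustrative:

# Discover allocated nodes without sacct/jq, and fail fast if discovery
# produces an empty node list instead of continuing with empty names.
NODES=${NODES:-$(scontrol show hostnames "${SLURM_JOB_NODELIST}")}
NODES=(${NODES})
if [[ ${#NODES[@]} -eq 0 ]]; then
  echo "ERROR: could not determine allocated nodes (is SLURM_JOB_NODELIST set?)" >&2
  exit 1
fi
HEAD_NODE_NAME=${NODES[0]}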

Comment on lines +111 to +113 of ray-sbatch-job.sh:
HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")

Head IP selection can break

HEAD_NODE_IP is derived from hostname --ip-address (ray-sbatch-job.sh:112), which can return multiple addresses (space-separated) or an address on the wrong interface; Ray expects a single reachable IP. In those cases --node-ip-address ${HEAD_NODE_IP} will be invalid and workers will fail to connect. It’s safer to select a single address (e.g., via hostname -I | awk '{print $1}' or site-specific interface selection) or document that clusters must have hostname --ip-address returning exactly one usable IP.
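
A minimal sketch of selecting a single address, per the suggestion above; taking the first address from hostname -I is only a simple default, and interface selection is site-specific:

# Pick exactly one IP for the head node; hostname -I may return several
# space-separated addresses, so take the first and verify it is non-empty.
HEAD_NODE_IP=$(srun --jobid "${JOB_ID}" --nodes=1 --ntasks=1 -w "${HEAD_NODE_NAME}" \
  bash -c "hostname -I | awk '{print \$1}'")
if [[ -z "${HEAD_NODE_IP}" ]]; then
  echo "ERROR: could not determine an IP address for ${HEAD_NODE_NAME}" >&2
  exit 1
fi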

Comment on lines +205 to +211 of ray-sbatch-job.sh:
srun \
--nodes=1 \
--overlap \
-w ${HEAD_NODE_NAME} \
--container-image=$IMAGE \
--container-mounts=$CONTAINER_MOUNTS \
bash -c "ray job submit --address $RAY_DASHBOARD_ADDRESS --submission-id=$JOB_ID -- $RUN_COMMAND"

Job submission URL mismatch

RAY_DASHBOARD_ADDRESS is exported as http://$HEAD_NODE_IP:$DASH_PORT (ray-sbatch-job.sh:120), but ray job submit --address expects the Ray Jobs server address and typically uses the dashboard host/port without the http:// scheme in CLI examples. Passing a URL with scheme can cause ray job submit to error depending on Ray version. Consider exporting RAY_DASHBOARD_ADDRESS as $HEAD_NODE_IP:$DASH_PORT (and maybe a separate RAY_DASHBOARD_URL for humans), or ensure the script uses the exact format Ray CLI accepts.
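
A minimal sketch of splitting the two forms as suggested, reusing the script's variable names (RAY_DASHBOARD_URL is a hypothetical extra variable; the CLI invocation uses the http URL form from Ray's job submission documentation):

# Keep the bare host:port for tooling and a full URL for humans/browsers.
export RAY_DASHBOARD_ADDRESS="${HEAD_NODE_IP}:${DASH_PORT}"
export RAY_DASHBOARD_URL="http://${RAY_DASHBOARD_ADDRESS}"

ray job submit \
  --address "${RAY_DASHBOARD_URL}" \
  --submission-id "${JOB_ID}" \
  -- ${RUN_COMMAND}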
