Initial slurm deployment scripts #1168
base: main
Conversation
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
@lbliii Could you point me to where I should also update this in the docs?
@sarahyurick jfyi, one thing I observed from the timeouts is that the run always hangs right after the url_generation tests for wiki before timing out, whereas successful runs finish the suite in about 12 minutes. Maybe something about one of those tests hangs on these nodes.
Yes I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker. |
########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
| : "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}" | |
| : "${IMAGE:=nvcr.io/nvidia/nemo-curator}" |
Should work? We could also add a comment saying that this script is for 25.09 and above.
I've updated this to latest. Chatting with @thomasdhc, we might start adding a latest tag in addition to the explicit version, which will help here.
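For reference, a minimal sketch of what the updated default could look like, assuming a latest tag is actually published for the image (the tag name is an assumption, not confirmed here):
# Assumed default once a latest tag is available; override IMAGE to pin a specific release.
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:latest}"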
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"
I think we should change this to
export RAY_ADDRESS=$RAY_GCS_ADDRESS
Unfortunately that doesn't work in the current slurm setup because a different container on the head node is started up and that fails due to not finding a file in the /tmp directory that ray usually creates. Similar to this: #1174 (comment).
The higher-level question: connecting to a Ray cluster via the client server port is also a valid way to reach a remote cluster. Any ideas why that isn't working here?
Figured out a solution that uses the ray job submission API instead that seems to work.
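For context, a rough sketch of that approach (the dashboard port and the submitted command are placeholders rather than the exact values from the script):
# Submit the workload through the Ray Jobs API from the head node instead of connecting via ray://.
# 8265 is Ray's default dashboard port; RUN_COMMAND stands in for the actual workload.
ray job submit --address "http://${HEAD_NODE_IP}:8265" --submission-id "${JOB_ID}" -- ${RUN_COMMAND}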
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
3 files reviewed, 3 comments
NODES=${NODES:-$(scontrol show hostnames $(sacct -j ${JOB_ID} -X --json | jq -r .jobs[0].nodes))}
NODES=(${NODES})

HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")
Hard dependency on jq/sacct
Node discovery uses sacct ... --json | jq ... (ray-sbatch-job.sh:108) but neither jq nor sacct is guaranteed to exist on SLURM clusters (and sacct often requires accounting enabled). When either command is missing/fails, NODES becomes empty, HEAD_NODE_NAME is empty (ray-sbatch-job.sh:111), and subsequent srun -w calls will fail in confusing ways. Consider switching to scontrol show hostnames "$SLURM_NODELIST" (available during allocations) or failing fast with a clear message when node discovery fails.
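A minimal sketch of that suggestion (variable names follow the script; SLURM_JOB_NODELIST is the nodelist Slurm sets inside an allocation, and the fail-fast check is an assumption about the desired behavior):
# Prefer the nodelist Slurm provides inside the allocation; fall back to sacct/jq only if available.
if [[ -n "${SLURM_JOB_NODELIST:-}" ]]; then
  NODES=${NODES:-$(scontrol show hostnames "${SLURM_JOB_NODELIST}")}
elif command -v sacct >/dev/null && command -v jq >/dev/null; then
  NODES=${NODES:-$(scontrol show hostnames "$(sacct -j "${JOB_ID}" -X --json | jq -r '.jobs[0].nodes')")}
fi
NODES=(${NODES})
# Fail fast with a clear message instead of letting later srun -w calls fail confusingly.
if [[ ${#NODES[@]} -eq 0 ]]; then
  echo "ERROR: could not determine job nodes; set NODES explicitly" >&2
  exit 1
fi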
HEAD_NODE_NAME=${NODES[0]}
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" bash -c "hostname --ip-address")
Head IP selection can break
HEAD_NODE_IP is derived from hostname --ip-address (ray-sbatch-job.sh:112), which can return multiple addresses (space-separated) or an address on the wrong interface; Ray expects a single reachable IP. In those cases --node-ip-address ${HEAD_NODE_IP} will be invalid and workers will fail to connect. It’s safer to select a single address (e.g., via hostname -I | awk '{print $1}' or site-specific interface selection) or document that clusters must have hostname --ip-address returning exactly one usable IP.
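A minimal sketch of the first option (taking the first reported address is a heuristic; the right interface is ultimately site-specific):
# Take only the first address the head node reports; adjust for site-specific interfaces if needed.
HEAD_NODE_IP=$(srun --jobid ${JOB_ID} --nodes=1 --ntasks=1 -w "$HEAD_NODE_NAME" \
  bash -c "hostname -I | awk '{print \$1}'")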
srun \
  --nodes=1 \
  --overlap \
  -w ${HEAD_NODE_NAME} \
  --container-image=$IMAGE \
  --container-mounts=$CONTAINER_MOUNTS \
  bash -c "ray job submit --address $RAY_DASHBOARD_ADDRESS --submission-id=$JOB_ID -- $RUN_COMMAND"
Job submission URL mismatch
RAY_DASHBOARD_ADDRESS is exported as http://$HEAD_NODE_IP:$DASH_PORT (ray-sbatch-job.sh:120), but ray job submit --address expects the Ray Jobs server address and typically uses the dashboard host/port without the http:// scheme in CLI examples. Passing a URL with scheme can cause ray job submit to error depending on Ray version. Consider exporting RAY_DASHBOARD_ADDRESS as $HEAD_NODE_IP:$DASH_PORT (and maybe a separate RAY_DASHBOARD_URL for humans), or ensure the script uses the exact format Ray CLI accepts.
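A minimal sketch of the suggested split (whether the CLI also accepts the http:// form depends on the Ray version, so treat this as an assumption to verify):
# host:port form for the ray CLI, plus a separate URL for people opening the dashboard in a browser.
RAY_DASHBOARD_ADDRESS=${HEAD_NODE_IP}:${DASH_PORT}
RAY_DASHBOARD_URL=http://${RAY_DASHBOARD_ADDRESS}
export RAY_DASHBOARD_ADDRESS RAY_DASHBOARD_URL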
Description
Adds initial example slurm scripts for single and multi-node runs.
Usage
N/A
Checklist