Commit 200568e

guyueh1 and terrykong authored
feat: Fix nsight profiling file sync for multi-node jobs (#1001)
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
1 parent cbd4b93 commit 200568e

2 files changed, +32 -1 lines changed

docs/nsys-profiling.md

Lines changed: 1 addition & 1 deletion
@@ -91,7 +91,7 @@ If you are not using model parallelism in Vllm, you should directly refer to `vl

 3. **File Location**: Profile files are saved in `/tmp/ray/session*/logs/nsight/` directory on each worker node. Ensure you check both `ls /tmp/ray/session_[0-9]*/logs/nsight` and `ls /tmp/ray/session_latest/logs/nsight` for the profiles, since the "latest" pointer may be stale.

-**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent `$SLURM_JOB_ID-logs/ray` directory.
+**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent directory. The head node's files will be synced to `$SLURM_JOB_ID-logs/ray`, and other nodes' files will be synced to `$SLURM_JOB_ID-logs/ray/$node_ip/`, where `$node_ip` is the IP address of the node.

 ## Analyze Profile Files
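As a rough illustration of the layout described in the note, the following sketch collects synced profiles after a job has run. The helper name `list_synced_profiles`, the `.nsys-rep` extension filter, and the `logs/nsight` subpath under the synced directory are assumptions inferred from the paths above; they are not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ray.sub): list Nsight reports that the
# log sync copied under <log_dir>/ray, covering both the head node's
# directory and any per-node subdirectories.
list_synced_profiles() {
  local log_dir="$1"
  # Assumption: reports end in .nsys-rep and sit inside a logs/nsight directory.
  find "$log_dir/ray" -type f -path '*/logs/nsight/*' -name '*.nsys-rep' 2>/dev/null | sort
}
```

With `RAY_LOG_SYNC_FREQUENCY` set in the environment that submits `ray.sub`, a call such as `list_synced_profiles "$SLURM_JOB_ID-logs"` should print one path per report, assuming the log directory is relative to the current working directory.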

ray.sub

Lines changed: 31 additions & 0 deletions
@@ -312,6 +312,37 @@ monitor-sidecar() {
 }
 monitor-sidecar &

+# Background process to sync ray logs every $RAY_LOG_SYNC_FREQUENCY seconds
+log-sync-sidecar() {
+  set +x
+  if [[ -z "$RAY_LOG_SYNC_FREQUENCY" ]]; then
+    echo "RAY_LOG_SYNC_FREQUENCY is not set, skipping log sync sidecar"
+    return
+  fi
+  mkdir -p $LOG_DIR/ray/$node_i
+  while true; do
+    sleep $RAY_LOG_SYNC_FREQUENCY
+    if ls /tmp/ray/session_[0-9]* > /dev/null 2>&1; then
+      for session_dir in /tmp/ray/session_[0-9]*/; do
+        if [[ -d "\$session_dir/logs" ]]; then
+          session_name=\$(basename "\$session_dir")
+          mkdir -p "$LOG_DIR/ray/$node_i/\$session_name"
+          if command -v rsync > /dev/null 2>&1; then
+            rsync -ahP "\$session_dir/logs/" $LOG_DIR/ray/$node_i/\$session_name/logs/ 2>/dev/null || true
+          else
+            cp -r "\$session_dir/logs" $LOG_DIR/ray/$node_i/\$session_name/
+          fi
+        fi
+      done
+    fi
+    if [[ -f "$LOG_DIR/ENDED" ]]; then
+      echo "Log sync sidecar terminating..."
+      break
+    fi
+  done
+}
+log-sync-sidecar &
+
 # Patch nsight.py before starting Ray worker
 sed -i 's/context\.py_executable = " "\.join(self\.nsight_cmd) + " python"/context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"/g' /opt/nemo_rl_venv/lib64/python*/site-packages/ray/_private/runtime_env/nsight.py
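The core of the sidecar is a single copy pass over every numbered Ray session directory. Below is a minimal standalone sketch of one such pass, written so it can be tried outside SLURM. The function name `sync_session_logs` and its two arguments are hypothetical; the commit itself hard-codes `/tmp/ray` as the source and `$LOG_DIR/ray/$node_i` as the destination.

```shell
#!/usr/bin/env bash
# Hypothetical standalone version of one sync pass (names are illustrative).
# Copies <src_root>/session_<digit>*/logs into <dest_root>/<session>/logs.
sync_session_logs() {
  local src_root="$1" dest_root="$2" session_dir session_name
  for session_dir in "$src_root"/session_[0-9]*/; do
    # An unmatched glob stays literal, so this test also covers "no sessions yet".
    [[ -d "${session_dir}logs" ]] || continue
    session_name=$(basename "$session_dir")
    mkdir -p "$dest_root/$session_name"
    if command -v rsync > /dev/null 2>&1; then
      # Trailing slashes: copy the contents of logs/ into the destination logs/.
      rsync -a "${session_dir}logs/" "$dest_root/$session_name/logs/" || true
    else
      # Fallback when rsync is unavailable: copies everything on each pass.
      cp -r "${session_dir}logs" "$dest_root/$session_name/"
    fi
  done
}
```

Because the `[0-9]` pattern only matches numbered sessions, `session_latest` is skipped, which avoids copying the same logs twice. In the committed loop, a pass like this runs after each `sleep`, and the `ENDED` check comes after the copy, so one more sync happens before the sidecar exits.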

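The mix of `$LOG_DIR` and `\$session_dir` in the `ray.sub` hunk is consistent with the function being emitted through an unquoted heredoc, where unescaped variables expand when the script text is generated and escaped ones are left for the shell that later runs it. The enclosing heredoc is not visible in this diff, so treat that reading as an assumption. The sketch below only demonstrates the underlying shell behavior, using made-up names.

```shell
#!/usr/bin/env bash
# Illustrative only: LOG_DIR and item are made-up names for this demo.
LOG_DIR=/persistent/logs

# Unquoted heredoc: $LOG_DIR expands now, while \$item is kept as a literal
# $item for the shell that later runs the generated text.
generated=$(cat <<EOF
for item in a b; do echo "$LOG_DIR/\$item"; done
EOF
)

printf '%s\n' "$generated"   # the text the inner shell will receive
bash -c "$generated"         # the inner shell expands $item at run time
```

The first command prints the generated loop with the path already filled in and `$item` still unexpanded; the second runs it and prints one path per item.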