Bug Description
The NCCL benchmark cronjob consistently fails with "nvidia-smi: command not found" when checking for running GPU processes. The failure occurs in the isRunningProcessOnGPU() function, which tries to execute chroot /run/nvidia/driver nvidia-smi.
Error Details
Error Message:
{"error":"failed to execute nvidia-smi: exit status 127","level":"fatal","msg":"Failed to check running processes on GPU","slurmNode":"worker-0","time":"2025-06-04T14:16:56Z"}
Exit Code: 127 (command not found)
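Exit status 127 is the shell's "command not found" code, and GNU chroot returns the same code when the target binary is absent inside the new root. A quick sanity check (the command name below is a deliberately nonexistent placeholder) reproduces it:

```shell
# Running a nonexistent binary yields the same exit status 127
# that the benchmark log reports.
sh -c 'definitely-not-a-real-binary' 2>/dev/null
echo "exit status: $?"
```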
Environment
- Slurm Operator Version: based on image cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.19.0-jammy-slurm24.05.5
- Container Runtime: CRI-O
Investigation Results
root@worker-0:/# ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 3 root root 60 May 27 12:39 .
drwxr-xr-x 6 root root 120 May 27 12:38 ..
drwxr-xr-x 2 root root 40 May 27 12:39 etc
root@worker-0:/# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory
root@worker-0:/# which nvidia-smi
/usr/bin/nvidia-smi

Could you please advise on this error?
Is this a misconfiguration on my side, or does the operator have strict placement requirements for the nvidia-smi binary in benchmark pods that I'm not meeting?
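For reference, a possible workaround would be a probe that prefers the driver chroot but falls back to the host binary. This is only an illustrative sketch, not the operator's actual code; the check_gpu_procs name and the /run/nvidia/driver/usr/bin path are assumptions based on the paths shown in this report:

```shell
# check_gpu_procs: illustrative fallback probe (names and paths assumed).
check_gpu_procs() {
    if [ -x /run/nvidia/driver/usr/bin/nvidia-smi ]; then
        # Driver chroot is populated: query it, as the operator does.
        chroot /run/nvidia/driver nvidia-smi
    elif command -v nvidia-smi >/dev/null 2>&1; then
        # Fall back to the host binary (found at /usr/bin/nvidia-smi here).
        nvidia-smi
    else
        echo "nvidia-smi not found in chroot or on PATH" >&2
        return 127
    fi
}
```

On a node like the one above, the first branch would be skipped (the chroot is empty) and the host /usr/bin/nvidia-smi would be used instead.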