NCCL Benchmark Fails: nvidia-smi Not Found in Driver Chroot Environment #968

@dnugmanov

Bug Description

The NCCL benchmark cronjob consistently fails with "nvidia-smi: command not found" when it checks for running GPU processes. The failure occurs in the isRunningProcessOnGPU() function call, which executes chroot /run/nvidia/driver nvidia-smi.

Error Details

Error Message:

{"error":"failed to execute nvidia-smi: exit status 127","level":"fatal","msg":"Failed to check running processes on GPU","slurmNode":"worker-0","time":"2025-06-04T14:16:56Z"}

Exit Code: 127 (command not found)
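
For context, 127 is the conventional exit status a POSIX shell reports when a command cannot be found, and GNU chroot documents the same status when the command it is asked to run cannot be found. A minimal sketch that reproduces the status without touching any GPU tooling (the command name no-such-binary-for-demo is made up for illustration):

```shell
#!/bin/sh
# Run a command that does not exist and capture the resulting exit status.
# "no-such-binary-for-demo" is a hypothetical name, used only for this demo.
sh -c 'no-such-binary-for-demo' 2>/dev/null
status=$?

# A POSIX shell reports 127 when the command cannot be found.
echo "exit status: $status"
```

This only shows where the number comes from; it says nothing about why the binary is missing inside /run/nvidia/driver.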

Environment

  • Slurm Operator Version: Based on image cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.19.0-jammy-slurm24.05.5
  • Container Runtime: CRI-O

Investigation Results

root@worker-0:/# ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 3 root root  60 May 27 12:39 .
drwxr-xr-x 6 root root 120 May 27 12:38 ..
drwxr-xr-x 2 root root  40 May 27 12:39 etc

root@worker-0:/# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory

root@worker-0:/# which nvidia-smi
/usr/bin/nvidia-smi
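
The manual checks above can be sketched as a small script that reports whether nvidia-smi is visible under a given root. This is only a diagnostic sketch, not part of the operator: the helper name find_in_root and the list of searched directories are assumptions, and the two roots compared are the ones that appear in this report.

```shell
#!/bin/sh
# Hypothetical helper: look for an executable named BIN_NAME under SEARCH_ROOT.
# Prints the first matching path and returns 0, or prints nothing and returns 1.
find_in_root() {
    search_root=$1
    bin_name=$2
    # Directories checked inside the root; this list is an assumption.
    for dir in usr/bin usr/local/bin bin usr/local/nvidia/bin; do
        # Strip a trailing slash so that "/" becomes an empty prefix.
        candidate="${search_root%/}/$dir/$bin_name"
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Compare the chroot target from this issue with the container's own root.
for root in /run/nvidia/driver /; do
    if path=$(find_in_root "$root" nvidia-smi); then
        echo "found under $root: $path"
    else
        echo "not found under $root"
    fi
done
```

If the script reports the binary only under /, the chroot target does not contain a driver installation, which matches the ls output above.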

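In short, /run/nvidia/driver contains only an etc directory and no nvidia-smi binary, so the chroot call cannot find the command, while nvidia-smi is available at /usr/bin/nvidia-smi outside the chroot.

Could you please advise on this error? Is this a misconfiguration on my side, or does the operator have strict requirements for where the nvidia-smi binary must be located in benchmark pods that my setup does not meet?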

Labels: bug
