NCCL Benchmark Fails: nvidia-smi Not Found in Driver Chroot Environment

## Bug Description

The NCCL benchmark cronjob consistently fails with `nvidia-smi: command not found` when attempting to check for running GPU processes. The issue occurs during the `isRunningProcessOnGPU()` [function call](https://github.com/nebius/soperator/blob/7074897e5568c73271d8b337f7c1a900b522b29b/images/worker/gpubench/main.go#L236), which tries to execute `chroot /run/nvidia/driver nvidia-smi`.

## Error Details

**Error Message:**
```
{"error":"failed to execute nvidia-smi: exit status 127","level":"fatal","msg":"Failed to check running processes on GPU","slurmNode":"worker-0","time":"2025-06-04T14:16:56Z"}
```

**Exit Code:** 127 (command not found)

## Environment

- **Slurm Operator Version:** Based on image `cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.19.0-jammy-slurm24.05.5`
- **Container Runtime:** CRI-O


## Investigation Results
```bash
root@worker-0:/# ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 3 root root  60 May 27 12:39 .
drwxr-xr-x 6 root root 120 May 27 12:38 ..
drwxr-xr-x 2 root root  40 May 27 12:39 etc

root@worker-0:/# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory

root@worker-0:/# which nvidia-smi
/usr/bin/nvidia-smi
```

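Inside the worker container, `/run/nvidia/driver` contains only an `etc` directory, while `nvidia-smi` is available on the regular `PATH` at `/usr/bin/nvidia-smi`.

Could you please advise on this error? Is this a misconfiguration on my side, or does the operator strictly require the `nvidia-smi` binary to be present under `/run/nvidia/driver` in benchmark pods, a requirement my setup does not meet?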

