@ereslibre
Member

@ereslibre ereslibre commented Jun 29, 2025

This allows users to keep using docker run --gpus. Although CDI is now the recommended way to expose GPUs to containers, users can keep using the older --gpus method.

Fixes: #419597
Fixes: #241316

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • Nixpkgs 25.11 Release Notes (or backporting 24.11 and 25.05 Nixpkgs Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
  • NixOS 25.11 Release Notes (or backporting 24.11 and 25.05 NixOS Release notes)
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other contributing documentation in corresponding paths.

@ereslibre ereslibre changed the title from "sudnvidia-container-toolkit: reintroduce nvidia runtime wrappers" to "nvidia-container-toolkit: reintroduce nvidia runtime wrappers" Jun 29, 2025
@ereslibre ereslibre marked this pull request as draft June 29, 2025 19:02
@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 6.topic: nvidia Nvidia-specific issues and fixes labels Jun 29, 2025
@nix-owners nix-owners bot requested review from christoph-heiss and cpcloud June 29, 2025 19:08
@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch 6 times, most recently from 68c45d3 to 96de344 Compare June 30, 2025 07:13
@ereslibre ereslibre self-assigned this Jun 30, 2025
@ereslibre ereslibre marked this pull request as ready for review June 30, 2025 07:17
@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch from 96de344 to f89700c Compare June 30, 2025 15:27
@ConnorBaker
Contributor

I'm not opposed to this, but from previous experience this did tend to break quite frequently. Is there some way to attach your handle to this so people know you're the code owner or maintainer? I personally don't have the experience necessary to know how to fix or debug issues with this.

@ereslibre
Member Author

Hi @ConnorBaker!

I don’t know if there is a way to do so. If there is, I am completely fine getting tagged with things like this. I have been maintaining the GPU support through CDI for some time now.

This should not be as brittle as before, since Docker is doing some of the heavy lifting nowadays, even with the nvidia-container-cli. Things are not as brittle as they were with the docker-nvidia wrapper.

@ConnorBaker
Contributor

And to clarify, this shouldn't conflict with using the CDI options with Docker or Podman, correct?

@ereslibre
Member Author

ereslibre commented Jul 1, 2025

@ConnorBaker

And to clarify, this shouldn't conflict with using the CDI options with Docker or Podman, correct?

No, not at all. Both --device and --gpus are supported, and --device (CDI) is recommended.
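
For reference, the two invocation styles discussed here can be put side by side. This is only a minimal sketch: `gpu_run_cmd` is a hypothetical helper (it is not part of Docker or nixpkgs) that prints the command line rather than running it, and the `--device=nvidia.com/gpu=all` and `--gpus all` forms are the ones used later in this thread.

```shell
# Hypothetical helper: print the docker command for a given GPU exposure method.
# It runs nothing, so it can be tried on a machine without a GPU or Docker.
gpu_run_cmd() {
  method="$1"; shift
  case "$method" in
    cdi)  printf 'docker run --device=nvidia.com/gpu=all --rm %s\n' "$*" ;;
    gpus) printf 'docker run --gpus all --rm %s\n' "$*" ;;
    *)    echo "unknown method: $method" >&2; return 1 ;;
  esac
}

gpu_run_cmd cdi ubuntu nvidia-smi    # CDI form (recommended)
gpu_run_cmd gpus ubuntu nvidia-smi   # legacy --gpus form
```

Printing the command instead of executing it keeps the sketch side-effect free; to run either form, paste the printed line into a shell.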

@nixpkgs-ci nixpkgs-ci bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Jul 12, 2025
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/5648

@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch 2 times, most recently from 8288b53 to 49643e0 Compare July 15, 2025 15:12
@nixpkgs-ci nixpkgs-ci bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Jul 15, 2025
@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch from 49643e0 to 60797c3 Compare July 15, 2025 15:16
Contributor

@ConnorBaker ConnorBaker left a comment

Sorry this took so long to get back to you!

@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch from 60797c3 to 6381723 Compare July 16, 2025 19:19
@ereslibre ereslibre requested a review from ConnorBaker July 16, 2025 19:19
@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch from 6381723 to c643e15 Compare July 16, 2025 20:16
@ereslibre ereslibre force-pushed the fix-gpus-argument-with-cdi branch from c643e15 to 117bbae Compare July 16, 2025 20:38
@ConnorBaker
Contributor

Nicely done! Thank you again for working on this -- if you're satisfied with it, let me know and I'll merge it.

@nixpkgs-ci nixpkgs-ci bot added the 12.approvals: 1 This PR was reviewed and approved by one person. label Jul 16, 2025
@ConnorBaker
Contributor

Slightly related @ereslibre -- working on #425862 to fix a few CVEs, I found the tests were broken. I'm not going to get the chance to fix them any time soon, and you're familiar with the module -- would you mind fixing them up? More than happy to review and merge a fix!

@ereslibre
Member Author

@ConnorBaker

Nicely done! Thank you again for working on this -- if you're satisfied with it, let me know and I'll merge it.

Thank you! Yes, please, go ahead and let's merge this one :)

@ereslibre
Member Author

@ConnorBaker

I'm not going to get the chance to fix them any time soon, and you're familiar with the module

I'm going to check it out when I have some time. It could take a couple of weeks, though; I will ping you if/when I have a fix for them.

@ereslibre
Member Author

ereslibre commented Jul 17, 2025

@ConnorBaker I have created a couple of PRs against your repo to fix the tests, both for master and for 25.05:

@ConnorBaker ConnorBaker merged commit fc6bc86 into NixOS:master Jul 17, 2025
25 of 28 checks passed
@ereslibre ereslibre deleted the fix-gpus-argument-with-cdi branch July 17, 2025 08:52
@gregorburger

Hi @ereslibre

I'm on nixos-unstable d0fc308 and --gpus all does not seem to work.

this works:

$ docker run --device=nvidia.com/gpu=all --rm ubuntu nvidia-smi
Thu Sep  4 10:02:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.181                Driver Version: 570.181        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               Off |   00000000:01:00.0 Off |                  Off |
| 41%   31C    P8              6W /  140W |       1MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

this does not:

$ docker run --gpus all --rm ubuntu nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

these two do not work either:

$ docker run --device=nvidia.com/gpu=all --rm nixos/nix nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH: unknown.
$ docker run --device=nvidia.com/gpu=all -v /nix/:/nix --rm ubuntu nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running createContainer hook #0: exit status 1, stdout: , stderr: time="2025-09-04T12:10:13+02:00" level=info msg="Symlinking /var/lib/docker/overlay2/1c08cb1ea72a6c934815ebc36d42276c5c6533879d3d307993e40462cff7a1e8/merged/nix/store/bkrwym33588jsrifg72xs4928l1m6sip-nvidia-x11-570.181-6.12.44/lib/gbm/nvidia-drm_gbm.so to ../libnvidia-allocator.so.1"
time="2025-09-04T12:10:13+02:00" level=error msg="failed to create link [../libnvidia-allocator.so.1 /nix/store/bkrwym33588jsrifg72xs4928l1m6sip-nvidia-x11-570.181-6.12.44/lib/gbm/nvidia-drm_gbm.so]: failed to create symlink: failed to remove existing file: remove /var/lib/docker/overlay2/1c08cb1ea72a6c934815ebc36d42276c5c6533879d3d307993e40462cff7a1e8/merged/nix/store/bkrwym33588jsrifg72xs4928l1m6sip-nvidia-x11-570.181-6.12.44/lib/gbm/nvidia-drm_gbm.so: read-only file system": unknown.

just wanted to know if this is supposed to work before filing an issue.

thanks

@ereslibre
Member Author

ereslibre commented Sep 4, 2025

Hi @gregorburger!

this does not:

$ docker run --gpus all --rm ubuntu nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Can you share your configuration?

these two do not work either:

$ docker run --device=nvidia.com/gpu=all --rm nixos/nix nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH: unknown.

This is expected, as the $PATH does not contain /usr/bin:

❯ docker run --device=nvidia.com/gpu=all --rm nixos/nix sh -c 'echo $PATH'
/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/nix/var/nix/profiles/default/sbin

However, this should work:

❯ docker run --device=nvidia.com/gpu=all --rm nixos/nix /usr/bin/nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-c475e08b-0cc5-f5aa-4326-99699429b449)
GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-5cca1a6f-7cee-b649-40f0-2d3ecb0aa207)
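
The `$PATH` point can be checked mechanically. The sketch below uses a hypothetical helper, `path_has_dir`, to test whether a colon-separated path list contains a given directory, and feeds it the `$PATH` value printed by the nixos/nix image above. The commented-out docker command at the end is an untested assumption, not something confirmed in this thread.

```shell
# Hypothetical helper: succeed if DIR ($2) is one of the colon-separated
# entries of LIST ($1).
path_has_dir() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# $PATH as reported by the nixos/nix image in this thread.
nix_image_path='/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/nix/var/nix/profiles/default/sbin'

if path_has_dir "$nix_image_path" /usr/bin; then
  echo "/usr/bin is on PATH"
else
  echo "/usr/bin is not on PATH; call /usr/bin/nvidia-smi by its full path"
fi

# Untested assumption: extending PATH inside the container should also work.
# docker run --device=nvidia.com/gpu=all --rm nixos/nix sh -c 'PATH="$PATH:/usr/bin" nvidia-smi -L'
```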

@gregorburger

gregorburger commented Sep 8, 2025

@ereslibre Sorry for the late response! ... NixCon

This one works for me as well:
❯ docker run --device=nvidia.com/gpu=all --rm nixos/nix /usr/bin/nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-4dbac442-32b0-cdd9-7fef-a08f32ebb420)

My config is essentially this:

  ...
  hardware.graphics.enable = true;
  services.xserver.videoDrivers = [ "nvidia" ];  
  hardware.nvidia.open = false;   
  
  hardware.nvidia-container-toolkit.enable = true;

  virtualisation.docker.enable = true;
  ...

Relevant users are in the docker group.

@ereslibre
Member Author

@gregorburger,

@ereslibre Sorry for the late response! ... NixCon

Some folks really know how to take good care of themselves! Ha! Hope you enjoyed it! :)

Ah, sorry, alright, I missed the -v /nix:/nix bit on your last command! Alright, so:

  • docker run --gpus all --rm ubuntu nvidia-smi

This is the one that puzzles me based on your config; are you targeting rootful docker?

  • docker run --device=nvidia.com/gpu=all --rm nixos/nix nvidia-smi

Expected.

  • docker run --device=nvidia.com/gpu=all -v /nix/:/nix --rm ubuntu nvidia-smi

Already tracked in #441227, thanks for opening the issue!
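
On the rootful-versus-rootless question: `docker info` lists `rootless` among its security options when the client is talking to a rootless daemon. The following is a minimal sketch under that assumption; `is_rootless` is a hypothetical helper that takes the security-options string as an argument so it can be tried without a running daemon, and the sample string is written by hand for illustration, not taken from this thread.

```shell
# Hypothetical helper: succeed if the given `docker info` security-options
# string mentions rootless mode.
is_rootless() {
  case "$1" in
    *name=rootless*) return 0 ;;
    *)               return 1 ;;
  esac
}

# Against a live daemon (requires docker to be running):
#   opts="$(docker info --format '{{.SecurityOptions}}')"
# Hand-written sample value, for illustration only:
opts='[name=seccomp,profile=builtin name=rootless name=cgroupns]'

if is_rootless "$opts"; then
  echo "rootless daemon"
else
  echo "rootful daemon"
fi
```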

Labels

  • 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)
  • 6.topic: nvidia (Nvidia-specific issues and fixes)
  • 8.has: module (update) (This PR changes an existing module in `nixos/`)
  • 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin.)
  • 10.rebuild-linux: 1-10 (This PR causes between 1 and 10 packages to rebuild on Linux.)
  • 12.approvals: 1 (This PR was reviewed and approved by one person.)

Development

Successfully merging this pull request may close these issues.

  • docker: nvidia container runtime (--gpus all) broken on NixOS 25.05
  • distrobox: Using the GPU inside the container with --nvidia does not work

4 participants