Commit 2aa2397

Update CDI service instructions
Signed-off-by: Evan Lezar <[email protected]>
1 parent 74ad2ed commit 2aa2397

1 file changed

+62
-76
lines changed

container-toolkit/cdi-support.md

Lines changed: 62 additions & 76 deletions
@@ -34,32 +34,46 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f
3434

3535
As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:
3636

37-
- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
38-
- Runs automatically on system boot to ensure the specification is up to date
37+
- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when:
38+
- The NVIDIA Container Toolkit is installed or upgraded
39+
- The NVIDIA GPU drivers are installed or upgraded
40+
- The system is rebooted
3941

40-
```{note}
41-
The automatic CDI refresh service does not handle:
42-
- Driver removal (the CDI file is intentionally preserved)
43-
- MIG device reconfiguration
42+
This ensures that the CDI specifications are up to date for the current driver
43+
and device configuration and that CDI Devices defined in these specifications are
44+
available when using native CDI support in container engines such as Docker or Podman.
4445

45-
For these scenarios, you may still need to manually regenerate the CDI specification. See [Manual CDI Specification Generation](#manual-cdi-specification-generation) for instructions.
46+
Running the following command will give a list of available CDI Devices:
47+
```console
48+
nvidia-ctk cdi list
4649
```
4750

48-
#### Customizing the Automatic CDI Refresh Service
51+
#### Known limitations
52+
The `nvidia-cdi-refresh` service does not currently handle the following situations:
53+
54+
- The removal of NVIDIA GPU drivers
55+
- The reconfiguration of MIG devices
56+
57+
For these scenarios, the regeneration of CDI specifications must be [manually triggered](#manual-cdi-specification-generation).
4958

50-
You can customize the behavior of the `nvidia-cdi-refresh` service by adding environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env`. This file is read by the service and allows you to modify the `nvidia-ctk cdi generate` command behavior.
59+
#### Customizing the Automatic CDI Refresh Service
60+
The behavior of the `nvidia-cdi-refresh` service can be customized by adding
61+
environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env` to
62+
affect the behavior of the `nvidia-ctk cdi generate` command.
5163

52-
Example configuration file:
64+
As an example, to enable debug logging the configuration file should be updated
65+
as follows:
5366
```bash
5467
# /etc/nvidia-container-toolkit/cdi-refresh.env
5568
NVIDIA_CTK_DEBUG=1
56-
# Add other nvidia-ctk environment variables as needed
5769
```
5870

5971
For a complete list of available environment variables, run `nvidia-ctk cdi generate --help` to see the command's documentation.
6072

6173
```{important}
62-
After modifying the environment file, you must reload the systemd daemon and restart the service for changes to take effect:
74+
Modifications to the environment file require a systemd reload and a restart of the
75+
service to take effect.
76+
```
6377

6478
```console
6579
$ sudo systemctl daemon-reload
@@ -70,19 +84,24 @@ $ sudo systemctl restart nvidia-cdi-refresh.service
7084

7185
The `nvidia-cdi-refresh` service consists of two systemd units:
7286

73-
- `nvidia-cdi-refresh.path` - Monitors for changes to driver files and triggers the service
74-
- `nvidia-cdi-refresh.service` - Executes the CDI specification generation
87+
- `nvidia-cdi-refresh.path`: Monitors for changes to the system and triggers the service
88+
- `nvidia-cdi-refresh.service`: Generates the CDI specifications for all available devices based on
89+
the default configuration and any overrides in the environment file.
7590

76-
You can manage these services using standard systemd commands:
91+
These services can be managed using standard systemd commands.
92+
93+
When working as expected, the `nvidia-cdi-refresh.path` service will be enabled and active, and the
94+
`nvidia-cdi-refresh.service` will be enabled and have run at least once. For example:
7795

7896
```console
79-
# Check service status
8097
$ sudo systemctl status nvidia-cdi-refresh.path
8198
● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events
8299
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled)
83100
Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago
84101
Triggers: ● nvidia-cdi-refresh.service
102+
```
85103

104+
```console
86105
$ sudo systemctl status nvidia-cdi-refresh.service
87106
○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file
88107
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled)
@@ -91,87 +110,54 @@ TriggeredBy: ● nvidia-cdi-refresh.path
91110
Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS)
92111
Main PID: 1317511 (code=exited, status=0/SUCCESS)
93112
CPU: 562ms
94-
95-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi"
96-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump"
97-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced"
98-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control"
99-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server"
100-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex: pattern nvidia-imex not found"
101-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex-ctl: pattern nvidia-imex-ctl not found"
102-
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Generated CDI spec with version 1.0.0"
103-
Jun 27 00:04:30 ipp2-0502 systemd[1]: nvidia-cdi-refresh.service: Succeeded.
104-
Jun 27 00:04:30 ipp2-0502 systemd[1]: Started Refresh NVIDIA CDI specification file.
113+
...
105114
```
106115

107-
You can enable/disable the automatic CDI refresh service using the following commands:
116+
If these are not enabled as expected, they can be enabled by running:
108117

109118
```console
110119
$ sudo systemctl enable --now nvidia-cdi-refresh.path
111120
$ sudo systemctl enable --now nvidia-cdi-refresh.service
112-
$ sudo systemctl disable nvidia-cdi-refresh.service
113-
$ sudo systemctl disable nvidia-cdi-refresh.path
114121
```
115122

116-
You can also view the service logs to see the output of the CDI generation process.
123+
#### Troubleshooting CDI Specification Generation and Resolution
124+
125+
If CDI specifications for available devices are not generated / updated as expected, it is
126+
recommended that the logs for the `nvidia-cdi-refresh.service` be checked. This can be
127+
done by running:
117128

118129
```console
119-
# View service logs
120130
$ sudo journalctl -u nvidia-cdi-refresh.service
121131
```
122132

123-
### Manual CDI Specification Generation
124-
125-
If you need to manually generate a CDI specification, for example, after MIG configuration changes or if you are using a Container Toolkit version before v1.18.0, follow this procedure:
126-
127-
Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`.
128-
The contents of the `/var/run/cdi/` directory are cleared on boot.
129-
130-
However, the path to create and use can depend on the container engine that you use.
131-
132-
1. Generate the CDI specification file:
133+
In most cases, restarting the service should be sufficient to trigger the (re)generation
134+
of CDI specifications:
133135

134-
```console
135-
$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
136-
```
137-
138-
The sample command uses `sudo` to ensure that the file at `/var/run/cdi/nvidia.yaml` is created.
139-
You can omit the `--output` argument to print the generated specification to `STDOUT`.
140-
141-
*Example Output*
142-
143-
```output
144-
INFO[0000] Auto-detected mode as "nvml"
145-
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
146-
INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1
147-
INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128
148-
INFO[0000] Using driver version xxx.xxx.xx
149-
...
150-
```
151-
152-
1. (Optional) Check the names of the generated devices:
136+
```console
137+
$ sudo systemctl restart nvidia-cdi-refresh.service
138+
```
153139

154-
```console
155-
$ nvidia-ctk cdi list
156-
```
140+
Running:
157141

158-
The following example output is for a machine with a single GPU that does not support MIG.
142+
```console
143+
$ nvidia-ctk --debug cdi list
144+
```
145+
will show a list of available CDI Devices as well as any errors that may have
146+
occurred when loading CDI Specifications from `/etc/cdi` or `/var/run/cdi`.
159147

160-
```output
161-
INFO[0000] Found 9 CDI devices
162-
nvidia.com/gpu=all
163-
nvidia.com/gpu=0
164-
```
148+
### Manual CDI Specification Generation
165149

166-
```{important}
167-
You must generate a new CDI specification after any of the following changes:
150+
As of NVIDIA Container Toolkit `v1.18.0`, the recommended mechanism to regenerate CDI specifications is to restart the `nvidia-cdi-refresh.service`:
168151

169-
- You change the device or CUDA driver configuration.
170-
- You use a location such as `/var/run/cdi` that is cleared on boot.
152+
```console
153+
$ sudo systemctl restart nvidia-cdi-refresh.service
154+
```
171155

172-
A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
156+
If this does not work, or more flexibility is required, the `nvidia-ctk cdi generate` command
157+
can be used directly:
173158

174-
**Note**: As of NVIDIA Container Toolkit v1.18.0, the automatic CDI refresh service handles most of these scenarios automatically.
159+
```console
160+
$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
175161
```
176162

177163
## Running a Workload with CDI

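The updated instructions describe a healthy setup as one where both `nvidia-cdi-refresh` units are enabled and a specification exists at `/var/run/cdi/nvidia.yaml`. The following is a minimal sketch of how that check might be scripted; it is not part of the commit. The function names (`unit_enabled`, `spec_state`, `cdi_refresh_report`) and the `CDI_SPEC` variable are hypothetical, and the sketch assumes `systemctl is-enabled --quiet` is available on the host.

```shell
#!/bin/sh
# Sketch only: summarize the state described in the CDI refresh instructions.
# Assumption: unit names and the spec path are those quoted in the diff above.
# CDI_SPEC is a hypothetical override for the spec location.

CDI_SPEC="${CDI_SPEC:-/var/run/cdi/nvidia.yaml}"

# Print "ok" if systemd reports the unit as enabled, "not-enabled" otherwise
# (including when systemctl is missing or the unit is unknown).
unit_enabled() {
    if systemctl is-enabled --quiet "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "not-enabled"
    fi
}

# Print "present" if the file exists and is non-empty, "missing" otherwise.
spec_state() {
    if [ -s "$1" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Print one line per unit plus one line for the generated specification.
cdi_refresh_report() {
    echo "nvidia-cdi-refresh.path: $(unit_enabled nvidia-cdi-refresh.path)"
    echo "nvidia-cdi-refresh.service: $(unit_enabled nvidia-cdi-refresh.service)"
    echo "$CDI_SPEC: $(spec_state "$CDI_SPEC")"
}
```

After sourcing the file, `cdi_refresh_report` prints the three summary lines. If a unit shows `not-enabled` or the specification shows `missing`, the `systemctl enable --now` and `systemctl restart nvidia-cdi-refresh.service` commands from the instructions are the next step.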