Description
What happened:
When GPUs are injected into containers through NRI, the containers can lose their device cgroup permissions for the GPUs.
In production, we implement GPU injection through NRI in a way similar to koordinator: we set the NVIDIA_VISIBLE_DEVICES environment variable and let nvidia-container-runtime handle the rest.
However, we observed something strange: the device numbers of the GPUs would suddenly disappear from devices.list in the injected container's device cgroup. Investigation showed that the trigger was a systemd reload. As soon as systemctl daemon-reload ran on the node, the device cgroup was reset and the GPU device permissions were lost.
Further investigation showed that the container's device permissions had not been persisted under /run/systemd/transient/, specifically in /run/systemd/transient/*/50-DeviceAllow.conf. runc should write this file after reading the OCI spec. With the device-plugin mechanism, kubelet puts the device information into the OCI spec, so runc can write it to /run/systemd/transient. The NRI mechanism has no equivalent logic.
This issue only occurs when nvidia-container-runtime runs in legacy mode; using CDI mode avoids it.
What you expected to happen:
I'm not sure whether koordlet has the same issue, and it is not easy for me to set up a test environment. Judging from the code, there appears to be no logic that supplements the OCI spec, so the problem can very likely be reproduced.
It may be necessary to call adjustment.AddDevice in the NRI injection logic, taking over the role of kubelet's device-plugin mechanism in adding the device information to the OCI spec. Otherwise, systemctl daemon-reload will cause GPU permissions to be lost across the entire node.
Environment:
Could someone who can easily set this up try to reproduce it? You can check whether the pod's GPU device permissions have been persisted in systemd by looking under /run/systemd/transient/. The prerequisite is that nvidia-container-runtime is set to legacy mode, that is, it uses the NVIDIA Container Runtime Hook.