[question] Possible flaw when using NRI to inject GPUs into containers #2809

**What happened:**

When GPUs are injected into a container through NRI, the container can lose its device cgroup permissions for them.
In production we use an NRI-based injection similar to `koordinator`: we set the `NVIDIA_VISIBLE_DEVICES` environment variable and let `nvidia-container-runtime` handle the rest.

However, something strange happened: in the device cgroup of an injected container, the GPU device numbers would suddenly disappear from `devices.list`. We traced the trigger to a systemd reload. As soon as `systemctl daemon-reload` runs on the node, the device cgroup is reset and the GPU device permissions are lost.
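To make the symptom easy to check, here is a minimal sketch that decides whether the content of a cgroup v1 `devices.list` still allows an NVIDIA device. It assumes the usual `type major:minor access` line format and that the `/dev/nvidia*` nodes use character major 195; the function name `hasGPUAccess` and the sample contents are illustrative, not taken from our nodes.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nvidiaMajor is the character-device major assumed for /dev/nvidia* nodes.
const nvidiaMajor = "195"

// hasGPUAccess reports whether the given devices.list content contains an
// entry that allows an NVIDIA character device, or a wildcard covering it.
func hasGPUAccess(devicesList string) bool {
	sc := bufio.NewScanner(strings.NewReader(devicesList))
	for sc.Scan() {
		// Each line is expected to look like "c 195:0 rw" or "a *:* rwm".
		fields := strings.Fields(sc.Text())
		if len(fields) != 3 {
			continue
		}
		typ, dev := fields[0], fields[1]
		if typ != "c" && typ != "a" {
			continue
		}
		major, _, ok := strings.Cut(dev, ":")
		if !ok {
			continue
		}
		if major == "*" || major == nvidiaMajor {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative contents before and after `systemctl daemon-reload`.
	before := "c 1:3 rwm\nc 195:0 rw\nc 195:255 rw\n"
	after := "c 1:3 rwm\nc 1:5 rwm\n"
	fmt.Println(hasGPUAccess(before), hasGPUAccess(after))
}
```

Feeding it the container's `devices.list` before and after a reload should show the permission flipping from present to absent on an affected node.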

Further investigation showed that the container's device permissions had not been saved under `/run/systemd/transient/`, specifically in `/run/systemd/transient/*/50-DeviceAllow.conf`. This file should be written by [runc](https://github.com/opencontainers/runc/blob/f047c6b0f88f3299e71d6be7508dfdaa824f2117/vendor/github.com/opencontainers/cgroups/systemd/common.go#L138) after it reads the OCI spec. With the `device-plugin` mechanism, [kubelet](https://github.com/kubernetes/kubernetes/blob/5adfc48e19d5fbc4af5b0d31aeb9f0c13c01cf5d/pkg/kubelet/kuberuntime/kuberuntime_container.go#L440) writes the device information into the OCI spec, which lets runc persist it under `/run/systemd/transient`. The NRI injection path has no equivalent step.

This issue only occurs when `nvidia-container-runtime` runs in `legacy` mode; using `cdi` mode avoids it.

**What you expected to happen:**

I'm not sure whether koordlet has the same issue, and it is not easy for me to set up a test environment. Judging from the code, there seems to be no logic that supplements the OCI spec, so the problem can very likely be reproduced there.

It may be necessary to call `adjustment.AddDevice` in the NRI injection logic, taking over the role of kubelet's device-plugin mechanism in filling in the device information in the OCI spec. Otherwise, a single `systemctl daemon-reload` causes all GPU permissions on the entire node to be lost.

**Environment:**

Could someone with a suitable environment try to reproduce this? You can check whether a pod's GPU device permissions have been persisted in systemd by looking under `/run/systemd/transient/`. The prerequisite is that `nvidia-container-runtime` is set to legacy mode, that is, it uses the `NVIDIA Container Runtime Hook`.
