
[question] Perhaps there is some flaw when using NRI to inject into the GPU #2809

@LavenderQAQ

Description


What happened:

When using NRI to inject GPU devices into a container, device cgroup permissions can be lost.
In our production environment, we implement NRI injection similarly to koordinator: we set the NVIDIA_VISIBLE_DEVICES environment variable and let nvidia-container-runtime handle the rest.
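The injection described above can be sketched as follows. The real plugin would use `ContainerAdjustment.AddEnv` from `github.com/containerd/nri/pkg/api`; the types below are minimal local stand-ins with illustrative names, not the exact upstream API.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyValue and ContainerAdjustment are minimal stand-ins for the
// corresponding types in github.com/containerd/nri/pkg/api.
type KeyValue struct {
	Key, Value string
}

type ContainerAdjustment struct {
	Env []KeyValue
}

// AddEnv mirrors the upstream helper that appends an environment variable
// to the container adjustment.
func (a *ContainerAdjustment) AddEnv(key, value string) {
	a.Env = append(a.Env, KeyValue{Key: key, Value: value})
}

// injectVisibleDevices mimics the koordinator-style injection: it only sets
// NVIDIA_VISIBLE_DEVICES and leaves device setup to nvidia-container-runtime.
// Crucially, it does NOT touch the OCI spec's linux.resources.devices list,
// which is why runc never persists a DeviceAllow= rule for the GPUs.
func injectVisibleDevices(adjust *ContainerAdjustment, gpuIndices []int) {
	ids := make([]string, len(gpuIndices))
	for i, idx := range gpuIndices {
		ids[i] = fmt.Sprint(idx)
	}
	adjust.AddEnv("NVIDIA_VISIBLE_DEVICES", strings.Join(ids, ","))
}

func main() {
	adjust := &ContainerAdjustment{}
	injectVisibleDevices(adjust, []int{0, 1})
	fmt.Println(adjust.Env[0].Key + "=" + adjust.Env[0].Value)
	// prints: NVIDIA_VISIBLE_DEVICES=0,1
}
```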

However, something strange happened: in the injected container's device cgroup, the device numbers of the GPU cards would suddenly disappear from devices.list. Investigation showed this was triggered by a systemd reload: once the node executed systemctl daemon-reload, the device cgroup was reset and the GPU device permissions were lost.

Further investigation revealed that the container's device permission configuration was not persisted under /run/systemd/transient/ — specifically, in /run/systemd/transient/*/50-DeviceAllow.conf. These files are written by runc based on the OCI spec. With the device-plugin mechanism, kubelet writes the devices into the OCI spec, so runc persists them to /run/systemd/transient; this logic is missing from the NRI path.
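For reference, a healthy drop-in written by runc's systemd cgroup driver looks roughly like the fragment below (the unit name and device numbers are illustrative; major 195 is the usual NVIDIA character-device major):

```ini
# /run/systemd/transient/<container-unit>.d/50-DeviceAllow.conf (illustrative)
[Scope]
DeviceAllow=
DeviceAllow=char-195:0 rwm
DeviceAllow=char-195:255 rwm
```

When the GPU devices are absent from the OCI spec, no such DeviceAllow= rules exist for them, so daemon-reload re-applies a device policy that excludes the GPUs.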

This issue only occurs when nvidia-container-runtime is in legacy mode; CDI mode avoids it.

What you expected to happen:

I'm not sure whether koordlet has the same issue — it's not convenient for me to set up an environment to test. Judging from the code, there appears to be no logic to supplement the OCI spec, so it can very likely be reproduced.

It may be necessary to call adjustment.AddDevice in the NRI injection logic, mirroring what kubelet's device-plugin mechanism does, so that the device information is completed in the OCI spec. Otherwise, systemctl daemon-reload will cause all GPU permissions on the entire node to be lost.
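A sketch of the proposed fix, under the same caveats as before: the real plugin would call `ContainerAdjustment.AddDevice` from `github.com/containerd/nri/pkg/api`, and the device majors/minors should be discovered from the node (e.g. via `ls -l /dev/nvidia*`) rather than hard-coded. The types and names below are illustrative stand-ins.

```go
package main

import "fmt"

// LinuxDevice mirrors the shape of the OCI/NRI device type; the field names
// here are illustrative, not the exact upstream API.
type LinuxDevice struct {
	Path         string
	Type         string // "c" for character device
	Major, Minor int64
}

type ContainerAdjustment struct {
	Devices []*LinuxDevice
}

// AddDevice stands in for api.ContainerAdjustment.AddDevice.
func (a *ContainerAdjustment) AddDevice(d *LinuxDevice) {
	a.Devices = append(a.Devices, d)
}

// addGPUDevices records the GPU device nodes in the adjustment so that the
// final OCI spec carries them. With the devices present in the spec, runc's
// systemd cgroup driver emits matching DeviceAllow= rules into the transient
// unit, and the permissions survive systemctl daemon-reload.
// Major 195 is the typical NVIDIA character-device major; verify on your hosts.
func addGPUDevices(adjust *ContainerAdjustment, gpuIndices []int64) {
	for _, idx := range gpuIndices {
		adjust.AddDevice(&LinuxDevice{
			Path:  fmt.Sprintf("/dev/nvidia%d", idx),
			Type:  "c",
			Major: 195,
			Minor: idx,
		})
	}
}

func main() {
	adjust := &ContainerAdjustment{}
	addGPUDevices(adjust, []int64{0, 1})
	for _, d := range adjust.Devices {
		fmt.Printf("%s %s %d:%d\n", d.Path, d.Type, d.Major, d.Minor)
	}
	// prints:
	// /dev/nvidia0 c 195:0
	// /dev/nvidia1 c 195:1
}
```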

Environment:

Could someone conveniently reproduce this scenario? You can check whether the pod's GPU device permissions have been persisted by systemd by inspecting /run/systemd/transient/. The prerequisite is that nvidia-container-runtime is configured in legacy mode, i.e. using the NVIDIA Container Runtime Hook.
