Describe the bug
We upgraded to the latest nvidia-gpu-operator version, 25.10.1. This sets the nvidia-mig-manager image to nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.13.1.
Because generating the CDI specification file via cdi generate through the system service consistently failed after a kernel update and the subsequent reboot, we set about finding the cause.
Ultimately, we discovered that nvidia-mig-manager > v0.12.x fails at startup, or whenever the mig-config label is changed:
nvidia-mig-manager IS_HOST_DRIVER=true
nvidia-mig-manager NVIDIA_DRIVER_ROOT=/
nvidia-mig-manager DRIVER_ROOT_CTR_PATH=/host
nvidia-mig-manager NVIDIA_DEV_ROOT=/
nvidia-mig-manager DEV_ROOT_CTR_PATH=/host
nvidia-mig-manager WITH_SHUTDOWN_HOST_GPU_CLIENTS=true
nvidia-mig-manager Starting nvidia-mig-manager
nvidia-mig-manager W0311 11:57:32.053406 1658603 client_config.go:667] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Updating to MIG config: config13"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.device-plugin=true'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.dcgm=true'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.nvsm=true'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Current value of 'nvidia.com/mig.config.state=success'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
nvidia-mig-manager /usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/nvidia/mig-manager/nvidia-mig-parted)
nvidia-mig-manager /usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/nvidia/mig-manager/nvidia-mig-parted)
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Restarting any GPU clients previously shutdown on the host by restarting their systemd services"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Changing the 'nvidia.com/mig.config.state' node label to 'failed'\n"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=error msg="Error: failed to validate MIG configuration: exit status 1"
nvidia-mig-manager time="2026-03-11T11:57:32Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
The reported error is “Error: failed to validate MIG configuration: exit status 1”, and it is immediately preceded by glibc errors: the nvidia-mig-parted binary requires GLIBC_2.32 and GLIBC_2.34, which the host's /lib64/libc.so.6 does not provide.
However, if you exec into the nvidia-mig-manager pod and apply the config manually from the corresponding config file...
/mig-parted-config # nvidia-mig-parted apply -f config.yaml -c config13
MIG configuration applied successfully
/mig-parted-config # nvidia-mig-parted assert -f config.yaml -c config13
Selected MIG configuration currently applied
... this succeeds, and the CDI generate and CUDA validator then complete successfully.
We then downgraded only nvidia-mig-manager to v0.12.3 and confirmed that with this version everything still works as desired:
nvidia-mig-manager IS_HOST_DRIVER=true
nvidia-mig-manager NVIDIA_DRIVER_ROOT=/
nvidia-mig-manager DRIVER_ROOT_CTR_PATH=/host
nvidia-mig-manager NVIDIA_DEV_ROOT=/
nvidia-mig-manager DEV_ROOT_CTR_PATH=/host
nvidia-mig-manager WITH_SHUTDOWN_HOST_GPU_CLIENTS=true
nvidia-mig-manager Starting nvidia-mig-manager
nvidia-mig-manager W0311 11:55:17.004069 1655920 client_config.go:667] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-mig-manager time="2026-03-11T11:55:17Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
nvidia-mig-manager time="2026-03-11T11:55:17Z" level=info msg="Updating to MIG config: config13"
nvidia-mig-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
nvidia-mig-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
nvidia-mig-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
nvidia-mig-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
nvidia-mig-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
nvidia-mig-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
nvidia-mig-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
nvidia-mig-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
nvidia-mig-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
nvidia-mig-manager Current value of 'nvidia.com/gpu.deploy.nvsm=true'
nvidia-mig-manager Asserting that the requested configuration is present in the configuration file
nvidia-mig-manager Selected MIG configuration is valid
nvidia-mig-manager Getting current value of the 'nvidia.com/mig.config.state' node label
nvidia-mig-manager Current value of 'nvidia.com/mig.config.state=failed'
nvidia-mig-manager Checking if the selected MIG config is currently applied or not
nvidia-mig-manager Selected MIG configuration currently applied
nvidia-mig-manager Restarting any GPU clients previously shutdown on the host by restarting their systemd services
nvidia-mig-manager Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
nvidia-mig-manager node/cbd00199.lan.xxx.de not labeled
nvidia-mig-manager Changing the 'nvidia.com/mig.config.state' node label to 'success'
nvidia-mig-manager node/cbd00199.lan.xxx.de labeled
nvidia-mig-manager time="2026-03-11T11:55:19Z" level=info msg="Successfully updated to MIG config: config13"
nvidia-mig-manager time="2026-03-11T11:55:19Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
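As a temporary workaround, the downgrade can be pinned through the GPU Operator Helm values. A sketch assuming the standard chart keys (the exact tag may carry a platform suffix such as -ubuntu20.04 depending on the release):

```yaml
# values.yaml fragment for the NVIDIA GPU Operator Helm chart:
# pin nvidia-mig-manager to the last known-good version
migManager:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: k8s-mig-manager
  version: v0.12.3
```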
We tested this with drivers 570.211.01 and 580.126.20 and toolkit versions 1.18.0 and 1.18.2, and always got the same result with the new nvidia-mig-manager image.
Does anyone have an idea which change in the current nvidia-mig-manager version caused this, or what the exact cause is? I don't think it has anything to do with the ConfigMap, as it looks exactly as it should, and the manual apply in the pod works without any problems. What exactly is being validated when “Error: failed to validate MIG configuration: exit status 1” is reported, and why does it fail?
To Reproduce
For example, use driver 570.211.01 with nvidia-toolkit version 1.18.0 (GPU operator version 25.10.1), store the MIG specification for a host node with an A100 GPU, and check whether nvidia-mig-manager v0.13.1 can successfully apply the configuration to the node.
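For completeness, a hypothetical shape of such a MIG specification in the nvidia-mig-parted config file format (the config13 name matches the logs above; the device and profile values are placeholders, since the actual ConfigMap contents are site-specific):

```yaml
version: v1
mig-configs:
  config13:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
```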
Expected behavior
The MIG configuration is applied successfully with the new nvidia-mig-manager version as well.
Environment (please provide the following information):
- mig-parted version: 0.13.1
- Host OS: RHEL 8.10
- Kernel Version: 4.18.0-553.107.1.el8_10.x86_64
- NVIDIA Driver Version: 570.211.01
- GPU Model(s): A100