nvidia-mig-manager.service consumes excessive CPU time (~42s on a DGX B300
with 8x B200 GPUs) for operations that complete in ~9s of wall time. The
root cause is 68-100 redundant NVML Init/Shutdown cycles per service run.
Each nvml.Init() triggers dlopen("libnvidia-ml.so.1") + 24 dlsym() calls
to resolve versioned API symbols — expensive on multi-GPU systems. The
overhead comes from two compounding patterns:
1. Every method on nvmlMigModeManager and nvmlMigConfigManager
independently calls Init()/Shutdown(), despite callers already
maintaining an initialized NVML instance at the command level.
2. Callers create new nvml.New() instances inside per-GPU loops,
each triggering a full Init/Shutdown cycle including version checks.
Fix by:
- Accepting nvml.Interface in constructors (aligning real constructors
with mock constructors that already accept it)
- Removing per-method Init/Shutdown from all 7 manager methods
- Hoisting manager creation out of per-device loops to create once
per command
This reduces NVML Init/Shutdown from ~100 to 1 per command, cutting CPU
time by 4.7x and dlsym calls from 572 to 45 (12.7x reduction).
Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>