[Bug] nvmlInit permanent failure in Burstable QoS pods due to pthread_once poisoning #167
Description
In Kubernetes 1.27+, Burstable Pods (where limits do not equal requests) can enter a resize: InProgress state during startup. When this happens, Kubernetes temporarily locks the GPU device files (/dev/nvidia*).
Related issue: Project-HAMi/HAMi#1704
The Problem:
HAMi-core uses pthread_once to initialize its GPU maps. pthread_once runs the init routine exactly once and has no notion of failure, so if the first attempt to start the GPU fails (because the device is locked), the initialization is still marked as "Done." When the pod tries again a second later, pthread_once sees the flag as already consumed and skips the setup. The GPU then stays broken ("Unknown Error") for the entire life of the pod.
Root Cause Analysis
First Call: nvmlInit is called while the device is locked.
Driver Failure: The real NVIDIA library returns NVML_ERROR_UNKNOWN.
State Poisoning: Even though the driver call failed, pthread_once marks the internal HAMi setup as "Finished."
Permanent Failure: Every future call to the GPU fails because the internal HAMi maps were never actually created.
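The sequence above can be reproduced in isolation. The following is a minimal sketch, not HAMi-core code: every name in it (fake_init, setup_maps, g_device_locked, and so on) is a hypothetical stand-in, and the return value 999 only mimics NVML_ERROR_UNKNOWN. It shows that once pthread_once has run the setup routine, a later call never runs it again, even if the first run could not build anything.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these names exist in HAMi-core. */
static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static bool g_device_locked = true;  /* simulates /dev/nvidia* being unavailable */
static bool g_maps_ready = false;    /* simulates the internal GPU maps */
static int g_setup_attempts = 0;     /* how many times setup actually ran */

/* One-time setup that can fail silently: maps are built only if the device is usable. */
static void setup_maps(void) {
    g_setup_attempts++;
    if (!g_device_locked) {
        g_maps_ready = true;
    }
}

/* Mimics an init entry point guarded by pthread_once. Returns 0 on success, 999 otherwise. */
int fake_init(void) {
    pthread_once(&g_once, setup_maps);  /* runs setup_maps at most once, success or not */
    return g_maps_ready ? 0 : 999;
}

/* Simulates the device becoming available after the resize completes. */
void fake_unlock_device(void) { g_device_locked = false; }

int fake_setup_attempts(void) { return g_setup_attempts; }
```

Calling fake_init(), then fake_unlock_device(), then fake_init() again still yields the error value, and fake_setup_attempts() stays at 1: the retry never reaches setup_maps.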
Proposed Fix
We must only mark the initialization as "Done" if the NVIDIA driver actually returns NVML_SUCCESS. We should replace pthread_once with a manual flag protected by a mutex, so that a failed initialization can be retried on the next call.
Suggested Code Fix
// 1. Add a manual flag and a lock at the top of the file
static pthread_mutex_t g_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int g_post_init_complete = 0;

// 2. Update the nvmlInit function
nvmlReturn_t nvmlInit(void) {
    // Keep preInit as is (it just sets up library paths)
    pthread_once(&init_virtual_map_pre_flag, (void(*)(void))nvml_preInit);

    // Call the real NVIDIA driver
    nvmlReturn_t res = NVML_OVERRIDE_CALL(nvml_library_entry, nvmlInit_v2);

    // FIX: Only run postInit if the driver actually succeeded!
    if (res == NVML_SUCCESS) {
        pthread_mutex_lock(&g_init_mutex);
        if (!g_post_init_complete) {
            nvml_postInit();
            g_post_init_complete = 1;
        }
        pthread_mutex_unlock(&g_init_mutex);
    }
    return res;
}
(Note: The same logic should be applied to nvmlInit_v2 and nvmlInitWithFlags in the same file.)
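One way to avoid copying the guard into each entry point is a small shared helper. The sketch below is self-contained and uses mock stand-ins only: status_t, driver_init, driver_init_with_flags, post_init, finish_init, and the hooked_* functions are all hypothetical names, not HAMi-core or NVML symbols. It assumes the post-init step is identical for every entry point, which would need to be confirmed against the real file.

```c
#include <pthread.h>

/* Hypothetical stand-ins for illustration; not HAMi-core or NVML definitions. */
typedef int status_t;
#define STATUS_SUCCESS 0
#define STATUS_UNKNOWN 999

static pthread_mutex_t g_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int g_post_init_complete = 0;
static int g_post_init_runs = 0;  /* counts how often post_init actually ran */
static int g_driver_ready = 0;    /* mock: 0 while the device is locked */

/* Mock of the one-time map setup. */
static void post_init(void) { g_post_init_runs++; }

/* Mock driver calls: fail until the device is marked ready. */
static status_t driver_init(void) {
    return g_driver_ready ? STATUS_SUCCESS : STATUS_UNKNOWN;
}
static status_t driver_init_with_flags(unsigned int flags) {
    (void)flags;
    return driver_init();
}

/* Shared guard: run post_init once, and only after a successful driver call. */
static status_t finish_init(status_t res) {
    if (res != STATUS_SUCCESS) {
        return res;  /* leave the flag untouched so a later call can retry */
    }
    pthread_mutex_lock(&g_init_mutex);
    if (!g_post_init_complete) {
        post_init();
        g_post_init_complete = 1;
    }
    pthread_mutex_unlock(&g_init_mutex);
    return res;
}

/* Each hooked entry point forwards its driver result through the same guard. */
status_t hooked_init(void) { return finish_init(driver_init()); }
status_t hooked_init_with_flags(unsigned int flags) {
    return finish_init(driver_init_with_flags(flags));
}

void set_driver_ready(int ready) { g_driver_ready = ready; }
int post_init_runs(void) { return g_post_init_runs; }
```

With this shape, a failed first call leaves g_post_init_complete at 0, and whichever entry point succeeds first performs the setup exactly once.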
Environment Information
Component: HAMi-core (NVML Hooking)
Kubernetes Version: 1.27 or higher
Pod Configuration: QoS: Burstable
Error Seen: nvmlInit returns error code 1 (NVML_ERROR_UNINITIALIZED) or 999 (NVML_ERROR_UNKNOWN).