[Bug] nvmlInit permanent failure in Burstable QoS pods due to pthread_once poisoning #167

@maishivamhoo123

Description

In Kubernetes 1.27+, Burstable Pods (where limits do not equal requests) can enter a resize: InProgress state during startup. While the resize is in progress, Kubernetes temporarily locks the GPU device files (/dev/nvidia*).
Related issue: Project-HAMi/HAMi#1704

The Problem:
HAMi-core uses pthread_once to initialize its GPU maps. pthread_once runs its routine exactly once and records completion whether or not the work inside the routine succeeded. If the first nvmlInit attempt fails because the device is locked, HAMi-core still marks the initialization as "Done." When the pod retries a second later, HAMi-core treats setup as already finished and skips it. The GPU then stays broken ("Unknown Error") for the entire life of the pod.

Root Cause Analysis
First Call: nvmlInit is called while the device is locked.

Driver Failure: The real NVIDIA library returns NVML_ERROR_UNKNOWN.

State Poisoning: Even though it failed, pthread_once marks the internal HAMi setup as "Finished."

Permanent Failure: Every future call to the GPU fails because the internal HAMi maps were never actually created.
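
The four steps above can be reproduced in isolation. The following is a minimal sketch, not HAMi-core code: `device_locked`, `maps_ready`, `build_maps`, and `fake_init` are hypothetical stand-ins for the locked device files, the internal maps, the post-init routine, and the hooked nvmlInit.

```c
#include <pthread.h>

/* Hypothetical stand-ins; none of these names come from HAMi-core. */
static pthread_once_t once_flag = PTHREAD_ONCE_INIT;
static int device_locked = 1;  /* simulates the locked /dev/nvidia* files */
static int maps_ready = 0;     /* simulates the internal GPU maps */

/* Simulated setup that needs the device. It cannot report failure,
 * because a pthread_once routine returns void. */
static void build_maps(void) {
    if (!device_locked) {
        maps_ready = 1;
    }
}

/* Simulated init entry point guarded by pthread_once.
 * Returns 0 when the maps exist, -1 otherwise. */
static int fake_init(void) {
    /* build_maps runs on the first call only, even if it did nothing useful. */
    pthread_once(&once_flag, build_maps);
    return maps_ready ? 0 : -1;
}
```

Calling `fake_init()` while `device_locked` is 1 and then again after clearing it returns -1 both times: the once flag was consumed by the failed first attempt, so `build_maps` never runs again.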

Proposed Fix
We must mark the initialization as "Done" only if the NVIDIA driver actually returns NVML_SUCCESS. We should replace pthread_once with a manual flag guarded by a mutex, so that a failed initialization can be retried on the next call.

Suggested Code Fix
// 1. Add a manual flag and a lock at the top of the file

static pthread_mutex_t g_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int g_post_init_complete = 0;

// 2. Update the nvmlInit function

nvmlReturn_t nvmlInit(void) {
    // Keep preInit as is (it just sets up library paths)
    pthread_once(&init_virtual_map_pre_flag, (void(*)(void))nvml_preInit);

    // Call the real NVIDIA driver
    nvmlReturn_t res = NVML_OVERRIDE_CALL(nvml_library_entry, nvmlInit_v2);

    // FIX: Only run postInit if the driver actually succeeded!
    if (res == NVML_SUCCESS) {
        pthread_mutex_lock(&g_init_mutex);
        if (!g_post_init_complete) {
            nvml_postInit();
            g_post_init_complete = 1;
        }
        pthread_mutex_unlock(&g_init_mutex);
    }
    
    return res;
}

(Note: The same logic should be applied to nvmlInit_v2 and nvmlInitWithFlags in the same file.)
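
Since three entry points need the same guard, one option is to factor it into a shared helper. This is only a sketch under assumptions: `status_t`, `STATUS_OK`, `post_init`, and `finish_init` are hypothetical stand-ins (for nvmlReturn_t, NVML_SUCCESS, nvml_postInit, and a new helper) so the example compiles without the NVML headers.

```c
#include <pthread.h>

typedef int status_t;   /* stand-in for nvmlReturn_t */
#define STATUS_OK 0     /* stand-in for NVML_SUCCESS */

static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int post_init_done = 0;
static int post_init_calls = 0;  /* instrumentation for this example only */

/* Stand-in for nvml_postInit: just counts how often it runs. */
static void post_init(void) {
    post_init_calls++;
}

/* Hypothetical shared helper: each init entry point would pass the
 * driver's return code through this before returning it. */
static status_t finish_init(status_t res) {
    if (res != STATUS_OK) {
        return res;  /* driver failed: leave the flag unset so a retry can succeed */
    }
    pthread_mutex_lock(&init_mutex);
    if (!post_init_done) {
        post_init();
        post_init_done = 1;
    }
    pthread_mutex_unlock(&init_mutex);
    return res;
}
```

With this shape, each of the three functions would reduce to the preInit call, the real driver call, and `return finish_init(res);`, which keeps the retry logic in one place.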

Environment Information
Component: HAMi-core (NVML Hooking)

Kubernetes Version: 1.27 or higher

Pod Configuration: QoS: Burstable

Error Seen: nvmlInit returns error code 1 (NVML_ERROR_UNINITIALIZED) or 999 (NVML_ERROR_UNKNOWN).
