[Bug] nvmlInit permanent failure in Burstable QoS pods due to pthread_once poisoning

Description
In Kubernetes 1.27+, Burstable Pods (where limits do not equal requests) can enter a resize: InProgress state during startup. When this happens, Kubernetes temporarily locks the GPU device files (/dev/nvidia*).
Related issue: https://github.com/Project-HAMi/HAMi/issues/1704

The Problem:
HAMi-core uses pthread_once to initialize its GPU maps. A pthread_once init routine returns void, so pthread_once cannot tell whether initialization succeeded: if the first attempt fails (because the device is locked), the once-flag is still marked as "Done." When the pod retries a second later, HAMi-core treats initialization as complete and skips the setup. The GPU then stays broken ("Unknown Error") for the entire life of the pod.

Root Cause Analysis
1. First Call: nvmlInit is called while the device is locked.
2. Driver Failure: The real NVIDIA library returns NVML_ERROR_UNKNOWN.
3. State Poisoning: Even though the call failed, pthread_once marks the internal HAMi setup as "Finished."
4. Permanent Failure: Every later GPU call fails because the internal HAMi maps were never actually created.
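The sequence above can be reproduced outside HAMi-core with a few lines of C. This is a minimal sketch, not HAMi-core code: `device_locked`, `maps_ready`, `init_maps`, `fake_init`, and `demo_poisoning` are hypothetical stand-ins chosen for illustration. It only shows that pthread_once never re-runs a routine that has already returned, even if that routine did no useful work.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins; none of these names come from HAMi-core. */
static pthread_once_t once_flag = PTHREAD_ONCE_INIT;
static int device_locked = 1;  /* simulates /dev/nvidia* being unavailable */
static int maps_ready = 0;     /* simulates the internal GPU maps */
static int init_attempts = 0;  /* counts how often the init routine ran */

/* Init routine: returns void, so it cannot report failure to pthread_once. */
static void init_maps(void) {
    init_attempts++;
    if (device_locked) {
        return;  /* setup failed, but the once-flag is consumed anyway */
    }
    maps_ready = 1;
}

/* Returns 0 if the maps are usable, -1 otherwise. */
static int fake_init(void) {
    pthread_once(&once_flag, init_maps);
    return maps_ready ? 0 : -1;
}

/* Walks through the failure sequence; returns the number of init attempts. */
static int demo_poisoning(void) {
    int first = fake_init();   /* device locked: fails */
    device_locked = 0;         /* device becomes available again */
    int second = fake_init();  /* still fails: init_maps is not re-run */
    assert(first == -1 && second == -1);
    return init_attempts;
}
```

Calling `demo_poisoning()` from a `main` function (compiled with `-pthread`) runs `init_maps` exactly once, and the second call keeps failing even though the simulated device is no longer locked.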

Proposed Fix
Initialization should be marked as "Done" only if the NVIDIA driver actually returns NVML_SUCCESS. Replacing the post-init pthread_once with a manual flag guarded by a mutex allows a failed initialization to be retried on the next call.

Suggested Code Fix
```
// 1. Add a manual flag and a lock at the top of the file
static pthread_mutex_t g_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int g_post_init_complete = 0;
```

```
// 2. Update the nvmlInit function
nvmlReturn_t nvmlInit(void) {
    // Keep preInit as is (it just sets up library paths)
    pthread_once(&init_virtual_map_pre_flag, (void(*)(void))nvml_preInit);

    // Call the real NVIDIA driver
    nvmlReturn_t res = NVML_OVERRIDE_CALL(nvml_library_entry, nvmlInit_v2);

    // FIX: Only run postInit if the driver actually succeeded!
    if (res == NVML_SUCCESS) {
        pthread_mutex_lock(&g_init_mutex);
        if (!g_post_init_complete) {
            nvml_postInit();
            g_post_init_complete = 1;
        }
        pthread_mutex_unlock(&g_init_mutex);
    }
    
    return res;
}
```
(Note: The same logic should be applied to nvmlInit_v2 and nvmlInitWithFlags in the same file.)
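One way to avoid copying the guard into each entry point is to route every driver result through a shared helper. The following is a rough sketch under stated assumptions, not a patch against HAMi-core: `sketch_return_t`, `sketch_post_init`, `finish_init`, `sketch_init_v2`, and `sketch_init_with_flags` are hypothetical names, and the driver result is passed in as a parameter so the example compiles without the NVML headers.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for nvmlReturn_t so the sketch compiles on its own. */
typedef enum { SKETCH_SUCCESS = 0, SKETCH_ERROR_UNKNOWN = 999 } sketch_return_t;

static pthread_mutex_t g_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int g_post_init_complete = 0;
static int g_post_init_runs = 0;  /* demonstration only: counts post-init runs */

/* Stand-in for nvml_postInit(). */
static void sketch_post_init(void) {
    g_post_init_runs++;
}

/* Hypothetical helper: run post-init at most once, and only after the
 * driver reported success. A failed call leaves the flag untouched, so
 * a later call can still complete the setup. */
static sketch_return_t finish_init(sketch_return_t driver_result) {
    if (driver_result != SKETCH_SUCCESS) {
        return driver_result;
    }
    pthread_mutex_lock(&g_init_mutex);
    if (!g_post_init_complete) {
        sketch_post_init();
        g_post_init_complete = 1;
    }
    pthread_mutex_unlock(&g_init_mutex);
    return driver_result;
}

/* Each entry point forwards its own driver result through the helper. */
static sketch_return_t sketch_init_v2(sketch_return_t driver_result) {
    return finish_init(driver_result);
}

static sketch_return_t sketch_init_with_flags(unsigned int flags,
                                              sketch_return_t driver_result) {
    (void)flags;  /* the real function would pass flags to the driver */
    return finish_init(driver_result);
}

/* Failed call first, then two successful calls; returns post-init run count. */
static int demo_shared_guard(void) {
    sketch_return_t failed = sketch_init_v2(SKETCH_ERROR_UNKNOWN);
    assert(failed == SKETCH_ERROR_UNKNOWN && g_post_init_runs == 0);
    sketch_init_with_flags(0, SKETCH_SUCCESS);
    sketch_init_v2(SKETCH_SUCCESS);
    return g_post_init_runs;
}
```

In this sketch a failed first call does not consume the flag, and post-init still runs only once no matter which entry point succeeds first.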

Environment Information
Component: HAMi-core (NVML Hooking)

Kubernetes Version: 1.27 or higher

Pod Configuration: QoS: Burstable

Error Seen: nvmlInit returns error code 1 (NVML_ERROR_UNINITIALIZED) or 999 (NVML_ERROR_UNKNOWN).

[Bug] nvmlInit permanent failure in Burstable QoS pods due to pthread_once poisoning #167
