
[Performance] High latency (~1 minute) when hundreds of processes initialize libvgpu.so concurrently #1662

@shizheng163

Description


Summary

In a high-density vGPU deployment scenario, when hundreds of processes start simultaneously and initialize CUDA, each process experiences significant startup delay (up to 1 minute) due to lock contention in libvgpu.so. This severely impacts resource utilization and task throughput.

Environment

  • HAMi version: v2.8.0
  • GPU: 8x NVIDIA GPU (48GB each)
  • Kubernetes: 1.32.4
  • Node density: 40-50 Pods per node
  • Processes per Pod: 4-5 child processes using vGPU
  • Total concurrent processes: ~200-300 per node

Use Case Description

We run an inference service cluster with the following characteristics:

  • Each Pod requests a few GB of GPU memory (e.g., 2-4GB)
  • With 8 GPUs × 48GB each = 384GB total, we can run 40-50 Pods per node
  • Each Pod consumes tasks from a queue serially
  • When processing a task, the Pod spawns multiple child processes that all require GPU memory (vGPU)
  • This results in hundreds of processes initializing CUDA simultaneously on a single node

Problem

When a batch of tasks arrives, all Pods start processing and spawn child processes at roughly the same time. We observe:

  1. Startup command issued → process should start
  2. Actual process start → delayed by ~1 minute
  3. Root cause: All processes compete for the global unified_lock in /tmp/vgpulock/lock

Observed Behavior

# In container logs, we see repeated messages:
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
# ... continues for dozens of iterations
# On host, watching the lock file:
watch -n 0.5 'ls -la /tmp/vgpulock/'
# Lock file appears/disappears rapidly, showing high contention

Using LD_DEBUG to diagnose

time LD_DEBUG=libs,statistics ./our_binary

  • Dynamic library loading phase: fast (a few seconds)
  • "initialize program" phase: very slow (~1 minute)

The delay occurs during CUDA initialization inside libvgpu.so, not during library loading.

Impact

  • Process startup latency: increased from <1 s to ~60 s
  • GPU utilization: decreased due to initialization queuing
  • Task throughput: significantly reduced
  • Resource efficiency: Pods sit idle while waiting for the lock

In a queue-based workload, this means:

  • Tasks pile up in the queue
  • GPU resources sit idle while processes wait for the lock
  • Overall cluster throughput drops dramatically

Root Cause Analysis

After analyzing the source code, we identified the following bottlenecks:

1. unified_lock implementation (libvgpu/src/utils.c)

const char* unified_lock = "/tmp/vgpulock/lock";
const int retry_count = 20;

int try_lock_unified_lock() {
    int fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    int cnt = 0;
    while (fd == -1 && cnt <= retry_count) {
        LOG_MSG("unified_lock locked, waiting 1 second...");
        sleep(rand() % 5 + 1);   // Random wait 1-5 seconds!
        cnt++;
        fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    }
    // ...
}

Issues:

  • Uses file existence (O_CREAT | O_EXCL) as lock mechanism
  • Sleep time is 1-5 seconds per retry (too long)
  • The loop body runs up to 21 times (cnt <= retry_count, with cnt starting at 0) at up to 5 seconds each ≈ 105 seconds worst case
  • All containers on the node share this single lock via hostPath
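The worst case can be made explicit with a small helper (a sanity check only, using the constants from the loop above):

```c
/* Worst-case wait in try_lock_unified_lock(): the loop body executes
 * while cnt <= retry_count, i.e. retry_count + 1 = 21 times, and each
 * sleep(rand() % 5 + 1) lasts at most 5 seconds. */
static int worst_case_wait_s(int retry_count, int max_sleep_s) {
    return (retry_count + 1) * max_sleep_s;  /* cnt runs 0..retry_count */
}
```

With retry_count = 20 and a 5-second maximum sleep this yields 105 seconds, which matches the ~1 minute delays we observe once average-case sleeps are factored in.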

2. Lock is acquired unconditionally (libvgpu/src/libvgpu.c)

void postInit() {
    allocator_init();
    map_cuda_visible_devices();
    
    try_lock_unified_lock();      // Called UNCONDITIONALLY
    nvmlReturn_t res = set_task_pid();
    try_unlock_unified_lock();
    
    // env_utilization_switch is set AFTER lock is released
    env_utilization_switch = set_env_utilization_switch();
}

Key finding: The lock acquisition happens before any environment variable checks. This means:

  • GPU_CORE_UTILIZATION_POLICY=disable does NOT skip the lock
  • disablecorelimit=true does NOT reduce startup lock contention
  • Only CUDA_DISABLE_CONTROL=true bypasses this, but it disables all vGPU features
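One way this gap could be closed is an early environment check before the lock is even attempted. A minimal sketch; the variable name HAMI_SKIP_UNIFIED_LOCK is an invented placeholder, not an existing HAMi option:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical opt-out: skip the startup lock when a deployment only
 * needs memory isolation. HAMI_SKIP_UNIFIED_LOCK is an assumed name. */
static int unified_lock_skipped(void) {
    const char *v = getenv("HAMI_SKIP_UNIFIED_LOCK");
    return v != NULL && (strcmp(v, "true") == 0 || strcmp(v, "1") == 0);
}
```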

3. Slow operations while holding the lock

set_task_pid() executes expensive NVML/CUDA operations while holding the lock:

  • nvmlInit()
  • nvmlDeviceGetComputeRunningProcesses() (iterates all GPUs)
  • cuDevicePrimaryCtxRetain() (CUDA context creation)

Suggested Improvements

Short-term fixes

  1. Reduce sleep time: Change sleep(rand() % 5 + 1) to usleep((rand() % 100 + 10) * 1000) (10-109 ms)
  2. Use proper locking: Replace O_CREAT | O_EXCL with flock() which supports blocking wait
  3. Reduce retry count: 20 retries is excessive
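A minimal sketch combining the three fixes above (non-blocking flock() with millisecond backoff); the 10-attempt budget is an illustrative choice, not a current HAMi default:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Bounded non-blocking flock() with 10-109 ms random backoff.
 * Returns the open fd with the lock held, or -1 on failure/timeout.
 * The caller releases with flock(fd, LOCK_UN) and close(fd). */
static int try_lock_short_backoff(const char *path) {
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd == -1)
        return -1;
    for (int attempt = 0; attempt < 10; attempt++) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fd;                              /* lock acquired */
        usleep((rand() % 100 + 10) * 1000);         /* 10-109 ms backoff */
    }
    close(fd);
    return -1;                                      /* gave up */
}
```

Unlike the O_CREAT | O_EXCL scheme, the lock file here is persistent; the kernel releases an flock() automatically if the holder crashes, so no stale lock file can wedge later processes.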

Medium-term improvements

  1. Move slow operations outside critical section: Execute NVML/CUDA calls before acquiring lock, only update shared memory while holding lock
  2. Add configuration option: Allow users to disable the lock for memory-only isolation scenarios
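A sketch of what improvement 1 could look like. collect_gpu_state() and publish_task_pid() are hypothetical stand-ins for the slow NVML/CUDA discovery and the shared-state write that set_task_pid() currently does together under the lock; the stubs exist only to make the shape of the change concrete:

```c
/* Hypothetical stand-ins for illustration; not real HAMi symbols. */
struct gpu_state { int task_pid; };

static struct gpu_state collect_gpu_state(void) {
    /* Slow part: nvmlInit(), process enumeration, context creation.
     * Runs lock-free, so hundreds of processes can do it in parallel. */
    struct gpu_state s = { .task_pid = 0 };
    return s;
}

static void publish_task_pid(const struct gpu_state *s) {
    /* Fast part: only the shared-memory update needs the lock. */
    (void)s;
}

static void try_lock_unified_lock(void) {}   /* stubs for the sketch */
static void try_unlock_unified_lock(void) {}

void post_init_restructured(void) {
    struct gpu_state s = collect_gpu_state(); /* outside critical section */

    try_lock_unified_lock();                  /* short critical section */
    publish_task_pid(&s);
    try_unlock_unified_lock();
}
```

This keeps lock hold time proportional to one shared-memory write rather than to NVML/CUDA initialization, so contention stops compounding across hundreds of starting processes.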

Example of improved locking

// Use flock() on a persistent lock file instead of file existence
int fd = open(unified_lock, O_CREAT | O_RDWR, 0666);
if (fd != -1) {
    if (flock(fd, LOCK_EX) == 0) {  // kernel blocks the caller; no sleep/retry polling
        // Critical section
        flock(fd, LOCK_UN);
    }
    close(fd);
}

Related Issues

Questions

  1. Is there any planned improvement for the lock mechanism in upcoming releases?
  2. Is there a configuration option we might have missed that could help our scenario?
  3. Would a PR with the suggested improvements be welcome?

Thank you for your time and for the great work on HAMi!


Labels: kind/bug (Something isn't working)