[Performance] High latency (~1 minute) when hundreds of processes initialize libvgpu.so concurrently
Summary
In a high-density vGPU deployment scenario, when hundreds of processes start simultaneously and initialize CUDA, each process experiences significant startup delay (up to 1 minute) due to lock contention in libvgpu.so. This severely impacts resource utilization and task throughput.
Environment
| Component | Version/Config |
|---|---|
| HAMi version | v2.8.0 |
| GPU | 8x NVIDIA GPU (48GB each) |
| Kubernetes | 1.32.4 |
| Node density | 40-50 Pods per node |
| Processes per Pod | 4-5 child processes using vGPU |
| Total concurrent processes | ~200-300 per node |
Use Case Description
We run an inference service cluster with the following characteristics:
- Each Pod requests a few GB of GPU memory (e.g., 2-4GB)
- With 8 GPUs × 48GB each = 384GB total, we can run 40-50 Pods per node
- Each Pod consumes tasks from a queue serially
- When processing a task, the Pod spawns multiple child processes that all require GPU memory (vGPU)
- This results in hundreds of processes initializing CUDA simultaneously on a single node
Problem
When a batch of tasks arrives, all Pods start processing and spawn child processes at roughly the same time. We observe:
- Startup command issued → process should start
- Actual process start → delayed by ~1 minute
- Root cause: all processes compete for the global `unified_lock` at `/tmp/vgpulock/lock`
Observed Behavior
```
# In container logs, we see repeated messages:
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
# ... continues for dozens of iterations

# On the host, watching the lock file:
watch -n 0.5 'ls -la /tmp/vgpulock/'
# Lock file appears/disappears rapidly, showing high contention
```
Using LD_DEBUG to diagnose:
```
time LD_DEBUG=libs,statistics ./our_binary
```
- Dynamic library loading phase: fast (a few seconds)
- `initialize program` phase: very slow (~1 minute)
The delay occurs during CUDA initialization inside libvgpu.so, not during library loading.
Impact
| Metric | Impact |
|---|---|
| Process startup latency | Increased from <1s to ~60s |
| GPU utilization | Decreased due to initialization queuing |
| Task throughput | Significantly reduced |
| Resource efficiency | Pods idle while waiting for lock |
In a queue-based workload, this means:
- Tasks pile up in the queue
- GPU resources sit idle while processes wait for the lock
- Overall cluster throughput drops dramatically
Root Cause Analysis
After analyzing the source code, we identified the following bottlenecks:
1. unified_lock implementation (libvgpu/src/utils.c)
```c
const char* unified_lock = "/tmp/vgpulock/lock";
const int retry_count = 20;

int try_lock_unified_lock() {
    int fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    int cnt = 0;
    while (fd == -1 && cnt <= retry_count) {
        LOG_MSG("unified_lock locked, waiting 1 second...");
        sleep(rand() % 5 + 1);  // Random wait 1-5 seconds!
        cnt++;
        fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    }
    // ...
}
```
Issues:
- Uses file existence (`O_CREAT | O_EXCL`) as the lock mechanism
- Sleeps 1-5 seconds per retry (far too long)
- Up to 20 retries, so a worst case of roughly 100 seconds
- All containers on the node share this single lock via hostPath
2. Lock is acquired unconditionally (libvgpu/src/libvgpu.c)
```c
void postInit() {
    allocator_init();
    map_cuda_visible_devices();
    try_lock_unified_lock();  // Called UNCONDITIONALLY
    nvmlReturn_t res = set_task_pid();
    try_unlock_unified_lock();
    // env_utilization_switch is set AFTER the lock is released
    env_utilization_switch = set_env_utilization_switch();
}
```
Key finding: the lock is acquired before any environment variable checks. This means:
- `GPU_CORE_UTILIZATION_POLICY=disable` does NOT skip the lock
- `disablecorelimit=true` does NOT reduce startup lock contention
- Only `CUDA_DISABLE_CONTROL=true` bypasses it, but that disables all vGPU features
3. Slow operations while holding the lock
`set_task_pid()` executes expensive NVML/CUDA operations while holding the lock:
- `nvmlInit()`
- `nvmlDeviceGetComputeRunningProcesses()` (iterates all GPUs)
- `cuDevicePrimaryCtxRetain()` (CUDA context creation)
Suggested Improvements
Short-term fixes
- Reduce sleep time: change `sleep(rand() % 5 + 1)` to `usleep((rand() % 100 + 10) * 1000)` (10-110 ms)
- Use proper locking: replace `O_CREAT | O_EXCL` with `flock()`, which supports blocking waits
- Reduce retry count: 20 retries is excessive
Medium-term improvements
- Move slow operations outside the critical section: execute the NVML/CUDA calls before acquiring the lock, and hold the lock only while updating shared memory
- Add configuration option: Allow users to disable the lock for memory-only isolation scenarios
Example of improved locking
```c
// Use flock instead of file existence
int fd = open(unified_lock, O_CREAT | O_RDWR, 0666);
if (fd != -1 && flock(fd, LOCK_EX) == 0) {  // Blocks efficiently without polling
    // Critical section
    flock(fd, LOCK_UN);
}
if (fd != -1) close(fd);
```
Unlike the file-existence scheme, an `flock()` lock is released automatically by the kernel when the holding process exits, so a crashed process cannot leave a stale lock behind.
Related Issues
- #976 - `unified_lock locked, waiting 1 second` (PyTorch distributed training)
- #696 - Repeated lock-waiting messages when running test cases after deploying HAMi-core
- #588 - Deadlock with vGPU under gunicorn with multiple workers
Questions
- Is there any planned improvement for the lock mechanism in upcoming releases?
- Is there a configuration option we might have missed that could help our scenario?
- Would a PR with the suggested improvements be welcome?
Thank you for your time and for the great work on HAMi!