
[Performance] High latency (~1 minute) when hundreds of processes initialize libvgpu.so concurrently #1662

@shizheng163

Description


Summary

In a high-density vGPU deployment scenario, when hundreds of processes start simultaneously and initialize CUDA, each process experiences significant startup delay (up to 1 minute) due to lock contention in libvgpu.so. This severely impacts resource utilization and task throughput.

Environment

  • HAMi version: v2.8.0
  • GPU: 8x NVIDIA GPU (48GB each)
  • Kubernetes: 1.32.4
  • Node density: 40-50 Pods per node
  • Processes per Pod: 4-5 child processes using vGPU
  • Total concurrent processes: ~200-300 per node

Use Case Description

We run an inference service cluster with the following characteristics:

  • Each Pod requests a few GB of GPU memory (e.g., 2-4GB)
  • With 8 GPUs × 48GB each = 384GB total, we can run 40-50 Pods per node
  • Each Pod consumes tasks from a queue serially
  • When processing a task, the Pod spawns multiple child processes that all require GPU memory (vGPU)
  • This results in hundreds of processes initializing CUDA simultaneously on a single node

Problem

When a batch of tasks arrives, all Pods start processing and spawn child processes at roughly the same time. We observe:

  1. Startup command issued → process should start
  2. Actual process start → delayed by ~1 minute
  3. Root cause: All processes compete for the global unified_lock in /tmp/vgpulock/lock

Observed Behavior

# In container logs, we see repeated messages:
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
[HAMI-core Msg]: unified_lock locked, waiting 1 second...
# ... continues for dozens of iterations
# On host, watching the lock file:
watch -n 0.5 'ls -la /tmp/vgpulock/'
# Lock file appears/disappears rapidly, showing high contention

Using LD_DEBUG to diagnose

time LD_DEBUG=libs,statistics ./our_binary

  • Dynamic library loading phase: fast (a few seconds)
  • "initialize program" phase: very slow (~1 minute)

The delay occurs during CUDA initialization inside libvgpu.so, not during library loading.

Impact

  • Process startup latency: increased from <1 s to ~60 s
  • GPU utilization: decreased due to initialization queuing
  • Task throughput: significantly reduced
  • Resource efficiency: Pods sit idle while waiting for the lock

In a queue-based workload, this means:

  • Tasks pile up in the queue
  • GPU resources sit idle while processes wait for the lock
  • Overall cluster throughput drops dramatically

Root Cause Analysis

After analyzing the source code, we identified the following bottlenecks:

1. unified_lock implementation (libvgpu/src/utils.c)

const char* unified_lock = "/tmp/vgpulock/lock";
const int retry_count = 20;

int try_lock_unified_lock() {
    int fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    int cnt = 0;
    while (fd == -1 && cnt <= retry_count) {
        LOG_MSG("unified_lock locked, waiting 1 second...");
        sleep(rand() % 5 + 1);   // Random wait 1-5 seconds!
        cnt++;
        fd = open(unified_lock, O_CREAT | O_EXCL, S_IRWXU);
    }
    // ...
}

Issues:

  • Uses file existence (O_CREAT | O_EXCL) as lock mechanism
  • Sleep time is 1-5 seconds per retry (too long)
  • The loop body runs up to 21 times (cnt <= retry_count, with cnt starting at 0) at up to 5 seconds each ≈ 105 seconds worst case
  • All containers on the node share this single lock via hostPath
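The worst case can be made explicit with a small helper (a sanity check only, using the constants from the loop above):

```c
/* Worst-case wait in try_lock_unified_lock(): the loop body executes
 * while cnt <= retry_count, i.e. retry_count + 1 = 21 times, and each
 * sleep(rand() % 5 + 1) lasts at most 5 seconds. */
static int worst_case_wait_s(int retry_count, int max_sleep_s) {
    return (retry_count + 1) * max_sleep_s;  /* cnt runs 0..retry_count */
}
```

With retry_count = 20 and a 5-second maximum sleep this yields 105 seconds, which matches the ~1 minute delays we observe once average-case sleeps are factored in.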

2. Lock is acquired unconditionally (libvgpu/src/libvgpu.c)

void postInit() {
    allocator_init();
    map_cuda_visible_devices();
    
    try_lock_unified_lock();      // Called UNCONDITIONALLY
    nvmlReturn_t res = set_task_pid();
    try_unlock_unified_lock();
    
    // env_utilization_switch is set AFTER lock is released
    env_utilization_switch = set_env_utilization_switch();
}

Key finding: The lock acquisition happens before any environment variable checks. This means:

  • GPU_CORE_UTILIZATION_POLICY=disable does NOT skip the lock
  • disablecorelimit=true does NOT reduce startup lock contention
  • Only CUDA_DISABLE_CONTROL=true bypasses this, but it disables all vGPU features
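One way this gap could be closed is an early environment check before the lock is even attempted. A minimal sketch; the variable name HAMI_SKIP_UNIFIED_LOCK is an invented placeholder, not an existing HAMi option:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical opt-out: skip the startup lock when a deployment only
 * needs memory isolation. HAMI_SKIP_UNIFIED_LOCK is an assumed name. */
static int unified_lock_skipped(void) {
    const char *v = getenv("HAMI_SKIP_UNIFIED_LOCK");
    return v != NULL && (strcmp(v, "true") == 0 || strcmp(v, "1") == 0);
}
```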

3. Slow operations while holding the lock

set_task_pid() executes expensive NVML/CUDA operations while holding the lock:

  • nvmlInit()
  • nvmlDeviceGetComputeRunningProcesses() (iterates all GPUs)
  • cuDevicePrimaryCtxRetain() (CUDA context creation)

Suggested Improvements

Short-term fixes

  1. Reduce sleep time: Change sleep(rand() % 5 + 1) to usleep((rand() % 100 + 10) * 1000) (10-109 ms)
  2. Use proper locking: Replace O_CREAT | O_EXCL with flock() which supports blocking wait
  3. Reduce retry count: 20 retries is excessive
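A minimal sketch combining the three fixes above (non-blocking flock() with millisecond backoff); the 10-attempt budget is an illustrative choice, not a current HAMi default:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Bounded non-blocking flock() with 10-109 ms random backoff.
 * Returns the open fd with the lock held, or -1 on failure/timeout.
 * The caller releases with flock(fd, LOCK_UN) and close(fd). */
static int try_lock_short_backoff(const char *path) {
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd == -1)
        return -1;
    for (int attempt = 0; attempt < 10; attempt++) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fd;                              /* lock acquired */
        usleep((rand() % 100 + 10) * 1000);         /* 10-109 ms backoff */
    }
    close(fd);
    return -1;                                      /* gave up */
}
```

Unlike the O_CREAT | O_EXCL scheme, the lock file here is persistent; the kernel releases an flock() automatically if the holder crashes, so no stale lock file can wedge later processes.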

Medium-term improvements

  1. Move slow operations outside critical section: Execute NVML/CUDA calls before acquiring lock, only update shared memory while holding lock
  2. Add configuration option: Allow users to disable the lock for memory-only isolation scenarios
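A sketch of what improvement 1 could look like. collect_gpu_state() and publish_task_pid() are hypothetical stand-ins for the slow NVML/CUDA discovery and the shared-state write that set_task_pid() currently does together under the lock; the stubs exist only to make the shape of the change concrete:

```c
/* Hypothetical stand-ins for illustration; not real HAMi symbols. */
struct gpu_state { int task_pid; };

static struct gpu_state collect_gpu_state(void) {
    /* Slow part: nvmlInit(), process enumeration, context creation.
     * Runs lock-free, so hundreds of processes can do it in parallel. */
    struct gpu_state s = { .task_pid = 0 };
    return s;
}

static void publish_task_pid(const struct gpu_state *s) {
    /* Fast part: only the shared-memory update needs the lock. */
    (void)s;
}

static void try_lock_unified_lock(void) {}   /* stubs for the sketch */
static void try_unlock_unified_lock(void) {}

void post_init_restructured(void) {
    struct gpu_state s = collect_gpu_state(); /* outside critical section */

    try_lock_unified_lock();                  /* short critical section */
    publish_task_pid(&s);
    try_unlock_unified_lock();
}
```

This keeps lock hold time proportional to one shared-memory write rather than to NVML/CUDA initialization, so contention stops compounding across hundreds of starting processes.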

Example of improved locking

// Use flock() on a persistent lock file instead of file existence
int fd = open(unified_lock, O_CREAT | O_RDWR, 0666);
if (fd != -1) {
    if (flock(fd, LOCK_EX) == 0) {  // kernel blocks the caller; no sleep/retry polling
        // Critical section
        flock(fd, LOCK_UN);
    }
    close(fd);
}

Related Issues

Questions

  1. Is there any planned improvement for the lock mechanism in upcoming releases?
  2. Is there a configuration option we might have missed that could help our scenario?
  3. Would a PR with the suggested improvements be welcome?

Thank you for your time and for the great work on HAMi!


Labels: kind/bug (Something isn't working)