
merge pr165 commits for Unoptimized library implementation causing CUDA API slow #166

Open
maverick123123 wants to merge 11 commits into Project-HAMi:main from maverick123123:fix/rate_limiter

Conversation


@maverick123123 maverick123123 commented Mar 25, 2026

HAMi-core Performance Optimizations

Background

The HAMi-core CUDA library hijacking layer was introducing ~23% overhead
on training workloads. Profiling identified the overhead in per-call work
performed by intercepted CUDA functions — not in the actual resource
limiting logic, but in bookkeeping (logging, status checks, shared memory
reads) that executed on every CUDA API call regardless of whether limiting
was active.

This document describes the P0 (critical) and P1 (high-impact)
optimizations applied to reduce this overhead.


Commit 1 — [P0] Cache log level (log_utils.h, utils.c, libvgpu.c)

Problem

Every LOG_DEBUG, LOG_INFO, LOG_WARN, and LOG_MSG macro called
getenv("LIBCUDA_LOG_LEVEL") and atoi() on every invocation.
getenv() performs a linear scan of the environment block. These macros
appear in every intercepted CUDA function, so this overhead accumulated
across thousands of calls per second.

Fix

  • Added int g_log_level global variable (default: 2 = warn level).
  • Added log_utils_init() function that reads LIBCUDA_LOG_LEVEL once.
  • Rewrote all LOG_* macros to check g_log_level (a single integer
    comparison) instead of calling getenv() + atoi().
  • log_utils_init() is called from preInit() in libvgpu.c.
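
The steps above can be sketched as follows. This is a minimal illustration, not the actual code: the real macro bodies in log_utils.h may differ, and the log-level numbering (2 = WARN, 4 = DEBUG) is taken from this description.

```c
#include <stdio.h>
#include <stdlib.h>

/* Cached log level; default 2 (WARN) matches the behavior when
 * LIBCUDA_LOG_LEVEL is unset. */
int g_log_level = 2;

/* Called once from preInit(): reads the env var a single time. */
void log_utils_init(void) {
    const char *lvl = getenv("LIBCUDA_LOG_LEVEL");
    if (lvl != NULL)
        g_log_level = atoi(lvl);
}

/* Per-call cost is now one integer comparison, not getenv() + atoi(). */
#define LOG_WARN(fmt, ...) \
    do { if (g_log_level >= 2) fprintf(stderr, "[WARN] " fmt "\n", ##__VA_ARGS__); } while (0)
#define LOG_DEBUG(fmt, ...) \
    do { if (g_log_level >= 4) fprintf(stderr, "[DEBUG] " fmt "\n", ##__VA_ARGS__); } while (0)
```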

Behavior preserved

  • Default log level (env unset) remains 2 (WARN/MSG/ERROR), matching the
    original behavior where LOG_WARN/LOG_MSG logged when env was NULL.
  • LOG_ERROR remains unconditional (always emitted).
  • LIBCUDA_LOG_LEVEL env var still controls log level; it is simply
    read once at library load time.

Testing

  • Set LIBCUDA_LOG_LEVEL=4 and verify DEBUG output appears.
  • Set LIBCUDA_LOG_LEVEL=0 and verify only ERROR output appears.
  • Unset LIBCUDA_LOG_LEVEL and verify WARN/MSG/ERROR output appears
    (same as before).

Commit 2 — [P0] Use cached slot in wait_status_self() (multiprocess_memory_limit.c)

Problem

wait_status_self() is called via the ENSURE_RUNNING() macro on every
kernel launch and memory operation. It performed a linear scan through
all process slots (up to 1024 entries), comparing PIDs via getpid() to
find the current process's status field. This was O(n) per call.

Fix

  • Use the already-cached region_info.my_slot pointer (set during
    init_proc_slot_withlock() at startup) for O(1) direct access.
  • Fall back to linear scan only if my_slot is NULL (defensive, should
    not happen after initialization).
  • Used proper atomic_load_explicit with memory_order_acquire for
    reading shared-memory fields.

Behavior preserved

  • Returns the same values as before: 1 if status matches, 0 if not, -1
    if process not found.
  • The slow path (linear scan) is identical to the original logic.

Testing

  • Run any CUDA workload with HAMi — ENSURE_RUNNING() is exercised on
    every kernel launch and memory allocation.

Commit 3 — [P1] Optimize pre_launch_kernel() (multiprocess_memory_limit.c)

Problem

pre_launch_kernel() runs on every cuLaunchKernel call. It was:

  1. Calling time(NULL) — a syscall into the kernel.
  2. Always acquiring pthread_mutex_lock even when the timestamp had not
    changed (recording interval is 1 second, kernels fire thousands/sec).

Fix

  • Replaced time(NULL) with clock_gettime(CLOCK_REALTIME_COARSE),
    which is served from the Linux vDSO (no syscall). Resolution is ~1-4ms,
    which is irrelevant for a 1-second recording interval. Uses
    CLOCK_REALTIME_COARSE (not MONOTONIC) to preserve epoch-time
    semantics for dump_shrreg and other consumers.
  • Added double-checked locking: check the timestamp before acquiring
    the mutex. The fast path (>99.99% of calls) becomes a single memory
    read + integer comparison. The mutex is only taken when an update is
    actually needed (~once per second).

Correctness notes

  • The unlocked read of region_info.last_kernel_time is safe: uint64_t
    reads are atomic on x86-64 and aarch64 (aligned). A torn read would at
    worst cause one unnecessary mutex acquisition, not incorrect behavior.
  • The atomic CAS update to the shared region is unchanged.

Testing

  • Run dump_shrreg tool while a CUDA workload is active — verify
    last_kernel_time still updates correctly (once per second).
  • Run a kernel-intensive workload and compare throughput with/without
    this change.

Commit 4 — [P1] Optimize rate_limiter() (multiprocess_utilization_watcher.c)

Problem

rate_limiter() runs on every kernel launch when pidfound==1. Before
reaching the actual rate-limiting CAS loop, it performed:

  • get_recent_kernel() — shared memory read
  • set_recent_kernel(2) — shared memory write (always writing 2, which
    was already the value — a no-op that dirtied a cross-process cache line)
  • get_current_device_sm_limit(0) — called twice (redundant)
  • get_utilization_switch() — shared memory read

That is 3 shared memory reads + 1 write + 2 ensure_initialized() calls
on every kernel launch, even when rate limiting was inactive.

Fix

  • Cache sm_limit and utilization_switch in static locals during
    init_utilization_watcher(). These values are set at container startup
    and do not change at runtime.
  • Fast-exit check uses cached locals: when limiting is inactive
    (sm_limit >= 100 or == 0), rate_limiter returns after a single
    branch on a local variable.
  • Removed set_recent_kernel(2) — eliminated the shared memory write.
  • Removed the duplicate get_current_device_sm_limit(0) call.
  • Reduced sleep(1) to usleep(1000) in the defensive recent_kernel
    guard (currently unreachable but safer if triggered externally).
  • The CAS spin loop and 10ms nanosleep backoff are unchanged,
    preserving correct rate-limiting when 0 < sm_limit < 100.
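
A sketch of the fast-exit check, with hypothetical helper names; in the actual change these statics are populated inside init_utilization_watcher(), and the exact semantics of utilization_switch are an assumption here:

```c
/* Cached at startup; these values do not change at runtime. */
static volatile int cached_sm_limit;
static volatile int cached_util_switch;

/* Hypothetical init helper standing in for the caching done in
 * init_utilization_watcher(). */
void init_cached_limits(int sm_limit, int util_switch) {
    cached_sm_limit = sm_limit;
    cached_util_switch = util_switch;
}

/* Returns 1 when limiting is inactive, letting rate_limiter() exit
 * after a single branch on locals instead of shared-memory reads. */
int rate_limiting_inactive(void) {
    if (!cached_util_switch)
        return 1;
    if (cached_sm_limit >= 100 || cached_sm_limit == 0)
        return 1;
    return 0;  /* 0 < sm_limit < 100: proceed to the CAS token loop */
}
```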

Behavior preserved

  • When SM limiting is active (0 < sm_limit < 100), the token-bucket
    mechanism works identically — the only difference is reaching the CAS
    loop faster (fewer shared memory reads before it).
  • The utilization watcher thread is still created under the same
    conditions.

Testing

  • With CUDA_DEVICE_SM_LIMIT=50: verify SM utilization is capped as
    before. Run a compute-heavy workload and confirm utilization stays
    near the configured limit.
  • With CUDA_DEVICE_SM_LIMIT=100 (or unset): verify no rate limiting
    occurs and kernel throughput matches native (no HAMi) baseline.

Commit 5 — [P1] Remove dead cuDeviceGetCount in oom_check() (allocator.c)

Problem

oom_check() called cuDeviceGetCount() on every memory allocation,
storing the result in count1 — which was never read. This was a wasted
CUDA driver API call on every allocation.

Fix

Removed the dead cuDeviceGetCount call and the unused count1
variable. The function only needs the specific device ID passed via the
dev parameter, not the total device count.

Testing

  • Run memory allocation tests (test_alloc, test_runtime_alloc, etc.)
    to verify OOM checking still works correctly.

Expected Impact

Change                              Per-call overhead removed            Frequency
Cached log level                    getenv() + atoi() per LOG macro      Every CUDA call
Cached my_slot in wait_status_self  O(n) linear scan of process slots    Every kernel launch + memory op
vDSO clock + double-checked lock    time() syscall + mutex lock/unlock   Every kernel launch
Cached rate_limiter limits          3 shared mem reads + 1 write         Every kernel launch
Remove dead cuDeviceGetCount        1 driver API call                    Every memory allocation

Combined, these changes should reduce the hijacking overhead from ~23%
to under 5% for typical training workloads.

How to benchmark

# Baseline (no HAMi):
python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py

# With HAMi (original):
LD_PRELOAD=/path/to/original/libvgpu.so \
  python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py

# With HAMi (optimized):
LD_PRELOAD=/path/to/optimized/libvgpu.so \
  python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py

Compare wall-clock time and throughput (samples/sec) across the three runs.

hami-robot bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maverick123123
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Every LOG_DEBUG/LOG_INFO/LOG_WARN/LOG_MSG macro was calling
getenv("LIBCUDA_LOG_LEVEL") and atoi() on every invocation. On hot
paths (kernel launch, memory alloc/free), this added measurable
overhead from repeated linear scans of the environment block.

Changes:
- Add g_log_level global (default 2 = warn, matching original behavior
  when LIBCUDA_LOG_LEVEL is unset)
- Add log_utils_init() to read the env var once at startup
- Rewrite all LOG_* macros to check g_log_level instead of getenv()
- Call log_utils_init() from preInit() in libvgpu.c

The log level can still be controlled via LIBCUDA_LOG_LEVEL env var;
it is simply read once at library load time instead of on every log
statement. LOG_ERROR remains unconditional (always emitted).

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
wait_status_self() is called via the ENSURE_RUNNING() macro on every
kernel launch and memory operation. It was doing a linear scan through
all process slots (up to 1024) comparing PIDs to find the current
process's status — O(n) per call.

The process slot pointer is already cached in region_info.my_slot
during init_proc_slot_withlock(). Use it directly for an O(1) fast
path. The linear scan is preserved as a fallback for the edge case
where my_slot has not yet been initialized.

Also switched to proper atomic loads with acquire semantics for
reading shared-memory fields in the slow path, consistent with the
rest of the codebase.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…ocking in pre_launch_kernel()

pre_launch_kernel() is called on every cuLaunchKernel invocation.
It was calling time(NULL) — a syscall — and always acquiring
pthread_mutex_lock even when the timestamp hadn't changed.

Changes:
- Replace time(NULL) with clock_gettime(CLOCK_REALTIME_COARSE), which
  is served from the Linux vDSO (no syscall, ~1-4ms resolution).
  This is a safe drop-in: the recording interval is 1 second, so
  millisecond jitter is irrelevant. CLOCK_REALTIME_COARSE gives epoch
  time like time(), so dump_shrreg and other consumers are unaffected.
- Add double-checked locking: check the timestamp before acquiring the
  mutex. On the fast path (>99.99% of calls, since kernels fire
  thousands/sec but interval is 1s), this becomes a single memory
  read + integer comparison — no syscall, no mutex.
- The unlocked read of region_info.last_kernel_time is safe: uint64_t
  reads are atomic on x86-64 and aarch64 (aligned), and a torn read
  would at worst cause one extra mutex acquisition.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…t shared memory ops

rate_limiter() is called on every kernel launch when pidfound==1.
It was performing 3 shared memory reads, 1 shared memory write, and
2 ensure_initialized() calls before reaching the actual rate-limiting
CAS loop — all unnecessary per-call overhead.

Changes:
- Cache sm_limit and utilization_switch in static locals during
  init_utilization_watcher(). These values are set at container
  startup and do not change at runtime.
- Use cached values for the fast-exit check instead of reading from
  shared memory on every call. When limiting is inactive (sm_limit
  >= 100 or == 0), rate_limiter becomes a single branch on a local
  variable.
- Remove set_recent_kernel(2) — it unconditionally wrote 2 to shared
  memory, but the value was already 2 (set at init and never changed
  to anything else). This dirtied a cross-process cache line on every
  kernel launch for no effect.
- Remove duplicate get_current_device_sm_limit(0) call (was called
  twice with identical arguments).
- Reduce sleep(1) to usleep(1000) in the defensive recent_kernel
  guard (unreachable in current codebase, but safer if triggered).
- The actual CAS spin loop and 10ms nanosleep backoff are unchanged,
  preserving correct rate-limiting behavior when 0 < sm_limit < 100.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
oom_check() called cuDeviceGetCount() on every memory allocation,
but never used the result — the variable count1 was written to and
then discarded. This was a wasted driver API call on every alloc.

Remove the dead call entirely. The device count does not change at
runtime and is not needed by this function's logic, which only
operates on the specific device passed via the dev parameter.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
Document all P0 and P1 changes for reducing CUDA hijacking overhead:
- Problem description, fix rationale, behavior preservation notes,
  and testing guidance for each commit.
- Expected impact summary table.
- Benchmarking instructions.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
g_log_level, fp1, and log_utils_init() were defined in utils.c, which
is only compiled into libvgpu.so. The shrreg-tool executable links
multiprocess_mod (which uses LOG_* macros referencing g_log_level)
but does not link utils.o, causing undefined reference errors.

Fix: extract these definitions into a standalone log_utils.c and add
it to both the main libvgpu library and the shrreg-tool executable
in the CMake build files.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
The previous optimization moved the sm_limit fast-exit before the
get_recent_kernel()/set_recent_kernel(2) calls, which meant the
"GPU is active" signal was no longer written when SM limiting was
disabled. While current code only reads recent_kernel within
rate_limiter itself, the shared memory field could be observed by
external tooling or future features for OOM decisions or memory
accounting.

Restore the original call order: recent_kernel read/write happens
unconditionally on every call, then the cached sm_limit/util_switch
check determines whether to proceed to the CAS rate-limiting loop.

The remaining optimizations are preserved:
- Cached sm_limit/utilization_switch (eliminates 3 shared memory
  reads + 2 ensure_initialized calls)
- Reduced sleep(1) to usleep(1000) in the defensive guard
- Removed duplicate get_current_device_sm_limit(0) call

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…er()"

This reverts commit bfea6e1.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
… behavior

Restore the unconditional set_recent_kernel(2) call that was removed
in the rate_limiter optimization. The write has negligible cost (a
~100-200 ns cache-line store) compared to the other savings in this function,
and removing it changes observable shared memory state which could
affect external tooling or future features.

The call is placed after the cached sm_limit/util_switch fast-exit,
matching the original position relative to the get_recent_kernel()
guard. All other optimizations (cached limits, removed duplicate
sm_limit call, reduced sleep) are preserved.

Signed-off-by: nishitnshah <nishshah@linkedin.com>

Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
@maverick123123 changed the title from "merge pr165 optimize rate_limiter() two fixs" to "merge pr165 commits for Unoptimized library implementation causing CUDA API slow" on Mar 30, 2026