merge pr165 commits for Unoptimized library implementation causing CUDA API slow #166
Open
maverick123123 wants to merge 11 commits into Project-HAMi:main from
Conversation
Merge pr155 some important commits
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: maverick123123. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Every LOG_DEBUG/LOG_INFO/LOG_WARN/LOG_MSG macro was calling
getenv("LIBCUDA_LOG_LEVEL") and atoi() on every invocation. On hot
paths (kernel launch, memory alloc/free), this added measurable
overhead from repeated linear scans of the environment block.
Changes:
- Add g_log_level global (default 2 = warn, matching original behavior
when LIBCUDA_LOG_LEVEL is unset)
- Add log_utils_init() to read the env var once at startup
- Rewrite all LOG_* macros to check g_log_level instead of getenv()
- Call log_utils_init() from preInit() in libvgpu.c
The log level can still be controlled via LIBCUDA_LOG_LEVEL env var;
it is simply read once at library load time instead of on every log
statement. LOG_ERROR remains unconditional (always emitted).
Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
wait_status_self() is called via the ENSURE_RUNNING() macro on every kernel launch and memory operation. It was doing a linear scan through all process slots (up to 1024), comparing PIDs to find the current process's status — O(n) per call.

The process slot pointer is already cached in region_info.my_slot during init_proc_slot_withlock(). Use it directly for an O(1) fast path. The linear scan is preserved as a fallback for the edge case where my_slot has not yet been initialized.

Also switched to proper atomic loads with acquire semantics for reading shared-memory fields in the slow path, consistent with the rest of the codebase.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…ocking in pre_launch_kernel()

pre_launch_kernel() is called on every cuLaunchKernel invocation. It was calling time(NULL) — a syscall — and always acquiring pthread_mutex_lock even when the timestamp hadn't changed.

Changes:
- Replace time(NULL) with clock_gettime(CLOCK_REALTIME_COARSE), which is served from the Linux vDSO (no syscall, ~1-4ms resolution). This is a safe drop-in: the recording interval is 1 second, so millisecond jitter is irrelevant. CLOCK_REALTIME_COARSE gives epoch time like time(), so dump_shrreg and other consumers are unaffected.
- Add double-checked locking: check the timestamp before acquiring the mutex. On the fast path (>99.99% of calls, since kernels fire thousands/sec but the interval is 1s), this becomes a single memory read + integer comparison — no syscall, no mutex.
- The unlocked read of region_info.last_kernel_time is safe: uint64_t reads are atomic on x86-64 and aarch64 (aligned), and a torn read would at worst cause one extra mutex acquisition.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…t shared memory ops

rate_limiter() is called on every kernel launch when pidfound==1. It was performing 3 shared memory reads, 1 shared memory write, and 2 ensure_initialized() calls before reaching the actual rate-limiting CAS loop — all unnecessary per-call overhead.

Changes:
- Cache sm_limit and utilization_switch in static locals during init_utilization_watcher(). These values are set at container startup and do not change at runtime.
- Use the cached values for the fast-exit check instead of reading from shared memory on every call. When limiting is inactive (sm_limit >= 100 or == 0), rate_limiter becomes a single branch on a local variable.
- Remove set_recent_kernel(2) — it unconditionally wrote 2 to shared memory, but the value was already 2 (set at init and never changed to anything else). This dirtied a cross-process cache line on every kernel launch for no effect.
- Remove the duplicate get_current_device_sm_limit(0) call (it was called twice with identical arguments).
- Reduce sleep(1) to usleep(1000) in the defensive recent_kernel guard (unreachable in the current codebase, but safer if triggered).
- The actual CAS spin loop and 10ms nanosleep backoff are unchanged, preserving correct rate-limiting behavior when 0 < sm_limit < 100.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
oom_check() called cuDeviceGetCount() on every memory allocation, but never used the result — the variable count1 was written to and then discarded. This was a wasted driver API call on every alloc.

Remove the dead call entirely. The device count does not change at runtime and is not needed by this function's logic, which only operates on the specific device passed via the dev parameter.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
Document all P0 and P1 changes for reducing CUDA hijacking overhead:
- Problem description, fix rationale, behavior preservation notes, and testing guidance for each commit.
- Expected impact summary table.
- Benchmarking instructions.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
g_log_level, fp1, and log_utils_init() were defined in utils.c, which is only compiled into libvgpu.so. The shrreg-tool executable links multiprocess_mod (which uses LOG_* macros referencing g_log_level) but does not link utils.o, causing undefined reference errors.

Fix: extract these definitions into a standalone log_utils.c and add it to both the main libvgpu library and the shrreg-tool executable in the CMake build files.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
The previous optimization moved the sm_limit fast-exit before the get_recent_kernel()/set_recent_kernel(2) calls, which meant the "GPU is active" signal was no longer written when SM limiting was disabled. While the current code only reads recent_kernel within rate_limiter itself, the shared memory field could be observed by external tooling or future features for OOM decisions or memory accounting.

Restore the original call order: the recent_kernel read/write happens unconditionally on every call, then the cached sm_limit/util_switch check determines whether to proceed to the CAS rate-limiting loop.

The remaining optimizations are preserved:
- Cached sm_limit/utilization_switch (eliminates 3 shared memory reads + 2 ensure_initialized calls)
- Reduced sleep(1) to usleep(1000) in the defensive guard
- Removed duplicate get_current_device_sm_limit(0) call

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
…er()"

This reverts commit bfea6e1.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
… behavior

Restore the unconditional set_recent_kernel(2) call that was removed in the rate_limiter optimization. The write has negligible cost (~100-200ns cache line store) compared to the other savings in this function, and removing it changes observable shared memory state, which could affect external tooling or future features.

The call is placed after the cached sm_limit/util_switch fast-exit, matching the original position relative to the get_recent_kernel() guard. All other optimizations (cached limits, removed duplicate sm_limit call, reduced sleep) are preserved.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
HAMi-core Performance Optimizations
Background
The HAMi-core CUDA library hijacking layer was introducing ~23% overhead
on training workloads. Profiling identified the overhead in per-call work
performed by intercepted CUDA functions — not in the actual resource
limiting logic, but in bookkeeping (logging, status checks, shared memory
reads) that executed on every CUDA API call regardless of whether limiting
was active.
This document describes the P0 (critical) and P1 (high-impact)
optimizations applied to reduce this overhead.
Commit 1 — [P0] Cache log level (`log_utils.h`, `utils.c`, `libvgpu.c`)

Problem

Every `LOG_DEBUG`, `LOG_INFO`, `LOG_WARN`, and `LOG_MSG` macro called `getenv("LIBCUDA_LOG_LEVEL")` and `atoi()` on every invocation. `getenv()` performs a linear scan of the environment block. These macros appear in every intercepted CUDA function, so this overhead accumulated across thousands of calls per second.

Fix

- Add an `int g_log_level` global variable (default: 2 = warn level).
- Add a `log_utils_init()` function that reads `LIBCUDA_LOG_LEVEL` once.
- Rewrite the `LOG_*` macros to check `g_log_level` (a single integer comparison) instead of calling `getenv()` + `atoi()`.
- `log_utils_init()` is called from `preInit()` in `libvgpu.c`.

Behavior preserved

- The default level of 2 matches the original behavior, where `LOG_WARN`/`LOG_MSG` logged when the env var was NULL.
- `LOG_ERROR` remains unconditional (always emitted).
- The `LIBCUDA_LOG_LEVEL` env var still controls the log level; it is simply read once at library load time.

Testing

- Set `LIBCUDA_LOG_LEVEL=4` and verify DEBUG output appears.
- Set `LIBCUDA_LOG_LEVEL=0` and verify only ERROR output appears.
- Unset `LIBCUDA_LOG_LEVEL` and verify WARN/MSG/ERROR output appears (same as before).
Commit 2 — [P0] Use cached slot in `wait_status_self()` (`multiprocess_memory_limit.c`)

Problem

`wait_status_self()` is called via the `ENSURE_RUNNING()` macro on every kernel launch and memory operation. It performed a linear scan through all process slots (up to 1024 entries), comparing PIDs via `getpid()` to find the current process's status field. This was O(n) per call.

Fix

- Use the `region_info.my_slot` pointer (set during `init_proc_slot_withlock()` at startup) for O(1) direct access.
- Keep the linear scan as a fallback when `my_slot` is NULL (defensive; should not happen after initialization).
- Use `atomic_load_explicit` with `memory_order_acquire` for reading shared-memory fields.

Behavior preserved

- Same result as before if the process is not found.

Testing

- Run any CUDA workload — `ENSURE_RUNNING()` is exercised on every kernel launch and memory allocation.
Commit 3 — [P1] Optimize `pre_launch_kernel()` (`multiprocess_memory_limit.c`)

Problem

`pre_launch_kernel()` runs on every `cuLaunchKernel` call. It was:

- Calling `time(NULL)` — a syscall into the kernel.
- Acquiring `pthread_mutex_lock` even when the timestamp had not changed (the recording interval is 1 second; kernels fire thousands of times per second).

Fix

- Replace `time(NULL)` with `clock_gettime(CLOCK_REALTIME_COARSE)`, which is served from the Linux vDSO (no syscall). Resolution is ~1-4ms, which is irrelevant for a 1-second recording interval. Uses `CLOCK_REALTIME_COARSE` (not `MONOTONIC`) to preserve epoch-time semantics for `dump_shrreg` and other consumers.
- Add double-checked locking: check the timestamp before acquiring the mutex. The fast path (>99.99% of calls) becomes a single memory read + integer comparison. The mutex is only taken when an update is actually needed (~once per second).

Correctness notes

- The unlocked read of `region_info.last_kernel_time` is safe: aligned `uint64_t` reads are atomic on x86-64 and aarch64. A torn read would at worst cause one unnecessary mutex acquisition, not incorrect behavior.

Testing

- Run the `dump_shrreg` tool while a CUDA workload is active — verify `last_kernel_time` still updates correctly (once per second).
- Measure kernel launch latency before and after this change.
Commit 4 — [P1] Optimize `rate_limiter()` (`multiprocess_utilization_watcher.c`)

Problem

`rate_limiter()` runs on every kernel launch when `pidfound==1`. Before reaching the actual rate-limiting CAS loop, it performed:

- `get_recent_kernel()` — shared memory read
- `set_recent_kernel(2)` — shared memory write (always writing 2, which was already the value — a no-op that dirtied a cross-process cache line)
- `get_current_device_sm_limit(0)` — called twice (redundant)
- `get_utilization_switch()` — shared memory read

That is 3 shared memory reads + 1 write + 2 `ensure_initialized()` calls on every kernel launch, even when rate limiting was inactive.

Fix

- Cache `sm_limit` and `utilization_switch` in static locals during `init_utilization_watcher()`. These values are set at container startup and do not change at runtime.
- When limiting is inactive (`sm_limit >= 100` or `== 0`), `rate_limiter` returns after a single branch on a local variable.
- Removed the no-op `set_recent_kernel(2)` — eliminating the shared memory write.
- Removed the duplicate `get_current_device_sm_limit(0)` call.
- Reduced `sleep(1)` to `usleep(1000)` in the defensive `recent_kernel` guard (currently unreachable but safer if triggered externally).
- The actual CAS spin loop and 10ms `nanosleep` backoff are unchanged, preserving correct rate-limiting when 0 < sm_limit < 100.

Behavior preserved

- When 0 < sm_limit < 100, the rate-limiting mechanism works identically — the only difference is reaching the CAS loop faster (fewer shared memory reads before it).
- No change to limiting thresholds or backoff conditions.

Testing

- With `CUDA_DEVICE_SM_LIMIT=50`: verify SM utilization is capped as before. Run a compute-heavy workload and confirm utilization stays near the configured limit.
- With `CUDA_DEVICE_SM_LIMIT=100` (or unset): verify no rate limiting occurs and kernel throughput matches the native (no HAMi) baseline.
Commit 5 — [P1] Remove dead `cuDeviceGetCount` in `oom_check()` (`allocator.c`)

Problem

`oom_check()` called `cuDeviceGetCount()` on every memory allocation, storing the result in `count1` — which was never read. This was a wasted CUDA driver API call on every allocation.

Fix

Removed the dead `cuDeviceGetCount` call and the unused `count1` variable. The function only needs the specific device ID passed via the `dev` parameter, not the total device count.

Testing

- Run the allocation tests (`test_alloc`, `test_runtime_alloc`, etc.) to verify OOM checking still works correctly.
Expected Impact

- Commit 1: `getenv()` + `atoi()` per LOG macro — eliminated.
- Commit 2: O(n) PID scan — replaced by the cached `my_slot` in `wait_status_self`.
- Commit 3: `time()` syscall + mutex lock/unlock per kernel launch — eliminated on the fast path.
- Commit 5: dead `cuDeviceGetCount` driver call per allocation — removed.

Combined, these changes should reduce the hijacking overhead from ~23% to under 5% for typical training workloads.
How to benchmark

Run the same training workload three ways: natively (no HAMi), with the pre-patch libvgpu.so, and with the patched libvgpu.so. Compare wall-clock time and throughput (samples/sec) across the three runs.