fix: restore set_recent_kernel(2) in rate_limiter() to match original behavior

nishitnshah · nishitnshah · commit 9c1c6ffcc3f1 · 2026-03-24T15:59:55.000-07:00
Restore the unconditional set_recent_kernel(2) call that was removed
in the rate_limiter optimization. The write has negligible cost (a
cache-line store of roughly 100-200 ns) compared to the other savings
in this function, and removing it changes observable shared-memory
state, which could affect external tooling or future features.

The call is placed after the cached sm_limit/util_switch fast-exit,
matching the original position relative to the get_recent_kernel()
guard. All other optimizations (cached limits, removed duplicate
sm_limit call, reduced sleep) are preserved.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
diff --git a/src/multiprocess/multiprocess_utilization_watcher.c b/src/multiprocess/multiprocess_utilization_watcher.c
@@ -50,6 +50,7 @@ void rate_limiter(int grids, int blocks) {
   while (get_recent_kernel() < 0) {
     usleep(1000);
   }
+  set_recent_kernel(2);
 
   LOG_DEBUG("grid: %d, blocks: %d", grids, blocks);
   LOG_DEBUG("launch kernel %ld, curr core: %ld", kernel_size, g_cur_cuda_cores);
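
The ordering described in the commit message (cached fast-exit first, then the get_recent_kernel() guard, then the restored write) can be sketched roughly as below. This is a minimal standalone illustration, not the project's code: the static variables stand in for the shared-memory region, the names cached_sm_limit, cached_util_switch, and rate_limiter_sketch are hypothetical, and the exact fast-exit conditions are assumed.

```c
#include <unistd.h>

/* Hypothetical stand-ins for state the real code keeps in shared memory. */
static int recent_kernel = 0;
static int cached_sm_limit = 50;    /* assumed: cached SM limit, in percent */
static int cached_util_switch = 1;  /* assumed: cached utilization switch */

static int get_recent_kernel(void) { return recent_kernel; }
static void set_recent_kernel(int v) { recent_kernel = v; }

/* Returns 1 if the limiter path ran, 0 if the cached fast-exit was taken. */
static int rate_limiter_sketch(int grids, int blocks) {
    (void)grids;
    (void)blocks;

    /* Assumed fast-exit: no limit configured, or limiting switched off. */
    if (cached_sm_limit >= 100 || cached_sm_limit == 0 ||
        cached_util_switch == 0) {
        return 0;
    }

    /* Guard from the diff: wait while the flag is negative. */
    while (get_recent_kernel() < 0) {
        usleep(1000);
    }

    /* The restored unconditional write: mark a recent kernel launch. */
    set_recent_kernel(2);

    /* Token accounting and throttling would follow here in the real code. */
    return 1;
}
```

In this sketch the write is skipped only when the fast-exit is taken; once the limiter path is entered, the flag is always set to 2, regardless of its previous non-negative value.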
