[FEATURE] Optimize CUDA attach latency for PTX-based GPU injection #552
Description
Is your feature request related to a problem? Please describe.
Current CUDA/GPU attach on master is still dominated by the PTX extraction/patch/compile path. On a small llama.cpp workload with a 1B model, the first cold fatbin attach still spends tens of seconds in patch+compile before the workload can continue.
This issue tracks optimization of that attach path and records the full measurement results used as the baseline.
Describe the solution you'd like
Reduce end-to-end GPU attach latency, especially the first cold attach, for PTX-based injection workloads.
At minimum, the optimization target should cover:
- PTX extraction overhead
- PTX patch latency
- PTX compile latency
- Module load latency
- Repeated fatbin handling during a single process run
Test environment
- Repo baseline used for measurement: `master@99f5225643563ecae8c8daeafacb40d06d122b4e`
- Measurement workspace: fresh git worktree on `master`
- GPU: NVIDIA GeForce RTX 5090
- Driver: `575.57.08`
- CUDA toolchain used for the llama.cpp PTX-enabled build: `/usr/local/cuda-12.9`
- Python torch wheel available locally during testing: `torch 2.8.0+cu128`
For timing collection I added temporary logging in:
- `attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp`
- `attach/nv_attach_impl/nv_attach_fatbin_record.cpp`
llama.cpp test setup
A 1B model was deliberately chosen for the llama.cpp injection test to keep the workload small.
- Model: `tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`
- Build: PTX-enabled llama.cpp build with `CMAKE_CUDA_ARCHITECTURES=120-real;120-virtual`
- Probe example: `example/gpu/llama-cpp-test/threadhist`
- Target symbol: `_Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_`
Reproduction command shape:

```shell
BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
bpftime -i .install-rel load example/gpu/llama-cpp-test/threadhist > loader.log 2>&1 &
BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
LD_LIBRARY_PATH=$LLAMA_BUILD/bin \
timeout 120s bpftime -i .install-rel start \
  $LLAMA_BUILD/bin/llama-cli \
  -m $MODEL \
  -ngl 999 \
  -p 'Hello' \
  -n 1 \
  -no-cnv \
  --no-warmup \
  --no-display-prompt > target.log 2>&1
```

llama.cpp timing results
Captured 48 complete fatbin timing groups from a single run.
Summary:
| Case | Extract ms | Patch ms | Compile ms | Load ms | Attach total ms |
|---|---|---|---|---|---|
| First cold fatbin | 457 | 6729 | 21165 | 69 | 27965 |
| Second fatbin | 541 | 498 | 1208 | 29 | 1737 |
| Mean of fatbins 3..48 | 535.15 | 121.37 | 61.30 | 27.02 | 210.48 |
| Min total over all 48 | - | - | - | - | 197 |
| Max total over all 48 | - | - | - | - | 27965 |
Full per-fatbin results:

```csv
idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204
```
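As a sanity check, the summary table can be recomputed from the per-fatbin CSV above. A minimal standalone sketch with the measured rows inlined, using only the Python standard library:

```python
import csv
import io
import statistics

# Per-fatbin timings copied verbatim from the run above.
DATA = """idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
# "Steady state" = fatbins 3..48, matching the summary table.
steady = [r for r in rows if int(r["idx"]) >= 3]

def mean_ms(column: str) -> float:
    return round(statistics.fmean(int(r[column]) for r in steady), 2)

summary = {c: mean_ms(c) for c in
           ("extract_ms", "patch_ms", "compile_ms", "load_ms", "attach_total_ms")}
totals = [int(r["attach_total_ms"]) for r in rows]

print(summary)                    # steady-state means over fatbins 3..48
print(min(totals), max(totals))  # 197 27965
```

This reproduces the "Mean of fatbins 3..48" row as well as the min/max totals in the summary table.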
Important caveat from the llama.cpp run
This run successfully measured attach-time stages, but the target kretprobe did not actually fire end-to-end.
Observed facts from the logs:
- The loader side created the handler for the target `rms_norm_f32` symbol.
- The `threadhist` counters stayed at `0`.
- The target log contained:

```
Failed to find NVPTX target: Unable to find target for this triple (no targets are registered)
Unable to run pass on kernel _Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_: 64
```
So the timing numbers above are valid attach-path timings, but not yet a fully successful probe-hit measurement.
PyTorch test result
I also tried to get a comparable GPU attach number from PyTorch.
What I found on this machine:
- Installed wheel: `torch 2.8.0+cu128`
- `nm -D libtorch_cuda.so` shows the expected internal CUDA kernel symbols, such as `bitonicSortKVInPlace...`
- But `cuobjdump --dump-ptx libtorch_cuda.so | rg '\.version|\.entry|bitonicSortKVInPlace'` returned no PTX entries
- The `cuobjdump --dump-ptx` output only showed `Fatbin elf code` sections, which means this local wheel is effectively cubin-only for this purpose
Result:
- I could not get a valid PyTorch internal-kernel GPU attach timing from the currently installed wheel
- This is not just a missing-example issue; the current binary does not expose usable PTX for the bpftime PTX patch path
This matches the current example doc in `example/gpu/pytorch-test/README.md`, which already requires building PyTorch from source with PTX included.
For this machine, the example README's old `TORCH_CUDA_ARCH_LIST=6.1+PTX` is not the right arch target. On RTX 5090 the source-build path should instead use a 12.0 PTX target, for example `TORCH_CUDA_ARCH_LIST=12.0+PTX`.
Why this issue matters
The cold attach path is currently too slow for practical dynamic injection on real GPU applications:
- first cold attach: about `27.97 s`
- second fatbin in the same run: about `1.74 s`
- steady-state later fatbins: about `0.21 s` attach time each, still with about `0.54 s` PTX extraction each
Even on a very small 1B llama.cpp setup, attach is still dominated by patch/compile on the first relevant fatbin.
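To quantify "dominated by patch/compile", the relevant ratios can be derived directly from the summary table (a quick arithmetic check on the reported numbers, not a new measurement):

```python
# Numbers taken from the summary table above.
cold_total_ms = 27965
cold_patch_ms = 6729
cold_compile_ms = 21165
steady_total_ms = 210.48  # mean attach total over fatbins 3..48

# Share of the first cold attach spent in PTX patch + compile.
patch_compile_share = (cold_patch_ms + cold_compile_ms) / cold_total_ms
# How much slower the first cold attach is than a steady-state attach.
cold_penalty = cold_total_ms / steady_total_ms

print(f"patch+compile share of cold attach: {patch_compile_share:.1%}")  # 99.7%
print(f"cold vs steady-state attach: {cold_penalty:.0f}x")               # 133x
```

So roughly all of the first cold attach is patch+compile, and that first attach costs about two orders of magnitude more than a steady-state attach in the same run.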
Additional context
This issue is intended to track performance optimization of the current attach implementation. A follow-up item is still needed to fix the NVPTX target registration / pass execution problem so that the attach benchmark can be paired with a confirmed successful probe hit in the same run.