[FEATURE] Optimize CUDA attach latency for PTX-based GPU injection #552

@yunwei37

Description

Is your feature request related to a problem? Please describe.

Current CUDA/GPU attach on master is still dominated by the PTX extraction/patch/compile path. On a small llama.cpp workload with a 1B model, the first cold fatbin attach still spends tens of seconds in patch+compile before the workload can continue.

This issue tracks optimization of that attach path and records the full measurement results used as the baseline.

Describe the solution you'd like

Reduce end-to-end GPU attach latency, especially the first cold attach, for PTX-based injection workloads.

At minimum, the optimization target should cover:

  1. PTX extraction overhead
  2. PTX patch latency
  3. PTX compile latency
  4. Module load latency
  5. Repeated fatbin handling during a single process run
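
Target 5 amounts to memoizing the patch/compile result by fatbin content, so a fatbin seen again in the same process run skips the expensive stages entirely. A minimal sketch of the idea (Python; `compile_fn` is a hypothetical stand-in for the real extract/patch/compile pipeline, not the bpftime API):

```python
import hashlib

class FatbinModuleCache:
    """Memoize compiled modules by fatbin content hash.

    compile_fn stands in for the expensive extract/patch/compile
    pipeline; it is a hypothetical callable, not part of bpftime.
    """

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_module(self, fatbin_bytes: bytes):
        key = hashlib.sha256(fatbin_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1   # repeated fatbin: skip patch + compile
        else:
            self.misses += 1
            self._cache[key] = self._compile_fn(fatbin_bytes)
        return self._cache[key]
```

Keying on a content hash rather than a registration handle would also collapse identical fatbins registered more than once in the same run.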

Test environment

  • Repo baseline used for measurement: master @ 99f5225643563ecae8c8daeafacb40d06d122b4e
  • Measurement workspace: fresh git worktree on master
  • GPU: NVIDIA GeForce RTX 5090
  • Driver: 575.57.08
  • CUDA toolchain used for the llama.cpp PTX-enabled build: /usr/local/cuda-12.9
  • Python torch wheel available locally during testing: torch 2.8.0+cu128

For timing collection I added temporary logging in:

  • attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp
  • attach/nv_attach_impl/nv_attach_fatbin_record.cpp

llama.cpp test setup

The llama.cpp injection test deliberately uses a small 1B-class model.

  • Model: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  • Build: PTX-enabled llama.cpp build with CMAKE_CUDA_ARCHITECTURES=120-real;120-virtual
  • Probe example: example/gpu/llama-cpp-test/threadhist
  • Target symbol: _Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_

Reproduction command shape:

BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
  bpftime -i .install-rel load example/gpu/llama-cpp-test/threadhist > loader.log 2>&1 &

BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
LD_LIBRARY_PATH=$LLAMA_BUILD/bin \
  timeout 120s bpftime -i .install-rel start \
  $LLAMA_BUILD/bin/llama-cli \
  -m $MODEL \
  -ngl 999 \
  -p 'Hello' \
  -n 1 \
  -no-cnv \
  --no-warmup \
  --no-display-prompt > target.log 2>&1

llama.cpp timing results

Captured 48 complete fatbin timing groups from a single run.

Summary:

| Case | Extract ms | Patch ms | Compile ms | Load ms | Attach total ms |
| --- | --- | --- | --- | --- | --- |
| First cold fatbin | 457 | 6729 | 21165 | 69 | 27965 |
| Second fatbin | 541 | 498 | 1208 | 29 | 1737 |
| Mean of fatbins 3..48 | 535.15 | 121.37 | 61.30 | 27.02 | 210.48 |
| Min total over all 48 | - | - | - | - | 197 |
| Max total over all 48 | - | - | - | - | 27965 |

Full per-fatbin results:

idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204
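
The summary rows can be recomputed directly from this CSV; for example, with a stdlib-only Python check (data embedded verbatim from above):

```python
import csv, io, statistics

# Per-fatbin timings copied verbatim from the run above.
CSV = """idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204"""

rows = list(csv.DictReader(io.StringIO(CSV)))
steady = rows[2:]  # fatbins 3..48
cols = ("extract_ms", "patch_ms", "compile_ms", "load_ms", "attach_total_ms")
means = {c: round(statistics.mean(int(r[c]) for r in steady), 2) for c in cols}
totals = [int(r["attach_total_ms"]) for r in rows]
print(means)                     # extract 535.15, patch 121.37, ...
print(min(totals), max(totals))  # 197 27965
```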

Important caveat from the llama.cpp run

This run successfully measured attach-time stages, but the target kretprobe did not actually fire end-to-end.

Observed facts from the logs:

  • Loader side created the handler for the target rms_norm_f32 symbol.
  • threadhist counters stayed at 0.
  • The target log contained:
Failed to find NVPTX target: Unable to find target for this triple (no targets are registered)
Unable to run pass on kernel _Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_: 64

So the timing numbers above are valid attach-path timings, but not yet a fully successful probe-hit measurement.

PyTorch test result

I also tried to get a comparable GPU attach number from PyTorch.

What I found on this machine:

  • Installed wheel: torch 2.8.0+cu128
  • nm -D libtorch_cuda.so shows the expected internal CUDA kernel symbols such as bitonicSortKVInPlace...
  • But cuobjdump --dump-ptx libtorch_cuda.so | rg '\.version|\.entry|bitonicSortKVInPlace' returned no PTX entries
  • The cuobjdump --dump-ptx output only showed Fatbin elf code sections, which means this local wheel is effectively cubin-only for this purpose
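
That cubin-vs-PTX distinction can be scripted by scanning the `cuobjdump --dump-ptx` output for actual PTX directives. A small illustrative helper (function names are mine; the markers assume the textual format `cuobjdump` prints, i.e. `.version`/`.entry` lines for PTX vs. only `Fatbin elf code` sections for cubin-only binaries):

```python
import re
import subprocess

def has_usable_ptx(dump_output: str) -> bool:
    """True if a `cuobjdump --dump-ptx` dump contains real PTX entries
    (a `.version` header or `.entry` kernels), not just
    `Fatbin elf code` (cubin-only) sections."""
    return bool(re.search(r"^\s*\.(version|entry)\b", dump_output, re.M))

def check_library(path: str) -> bool:
    # Requires the CUDA toolkit's cuobjdump on PATH.
    out = subprocess.run(["cuobjdump", "--dump-ptx", path],
                         capture_output=True, text=True).stdout
    return has_usable_ptx(out)
```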

Result:

  • I could not get a valid PyTorch internal-kernel GPU attach timing from the currently installed wheel
  • This is not just a missing-example issue; the current binary does not expose usable PTX for the bpftime PTX patch path

This matches the current example doc in example/gpu/pytorch-test/README.md, which already requires building PyTorch from source with PTX included.

For this machine, the example README's old TORCH_CUDA_ARCH_LIST=6.1+PTX is not the right arch target. On RTX 5090 the source-build path should instead use a 12.0 PTX target, for example:

TORCH_CUDA_ARCH_LIST=12.0+PTX

Why this issue matters

The cold attach path is currently too slow for practical dynamic injection on real GPU applications:

  • first cold attach: about 27.97 s
  • second fatbin in the same run: about 1.74 s
  • steady-state later fatbins: about 0.21 s attach time each, plus about 0.54 s PTX extraction each (extraction is reported separately from the attach total)

Even on a very small 1B llama.cpp setup, attach is still dominated by patch/compile on the first relevant fatbin.
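
Straight arithmetic on the numbers above shows how lopsided the cold path is:

```python
# Figures taken directly from the llama.cpp timing table above (ms).
first = {"patch": 6729, "compile": 21165, "total": 27965}
steady_total_ms = 210.48  # mean attach total, fatbins 3..48

# Patch + compile account for nearly all of the first cold attach.
patch_compile_share = (first["patch"] + first["compile"]) / first["total"]

# First cold attach vs. steady-state attach total.
cold_vs_steady = first["total"] / steady_total_ms

print(f"{patch_compile_share:.1%}")  # 99.7%
print(f"{cold_vs_steady:.0f}x")      # 133x
```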

Additional context

This issue is intended to track performance optimization of the current attach implementation. A follow-up item is still needed to fix the NVPTX target registration / pass execution problem so that the attach benchmark can be paired with a confirmed successful probe hit in the same run.
