[FEATURE] Optimize CUDA attach latency for PTX-based GPU injection #552
Description
Is your feature request related to a problem? Please describe.
Current CUDA/GPU attach on master is still dominated by the PTX extraction/patch/compile path. On a small llama.cpp workload with a 1B model, the first cold fatbin attach still spends tens of seconds in patch+compile before the workload can continue.
This issue tracks optimization of that attach path and records the full measurement results used as the baseline.
Describe the solution you'd like
Reduce end-to-end GPU attach latency, especially the first cold attach, for PTX-based injection workloads.
At minimum, the optimization target should cover:
- PTX extraction overhead
- PTX patch latency
- PTX compile latency
- Module load latency
- Repeated fatbin handling during a single process run
Test environment
- Repo baseline used for measurement: `master@99f5225643563ecae8c8daeafacb40d06d122b4e`
- Measurement workspace: fresh git worktree on `master`
- GPU: NVIDIA GeForce RTX 5090
- Driver: `575.57.08`
- CUDA toolchain used for the llama.cpp PTX-enabled build: `/usr/local/cuda-12.9`
- Python torch wheel available locally during testing: `torch 2.8.0+cu128`
For timing collection I added temporary logging in:
- `attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp`
- `attach/nv_attach_impl/nv_attach_fatbin_record.cpp`
llama.cpp test setup
A 1B model was deliberately chosen for the llama.cpp injection test to keep the workload small.
- Model: `tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`
- Build: PTX-enabled llama.cpp build with `CMAKE_CUDA_ARCHITECTURES=120-real;120-virtual`
- Probe example: `example/gpu/llama-cpp-test/threadhist`
- Target symbol: `_Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_`
Reproduction command shape:

```shell
BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
bpftime -i .install-rel load example/gpu/llama-cpp-test/threadhist > loader.log 2>&1 &
BPFTIME_LOG_OUTPUT=console SPDLOG_LEVEL=info \
LD_LIBRARY_PATH=$LLAMA_BUILD/bin \
timeout 120s bpftime -i .install-rel start \
  $LLAMA_BUILD/bin/llama-cli \
  -m $MODEL \
  -ngl 999 \
  -p 'Hello' \
  -n 1 \
  -no-cnv \
  --no-warmup \
  --no-display-prompt > target.log 2>&1
```

llama.cpp timing results
Captured 48 complete fatbin timing groups from a single run.
Summary:
| Case | Extract ms | Patch ms | Compile ms | Load ms | Attach total ms |
|---|---|---|---|---|---|
| First cold fatbin | 457 | 6729 | 21165 | 69 | 27965 |
| Second fatbin | 541 | 498 | 1208 | 29 | 1737 |
| Mean of fatbins 3..48 | 535.15 | 121.37 | 61.30 | 27.02 | 210.48 |
| Min total over all 48 | - | - | - | - | 197 |
| Max total over all 48 | - | - | - | - | 27965 |
Full per-fatbin results:

```csv
idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204
```
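As a sanity check, the summary table can be recomputed from the per-fatbin CSV above. A minimal standalone sketch with the measured rows inlined, using only the Python standard library:

```python
import csv
import io
import statistics

# Per-fatbin timings copied verbatim from the run above.
DATA = """idx,extract_ms,patch_ms,compile_ms,load_ms,attach_total_ms
1,457,6729,21165,69,27965
2,541,498,1208,29,1737
3,492,128,66,28,224
4,494,122,60,28,212
5,499,125,61,28,215
6,490,122,63,28,214
7,493,127,62,28,217
8,505,125,64,28,218
9,494,122,63,28,215
10,514,124,62,28,215
11,513,126,62,28,217
12,518,125,63,28,217
13,515,126,65,28,221
14,512,129,62,28,220
15,517,125,65,28,219
16,525,124,64,28,218
17,517,123,64,28,216
18,527,122,62,28,213
19,527,124,61,28,214
20,535,121,63,28,212
21,520,127,64,28,220
22,544,125,63,28,216
23,531,128,62,27,218
24,540,124,60,29,214
25,557,122,64,27,214
26,544,121,64,28,213
27,546,122,62,28,213
28,551,123,63,28,214
29,561,124,63,27,215
30,547,119,59,26,205
31,537,116,58,26,200
32,537,117,62,26,205
33,541,117,58,26,201
34,546,119,56,25,201
35,545,115,60,26,202
36,546,118,60,26,204
37,574,117,60,25,203
38,548,117,59,26,203
39,548,119,60,26,205
40,551,115,59,26,201
41,560,122,61,26,209
42,581,116,56,26,198
43,559,121,61,26,208
44,567,116,59,26,201
45,562,113,60,25,199
46,553,118,58,25,202
47,566,115,56,25,197
48,568,117,61,26,204
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
# "Steady state" = fatbins 3..48, matching the summary table.
steady = [r for r in rows if int(r["idx"]) >= 3]

def mean_ms(column: str) -> float:
    return round(statistics.fmean(int(r[column]) for r in steady), 2)

summary = {c: mean_ms(c) for c in
           ("extract_ms", "patch_ms", "compile_ms", "load_ms", "attach_total_ms")}
totals = [int(r["attach_total_ms"]) for r in rows]

print(summary)                    # steady-state means over fatbins 3..48
print(min(totals), max(totals))  # 197 27965
```

This reproduces the "Mean of fatbins 3..48" row as well as the min/max totals in the summary table.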
Important caveat from the llama.cpp run
This run successfully measured attach-time stages, but the target kretprobe did not actually fire end-to-end.
Observed facts from the logs:
- The loader side created the handler for the target `rms_norm_f32` symbol.
- The `threadhist` counters stayed at `0`.
- The target log contained:

```
Failed to find NVPTX target: Unable to find target for this triple (no targets are registered)
Unable to run pass on kernel _Z12rms_norm_f32ILi1024ELb1ELb0EEvPKfPfilllfS1_lll5uint3S3_S3_S3_S1_lllS3_S3_S3_S3_: 64
```
So the timing numbers above are valid attach-path timings, but not yet a fully successful probe-hit measurement.
PyTorch test result
I also tried to get a comparable GPU attach number from PyTorch.
What I found on this machine:
- Installed wheel: `torch 2.8.0+cu128`
- `nm -D libtorch_cuda.so` shows the expected internal CUDA kernel symbols, such as `bitonicSortKVInPlace...`
- But `cuobjdump --dump-ptx libtorch_cuda.so | rg '\.version|\.entry|bitonicSortKVInPlace'` returned no PTX entries
- The `cuobjdump --dump-ptx` output only showed `Fatbin elf code` sections, which means this local wheel is effectively cubin-only for this purpose
Result:
- I could not get a valid PyTorch internal-kernel GPU attach timing from the currently installed wheel
- This is not just a missing-example issue; the current binary does not expose usable PTX for the bpftime PTX patch path
This matches the current example doc in `example/gpu/pytorch-test/README.md`, which already requires building PyTorch from source with PTX included.
For this machine, the example README's old `TORCH_CUDA_ARCH_LIST=6.1+PTX` is not the right arch target. On RTX 5090 the source-build path should instead use a 12.0 PTX target, for example `TORCH_CUDA_ARCH_LIST=12.0+PTX`.
Why this issue matters
The cold attach path is currently too slow for practical dynamic injection on real GPU applications:
- first cold attach: about `27.97 s`
- second fatbin in the same run: about `1.74 s`
- steady-state later fatbins: about `0.21 s` attach time each, still with about `0.54 s` PTX extraction each
Even on a very small 1B llama.cpp setup, attach is still dominated by patch/compile on the first relevant fatbin.
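To quantify "dominated by patch/compile", the relevant ratios can be derived directly from the summary table (a quick arithmetic check on the reported numbers, not a new measurement):

```python
# Numbers taken from the summary table above.
cold_total_ms = 27965
cold_patch_ms = 6729
cold_compile_ms = 21165
steady_total_ms = 210.48  # mean attach total over fatbins 3..48

# Share of the first cold attach spent in PTX patch + compile.
patch_compile_share = (cold_patch_ms + cold_compile_ms) / cold_total_ms
# How much slower the first cold attach is than a steady-state attach.
cold_penalty = cold_total_ms / steady_total_ms

print(f"patch+compile share of cold attach: {patch_compile_share:.1%}")  # 99.7%
print(f"cold vs steady-state attach: {cold_penalty:.0f}x")               # 133x
```

So roughly all of the first cold attach is patch+compile, and that first attach costs about two orders of magnitude more than a steady-state attach in the same run.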
Additional context
This issue is intended to track performance optimization of the current attach implementation. A follow-up item is still needed to fix the NVPTX target registration / pass execution problem so that the attach benchmark can be paired with a confirmed successful probe hit in the same run.