[+] Feat: Support dynamic attach #542

Sy0307 · 2026-01-25T17:46:42Z

Adds support for dynamic attach, using the thread_scheduling example as the test case. Other examples can be adapted in follow-up work.

Some of the changes are reformatting only.

The key technical changes are as follows:

1. CUDA late-attach bootstrap (fatbin/PTX recovery after missed registration)

Problem: When the agent is injected after CUDA registration has already happened, hooks such as __cudaRegisterFatBinary/__cudaRegisterFunction may be missed. The runtime then lacks both the fatbin/PTX material and the “host stub → kernel” metadata needed to route launches through patched code, so traces report “no data”.
Solution: After nv_attach_impl is initialized, a one-time bootstrap scans already-loaded ELF objects:
- Enumerate loaded modules with dl_iterate_phdr;
- Locate .nv_fatbin in memory and walk the fatbin wrapper(s);
- Extract PTX, apply the existing ptxpass patching pipeline, compile, and load patched modules into the driver;
- Pre-fill a kernel_name → patched CUfunction cache for launch-time routing.
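
The first step of the bootstrap can be sketched as follows. This is a minimal illustration, not the code in this PR: `LoadedModule`, `collect_module`, and `enumerate_loaded_modules` are hypothetical names, and the sketch stops at enumerating modules (locating .nv_fatbin, patching PTX, and loading modules into the driver are left out). It assumes Linux with glibc, where `dl_iterate_phdr` is declared in `<link.h>`.

```cpp
#include <link.h>    // dl_iterate_phdr, struct dl_phdr_info (Linux/glibc)
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one loaded ELF object; not a bpftime type.
struct LoadedModule {
    std::string path;           // empty usually means the main executable
    std::uintptr_t base;        // load bias reported by the dynamic loader
    std::size_t load_segments;  // number of PT_LOAD program headers
};

// Callback invoked once per loaded object by dl_iterate_phdr.
static int collect_module(struct dl_phdr_info *info, std::size_t, void *data) {
    auto *out = static_cast<std::vector<LoadedModule> *>(data);
    std::size_t loads = 0;
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        if (info->dlpi_phdr[i].p_type == PT_LOAD) {
            ++loads;
        }
    }
    out->push_back({info->dlpi_name ? info->dlpi_name : "",
                    static_cast<std::uintptr_t>(info->dlpi_addr), loads});
    return 0;  // a non-zero return would stop the iteration early
}

// Returns every ELF object currently mapped into this process.
std::vector<LoadedModule> enumerate_loaded_modules() {
    std::vector<LoadedModule> modules;
    dl_iterate_phdr(collect_module, &modules);
    return modules;
}
```

A late-attach bootstrap would then inspect each returned module for embedded fatbin data before building the kernel_name → CUfunction cache.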

2. Launch routing with late-attach fallback

The cudaLaunchKernel interception path no longer strictly depends on the registration-time func_ptr → symbol_name mapping.
When the canonical mapping is missing, the hook attempts to resolve the host stub symbol name (dladdr + ELF symbol cache) and dispatches via the cached kernel_name → CUfunction mapping, preserving a safe fallback to the original runtime launch path if patched launch fails.
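
The routing order described above can be sketched as a small decision function. This is an illustrative model under stated assumptions, not the hook in this PR: `LaunchRouter`, `Route`, and `PatchedKernel` are hypothetical names, `PatchedKernel` is an integer standing in for a CUfunction handle, and the `resolve_stub_name` callback stands in for the dladdr + ELF symbol cache lookup.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

using PatchedKernel = int;  // stand-in for a CUfunction handle (assumption)

enum class Route { PatchedViaRegistration, PatchedViaLateAttach, OriginalRuntime };

// Hypothetical router; field and method names are illustrative only.
struct LaunchRouter {
    // Registration-time mapping: host stub pointer -> symbol name.
    std::unordered_map<const void *, std::string> registered_names;
    // Cache filled by the late-attach bootstrap: kernel name -> patched kernel.
    std::unordered_map<std::string, PatchedKernel> patched_by_name;
    // Fallback resolver; returns true and fills `out` on success.
    std::function<bool(const void *, std::string &)> resolve_stub_name;
    // Attempts the patched launch; returns false on failure.
    std::function<bool(PatchedKernel)> launch_patched;

    Route route(const void *func_ptr) const {
        std::string name;
        bool have_name = false;
        Route via = Route::PatchedViaRegistration;

        auto reg = registered_names.find(func_ptr);
        if (reg != registered_names.end()) {
            name = reg->second;
            have_name = true;
        } else if (resolve_stub_name && resolve_stub_name(func_ptr, name)) {
            have_name = true;
            via = Route::PatchedViaLateAttach;
        }

        if (have_name) {
            auto hit = patched_by_name.find(name);
            if (hit != patched_by_name.end() && launch_patched &&
                launch_patched(hit->second)) {
                return via;
            }
        }
        // No mapping, no cached kernel, or the patched launch failed:
        // hand the call back to the original runtime launch path.
        return Route::OriginalRuntime;
    }
};
```

Keeping the original runtime path as the final branch means a missing symbol or a failed patched launch degrades to an untraced launch rather than a failed one.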

3. Shared-memory session/epoch protocol (control-plane/data-plane consistency)

Root cause of repeat-trace instability (“No data”, wrong data, random crashes): control-plane state in shm (handlers/maps/links) changes, while the injected target’s data-plane state (CUDA IPC pointers, device-side globals, patched module state) still points to the previous snapshot.
Mechanism: This PR introduces epoch_seq in bpftime_maps_shm with seqlock semantics:
- Odd: Server is mutating/resetting the snapshot;
- Even: Stable snapshot; session_id = epoch_seq / 2.
Process: The server advances epoch_seq and clears handlers at session start. The agent observes epoch changes and performs an ordered rebind:
- Detach existing links → clear instantiated bookkeeping → re-instantiate from the new stable shm snapshot.
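
The odd/even convention can be sketched with a single atomic counter. This is a simplified in-process model, not the shared-memory layout in this PR: `EpochHeader`, `AgentView`, `begin_reset`, `end_reset`, and `poll_epoch` are hypothetical names, and the actual detach/clear/re-instantiate work is reduced to a comment.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical header holding the epoch counter (in the PR it lives in shm).
struct EpochHeader {
    std::atomic<std::uint64_t> epoch_seq{0};
};

inline bool is_stable(std::uint64_t seq) { return (seq & 1) == 0; }
inline std::uint64_t session_id(std::uint64_t seq) { return seq / 2; }

// Server side: even -> odd marks the snapshot as being mutated.
inline void begin_reset(EpochHeader &h) {
    h.epoch_seq.fetch_add(1, std::memory_order_acq_rel);
}

// Server side: odd -> even publishes the new stable snapshot.
inline void end_reset(EpochHeader &h) {
    h.epoch_seq.fetch_add(1, std::memory_order_release);
}

// Agent-side bookkeeping (illustrative).
struct AgentView {
    std::uint64_t bound_session = 0;
    int rebinds = 0;
};

// Returns true if the agent rebound to a new stable session.
inline bool poll_epoch(const EpochHeader &h, AgentView &agent) {
    const std::uint64_t before = h.epoch_seq.load(std::memory_order_acquire);
    if (!is_stable(before)) {
        return false;  // server is mid-reset; try again later
    }
    if (session_id(before) == agent.bound_session) {
        return false;  // already bound to this snapshot
    }
    // Detach links, clear bookkeeping, re-instantiate from shm would go here.
    const std::uint64_t after = h.epoch_seq.load(std::memory_order_acquire);
    if (after != before) {
        return false;  // snapshot changed underneath us; discard and retry
    }
    agent.bound_session = session_id(before);
    ++agent.rebinds;
    return true;
}
```

Re-reading the counter after the rebind work is the usual seqlock check: if the value moved, the agent read a snapshot that was being replaced and should not commit to it.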

4. Single-agent control plane (avoid multi-copy state splits)

Issue: Previously, repeated tracing could re-inject the agent and accidentally create multiple in-process agent instances, splitting state and making failures hard to diagnose.
Solution: This PR adds a per-process agent control endpoint and uses IPC for refresh/detach/status whenever possible; injection becomes a fallback when IPC is not available.
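
The "IPC first, injection as fallback" policy can be sketched as below. This is a hedged illustration only: `AgentController`, `Command`, and `Path` are hypothetical names, and the two callbacks stand in for the real control-endpoint client and injector, whose interfaces are not shown in this description.

```cpp
#include <functional>

enum class Command { Refresh, Detach, Status };
enum class Path { Ipc, Injection, Failed };

// Hypothetical controller; callbacks are placeholders for the real transports.
struct AgentController {
    // Returns true if an agent already running in `pid` handled the command.
    std::function<bool(int pid, Command cmd)> send_ipc;
    // Injects a fresh agent into `pid`; returns true on success.
    std::function<bool(int pid)> inject_agent;

    Path request(int pid, Command cmd) const {
        // Prefer the existing in-process agent so no second copy is created.
        if (send_ipc && send_ipc(pid, cmd)) {
            return Path::Ipc;
        }
        // Only when no endpoint answers do we fall back to injection.
        if (inject_agent && inject_agent(pid)) {
            return Path::Injection;
        }
        return Path::Failed;
    }
};
```

Because injection is reached only after the IPC attempt fails, a process that already hosts an agent is never injected a second time in this model.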

[+] Feat: Support dynamic attach

7407a1c

pull-request-size bot added the size/XXL label Jan 25, 2026

[~] Revert: rollback some comments modifications

d3d41fe

Sy0307 marked this pull request as draft January 26, 2026 05:57

This was referenced Feb 1, 2026

Monthly Org Report (2026-01-01..2026-01-31) eunomia-bpf/eunomia.dev#69

Open

Weekly Org Report (2026-01-19..2026-01-25) eunomia-bpf/eunomia.dev#70

Open

yunwei37 marked this pull request as ready for review February 3, 2026 19:55
