[WIP] rocprofiler-sdk/rocr-runtime: fix HIP graph kernel-trace event loss#4221
[WIP] rocprofiler-sdk/rocr-runtime: fix HIP graph kernel-trace event loss#4221
Conversation
|
Added the HIP graph kernel-trace before/after artifacts for the strongest local comparison case. Case:
Secret gist: Direct SVG artifact: Summary:
|
43bed7b to
862106a
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces an experimental “HotSwap” stack across ROCR runtime, HIP (CLR), and rocprofiler-sdk aimed at reducing kernel-trace event loss in HIP graph workloads by enabling load-time ISA retargeting/transpilation, deferring signal retirement until queues are idle, and coalescing async wakeups.
Changes:
- Add optional ROCR HotSwap loader support (ISA override, retarget/transpile hooks, rewrite-rule engine + LLVM MC plumbing).
- Update rocprofiler-sdk queue interception to defer signal destruction and reduce async-thread overhead, plus minor formatting/compat tweaks.
- Add ROCR async wake coalescing and supporting tests/scripts for HotSwap components.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp | Integrates HotSwap ISA override/retarget/transpile/rewrite into code object loading path |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap.hpp | Public HotSwap API declarations (enablement, patch/retarget/rewrite) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap.cpp | Implements ELF parsing + LLVM MC based retarget/rewrite logic |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap_rules.hpp | JSON rules data model for rewrite engine |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap_rules.cpp | Minimal JSON parser + rules caching for HotSwap |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/transpiler.hpp | Cross-family transpiler API + stats struct |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/trampoline.hpp | Trampoline interfaces for size-changing rewrites |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/trampoline.cpp | Trampoline assembly + s_branch/s_nop encoding helpers |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler.py | Standalone mnemonic translation checks (llvm-mc) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler_e2e.py | End-to-end asm→disasm→translate→asm validation script |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_rules.json | Example rule file for rewrite engine testing |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/hotswap_test.cpp | Standalone C++ tests for rules + trampoline encoding |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/pycache/test_transpiler_e2e.cpython-312.pyc | Adds compiled artifact (should not be committed) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/CMakeLists.txt | Standalone CMake build for HotSwap library |
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp | Coalesce async wake requests + reserve async event buffers |
| projects/rocr-runtime/runtime/hsa-runtime/core/inc/runtime.h | Adds wake coalescing + reserve API declarations |
| projects/rocr-runtime/runtime/hsa-runtime/CMakeLists.txt | Adds ROCR_ENABLE_HOTSWAP option and LLVM linkage |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/registration.cpp | Gates late register propagation behind env var |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Adds deferred signal retirement + queue state atomics |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Implements deferred destruction + async handler bookkeeping changes |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp | Marks queue to_destroy before sync/destroy |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/details/fmt.hpp | Improves memory-copy-op formatting for extended layouts |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/context/correlation_id.cpp | Switches latest correlation-id storage to std::vector |
| projects/clr/hipamd/src/hip_fatbin.cpp | Adds HotSwap-driven ISA override/retarget fallback for fatbin extraction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/hotswap/trampoline.cpp
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler.py
Outdated
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler_e2e.py
Outdated
Show resolved
Hide resolved
|
Clean PR-only re-verification on the HIP graph reproducer: I rebuilt from a clean worktree containing only the 4 PR commits and reran the reproducer against those staged artifacts. One correction from the earlier local note: the first rerun was still mixing stacks because the launcher could fall back into the venv ROCm tree unless the staged
Conclusion: this PR improves the reproducer enough to make the This comment supersedes the earlier local verification note that was based on a mixed stack. |
|
Checked the Copilot review threads after the branch rewrite. All 10 Copilot comments are attached to the earlier hotswap/CLR diff, not to the current PR contents. The live PR file list now contains only these 8 files:
So there are no current Copilot findings on the active patch set to address in code. I resolved the stale threads to reduce review noise. |
| private: | ||
| AsyncEventsInfo* info_; | ||
| os::Thread thread_; | ||
| std::atomic<bool> wake_pending_; |
There was a problem hiding this comment.
I think this change is unnecessary. The call below is just an atomic operation and KFD call that should wakeup the async thread is already under extra protection
hsa_signal_handle(asyncInfo->control.wake)->StoreRelease(1);
void InterruptSignal::StoreRelease(hsa_signal_value_t value) {
atomic::Store(&signal_.value, int64_t(value), std::memory_order_release);
SetEvent();
}
void InterruptSignal::SetEvent() {
if (InWaiting()) HSAKMT_CALL(hsaKmtSetEvent(event_));
}
There was a problem hiding this comment.
InWaiting() only gates whether SetEvent() makes the KFD wake syscall at that instant. It does not coalesce repeated logical wake requests while the async thread is still working through a large pending batch.
The issue we were chasing here was not just too many KFD wake syscalls while the thread is asleep, but repeated re-wakeup pressure during HIP-graph kernel-trace bursts where registrations were arriving faster than the async loop could drain them. With RequestWake()/ResetWake(), we allow at most one outstanding wake request until the loop reaches a point where it has observed and drained the current batch, then we clear the pending bit and permit the next wake.
So the intended difference is:
- InWaiting() avoids an unnecessary kernel wake when the thread is not blocked in KFD.
- wake_pending_ avoids repeatedly signaling the same outstanding work before the async loop has had a chance to consume it.
In our local HIP-graph kernel-trace repro this was not just theoretical; the coalescing reduced async wake pressure materially. I can add a short code comment in RequestWake() as well if that would make the intent clearer.
Summary
This PR contains the local HIP graph / rocprof kernel-trace investigation stack:
rocprofiler-sdkrocprofiler-sdkValidation
The strongest local HIP graph reproducer used for validation was:
hip_graph_bubble_reproNUM_KERNELS=2000NUM_ITERATIONS=200--kernel-trace--output-format csvI also uploaded a secret gist with the before/after analysis and screenshot artifact. I will add that gist link in a PR comment after creation.