
[WIP] rocprofiler-sdk/rocr-runtime: fix HIP graph kernel-trace event loss#4221

Draft
powderluv wants to merge 4 commits into develop from users/powderluv/hip-graph-kernel-trace-fix

@powderluv

Summary

This PR contains the local HIP graph / rocprof kernel-trace investigation stack:

  • ROCm 7.13 memory-copy-op formatter compatibility in rocprofiler-sdk
  • late register propagation gated behind an env var
  • queue signal retirement deferred until queue idle in rocprofiler-sdk
  • ROCR async signal wake coalescing and wake reset fixes
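The third item, deferring signal retirement until the queue is idle, can be pictured with a minimal sketch. Everything here is hypothetical (the `Signal` struct, the `DeferredRetirementQueue` class, and its method names are illustrative and not taken from rocprofiler-sdk); it only shows the general idea of parking completed signals and destroying them once no dispatch is in flight.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a profiler-owned completion signal.
struct Signal {
  int id;
};

// Sketch: completed signals are parked while dispatches are still in flight
// and are destroyed only once the in-flight count drops to zero.
class DeferredRetirementQueue {
 public:
  explicit DeferredRetirementQueue(std::function<void(Signal)> destroy)
      : destroy_(std::move(destroy)) {}

  void OnDispatchSubmitted() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++in_flight_;
  }

  // Assumes every completion was preceded by a matching submission.
  void OnDispatchCompleted(Signal signal) {
    std::vector<Signal> to_destroy;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push_back(signal);
      if (--in_flight_ == 0) to_destroy.swap(pending_);
    }
    // Destroy outside the lock so the callback cannot deadlock on mutex_.
    for (const Signal& s : to_destroy) destroy_(s);
  }

  std::size_t PendingCount() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return pending_.size();
  }

 private:
  std::function<void(Signal)> destroy_;
  mutable std::mutex mutex_;
  std::size_t in_flight_ = 0;
  std::vector<Signal> pending_;
};
```

With two dispatches submitted, the first completion leaves one signal parked and nothing destroyed; the second completion makes the queue idle and destroys both in one batch.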

Validation

The strongest local HIP graph reproducer used for validation was:

  • hip_graph_bubble_repro
  • NUM_KERNELS=2000
  • NUM_ITERATIONS=200
  • --kernel-trace
  • --output-format csv

I also uploaded a secret gist with the before/after analysis and screenshot artifact. I will add that gist link in a PR comment after creation.

Copilot AI review requested due to automatic review settings March 19, 2026 15:13
@powderluv powderluv requested review from a team as code owners March 19, 2026 15:13
@powderluv
Author

Added the HIP graph kernel-trace before/after artifacts for the strongest local comparison case.

Case:

  • hip_graph_bubble_repro
  • NUM_KERNELS=2000
  • NUM_ITERATIONS=200
  • --kernel-trace
  • --output-format csv

Secret gist:

Direct SVG artifact:

Summary:

  • Unpatched stock stack: run aborts and produces no trace_kernel_trace.csv
  • Stock rocprofv3 + local ROCR async-wake fix: run completes and produces trace_kernel_trace.csv with 400000 rows / 400000 unique dispatch IDs
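The row and unique-ID counts can be reproduced with a small checker along these lines. This is a sketch under assumptions: it takes the column name `Dispatch_Id` as quoted elsewhere in this thread, treats double quotes as simple field delimiters, and the helper names `SplitCsvLine`, `TraceCounts`, and `CountDispatches` are made up for illustration.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Split one CSV line on commas that are outside double quotes.
// Quote characters are dropped; kernel names may contain commas.
std::vector<std::string> SplitCsvLine(const std::string& line) {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (char c : line) {
    if (c == '"') {
      in_quotes = !in_quotes;
    } else if (c == ',' && !in_quotes) {
      fields.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  fields.push_back(current);
  return fields;
}

struct TraceCounts {
  std::size_t rows = 0;
  std::size_t unique_dispatch_ids = 0;
};

// Count data rows and distinct values in the Dispatch_Id column.
TraceCounts CountDispatches(std::istream& csv) {
  TraceCounts counts;
  std::string line;
  if (!std::getline(csv, line)) return counts;  // empty input
  const std::vector<std::string> header = SplitCsvLine(line);
  std::size_t column = header.size();  // "not found" sentinel
  for (std::size_t i = 0; i < header.size(); ++i) {
    if (header[i] == "Dispatch_Id") column = i;
  }
  std::unordered_set<std::string> ids;
  while (std::getline(csv, line)) {
    if (line.empty()) continue;
    const std::vector<std::string> fields = SplitCsvLine(line);
    ++counts.rows;
    if (column < fields.size()) ids.insert(fields[column]);
  }
  counts.unique_dispatch_ids = ids.size();
  return counts;
}
```

For the case above, a complete trace should report both counts equal to NUM_KERNELS × NUM_ITERATIONS = 2000 × 200 = 400000; a lower unique count than row count would indicate duplicated dispatch IDs, and a lower row count would indicate lost events.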

@powderluv powderluv force-pushed the users/powderluv/hip-graph-kernel-trace-fix branch from 43bed7b to 862106a March 19, 2026 15:23
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces an experimental “HotSwap” stack across ROCR runtime, HIP (CLR), and rocprofiler-sdk aimed at reducing kernel-trace event loss in HIP graph workloads by enabling load-time ISA retargeting/transpilation, deferring signal retirement until queues are idle, and coalescing async wakeups.

Changes:

  • Add optional ROCR HotSwap loader support (ISA override, retarget/transpile hooks, rewrite-rule engine + LLVM MC plumbing).
  • Update rocprofiler-sdk queue interception to defer signal destruction and reduce async-thread overhead, plus minor formatting/compat tweaks.
  • Add ROCR async wake coalescing and supporting tests/scripts for HotSwap components.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| projects/rocr-runtime/runtime/hsa-runtime/loader/executable.cpp | Integrates HotSwap ISA override/retarget/transpile/rewrite into code object loading path |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap.hpp | Public HotSwap API declarations (enablement, patch/retarget/rewrite) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap.cpp | Implements ELF parsing + LLVM MC based retarget/rewrite logic |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap_rules.hpp | JSON rules data model for rewrite engine |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/hotswap_rules.cpp | Minimal JSON parser + rules caching for HotSwap |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/transpiler.hpp | Cross-family transpiler API + stats struct |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/trampoline.hpp | Trampoline interfaces for size-changing rewrites |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/trampoline.cpp | Trampoline assembly + s_branch/s_nop encoding helpers |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler.py | Standalone mnemonic translation checks (llvm-mc) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_transpiler_e2e.py | End-to-end asm→disasm→translate→asm validation script |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/test_rules.json | Example rule file for rewrite engine testing |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/hotswap_test.cpp | Standalone C++ tests for rules + trampoline encoding |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/tests/pycache/test_transpiler_e2e.cpython-312.pyc | Adds compiled artifact (should not be committed) |
| projects/rocr-runtime/runtime/hsa-runtime/hotswap/CMakeLists.txt | Standalone CMake build for HotSwap library |
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp | Coalesce async wake requests + reserve async event buffers |
| projects/rocr-runtime/runtime/hsa-runtime/core/inc/runtime.h | Adds wake coalescing + reserve API declarations |
| projects/rocr-runtime/runtime/hsa-runtime/CMakeLists.txt | Adds ROCR_ENABLE_HOTSWAP option and LLVM linkage |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/registration.cpp | Gates late register propagation behind env var |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Adds deferred signal retirement + queue state atomics |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Implements deferred destruction + async handler bookkeeping changes |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp | Marks queue to_destroy before sync/destroy |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/details/fmt.hpp | Improves memory-copy-op formatting for extended layouts |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/context/correlation_id.cpp | Switches latest correlation-id storage to std::vector |
| projects/clr/hipamd/src/hip_fatbin.cpp | Adds HotSwap-driven ISA override/retarget fallback for fatbin extraction |


@powderluv
Author

Clean PR-only re-verification on the HIP graph reproducer:

I rebuilt from a clean worktree containing only the 4 PR commits and reran the reproducer against those staged artifacts.

One correction to the earlier local note: the first rerun was still mixing stacks. The launcher could fall back into the venv ROCm tree unless the staged rocprofiler-sdk and staged ROCr libraries were placed first, with only the venv sysdeps/LLVM directories appended for dependencies. With that corrected launch environment, the clean PR stack behaves like this:

  • 1000 kernels x 300 iterations: passes and writes both kernel trace and agent info CSVs. trace_kernel_trace.csv has 300000 rows and 300000 unique Dispatch_Id values.
  • 2000 kernels x 200 iterations: still fails after 50 iterations with repeated HSA_STATUS_ERROR_INVALID_AGENT from hsa_amd_profiling_get_dispatch_time, then malloc_consolidate(): invalid chunk size; no CSV output is produced.

Conclusion: this PR improves the reproducer enough to make the 300k-dispatch case succeed, but it does not fully fix the strongest 400k-dispatch case yet.

This comment supersedes the earlier local verification note that was based on a mixed stack.

@powderluv powderluv marked this pull request as draft March 19, 2026 15:54
@powderluv
Author

Checked the Copilot review threads after the branch rewrite.

All 10 Copilot comments are attached to the earlier hotswap/CLR diff, not to the current PR contents. The live PR file list now contains only these 8 files:

  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/context/correlation_id.cpp
  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/details/fmt.hpp
  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp
  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp
  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp
  • projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/registration.cpp
  • projects/rocr-runtime/runtime/hsa-runtime/core/inc/runtime.h
  • projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp

So there are no current Copilot findings on the active patch set to address in code. I resolved the stale threads to reduce review noise.

 private:
  AsyncEventsInfo* info_;
  os::Thread thread_;
  std::atomic<bool> wake_pending_;
Contributor

@gandryey gandryey Mar 22, 2026


I think this change is unnecessary. The call below is just an atomic operation, and the KFD call that should wake up the async thread is already under extra protection:
hsa_signal_handle(asyncInfo->control.wake)->StoreRelease(1);

void InterruptSignal::StoreRelease(hsa_signal_value_t value) {
  atomic::Store(&signal_.value, int64_t(value), std::memory_order_release);
  SetEvent();
}

void InterruptSignal::SetEvent() {
  if (InWaiting()) HSAKMT_CALL(hsaKmtSetEvent(event_));
}

Author


InWaiting() only gates whether SetEvent() makes the KFD wake syscall at that instant. It does not coalesce repeated logical wake requests while the async thread is still working through a large pending batch.

The issue we were chasing here was not just too many KFD wake syscalls while the thread is asleep, but repeated re-wakeup pressure during HIP-graph kernel-trace bursts where registrations were arriving faster than the async loop could drain them. With RequestWake()/ResetWake(), we allow at most one outstanding wake request until the loop reaches a point where it has observed and drained the current batch, then we clear the pending bit and permit the next wake.

So the intended difference is:

  • InWaiting() avoids an unnecessary kernel wake when the thread is not blocked in KFD.
  • wake_pending_ avoids repeatedly signaling the same outstanding work before the async loop has had a chance to consume it.

In our local HIP-graph kernel-trace repro this was not just theoretical; the coalescing reduced async wake pressure materially. I can add a short code comment in RequestWake() as well if that would make the intent clearer.
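To make the intended behavior concrete, here is a minimal sketch of the coalescing flag. The names `RequestWake()`, `ResetWake()`, and `wake_pending_` come from the discussion above; the `WakeCoalescer` wrapper class, the method bodies, and the memory orderings are assumptions for illustration and may differ from the actual patch.

```cpp
#include <atomic>

// Sketch only: bodies and memory orderings are assumptions, not the PR code.
class WakeCoalescer {
 public:
  // Returns true only for the caller that flips the flag from false to true.
  // That caller is the one that should signal the async thread; every other
  // caller sees a wake already outstanding and skips the signal.
  bool RequestWake() {
    return !wake_pending_.exchange(true, std::memory_order_acq_rel);
  }

  // Called by the async loop once it has observed the current batch, so the
  // next RequestWake() is allowed to signal again.
  void ResetWake() { wake_pending_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> wake_pending_{false};
};
```

In a design like this, the async loop would typically clear the flag before rescanning its work list, so that a registration arriving during the drain still produces a fresh wake instead of being lost.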


3 participants