Skip to content

[rocprofiler-sdk] Implement HSA profiler serialization with ready_signal support and improve testing infrastructure#192

Closed
bwelton wants to merge 5 commits intodevelopfrom
bewelton/hsa_serializer
Closed

[rocprofiler-sdk] Implement HSA profiler serialization with ready_signal support and improve testing infrastructure#192
bwelton wants to merge 5 commits intodevelopfrom
bewelton/hsa_serializer

Conversation

@bwelton
Copy link
Contributor

@bwelton bwelton commented Aug 8, 2025

This commit introduces a comprehensive dual-signal serialization mechanism for HSA kernel execution profiling, ensuring accurate hardware counter measurements by enforcing strict single-kernel execution across all profiled queues.

Key Changes:

HSA Profiler Serialization Enhancements:

  • Add ready_signal support to track when queues reach blocking barriers
  • Implement dual-signal control mechanism (block_signal + ready_signal)
  • Add queue_ready() handler for barrier notification events
  • Implement register_queue_ready_handler() for async signal management
  • Add select_and_grant_ready_queue() to centralize queue selection logic
  • Rename _dispatch_ready to _enqueued_kernels for clarity
  • Add _ready_queues multiset to track barrier-ready queues
  • Add _seen_queues to prevent duplicate handler registration
  • Ensure only queues that have signaled ready can execute
  • Add comprehensive trace logging for debugging serialization issues

Build System Improvements:

  • Centralize GPU targets configuration across CMake build system
  • Add ROCPROFILER_GPU_TARGETS option with default support for: gfx906, gfx908, gfx90a, gfx942, gfx950, gfx1100, gfx1101, gfx1102
  • Simplify GPU target management for tests and samples

Testing Infrastructure:

  • Add DualStreamExecutor class for multi-stream kernel testing
  • Add check_order_kernel to verify serialization correctness
  • Enhance device serialization tests with barrier validation
  • Fix test deadlocks in hsa_code_object_app

Debugging and Observability:

  • Add comprehensive to_string() methods for Queue and profiler_serializer
  • Include signal values (block_signal, ready_signal) in debug output
  • Add _ready_queues information to serializer state dumps
  • Enhance logging with detailed state transitions
  • Add trace-level logging for serialization flow analysis

Bug Fixes:

  • Fix potential deadlock when no queues signal ready
  • Fix race condition in ready_signal handler registration
  • Ensure proper cleanup in destroy_queue with error reporting
  • Fix const correctness issues in queue controller

This implementation ensures deterministic kernel execution ordering for accurate profiling while maintaining hardware-level synchronization through HSA barrier packets. The ready_signal mechanism provides visibility into queue readiness state, preventing deadlocks and ensuring efficient queue scheduling.

@bwelton bwelton requested review from jrmadsen and t-tye as code owners August 8, 2025 06:05
@bwelton bwelton changed the title Implement HSA profiler serialization with ready_signal support and improve testing infrastructure [rocprofiler-sdk] Implement HSA profiler serialization with ready_signal support and improve testing infrastructure Aug 8, 2025
ywang103-amd pushed a commit that referenced this pull request Aug 8, 2025
…used with HSA calls (#192)

* Force HSA_AMD_MEMORY_POOL_EXECUTABLE_FLAG  value to be used with HSA calls

Fix for CI

* More tweaks

* Increase reproducible-runtime kernel sleep granularity

* Fix data race in synchronous device counter collection sample

* Update device counting service

- add get_active_context function

---------

Co-authored-by: Benjamin Welton <bewelton@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>

[ROCm/rocprofiler-sdk commit: 080b2ba]
ajanicijamd pushed a commit to ajanicijamd/rocm-systems that referenced this pull request Aug 8, 2025
- Fix overlapping VCN and JPEG activity values in Perfetto output.
- Modify the storage of the activity values to be more efficient.

[ROCm/rocprofiler-systems commit: 99a411f]
@bwelton bwelton force-pushed the bewelton/hsa_serializer branch from debe4a7 to cd72398 Compare August 8, 2025 19:27
@bwelton
Copy link
Contributor Author

bwelton commented Aug 8, 2025

Review paid: #195 #196

@bwelton
Copy link
Contributor Author

bwelton commented Aug 8, 2025

Large PR because of the migration to systems, which destroyed the stacked PR.

{
printf("[error] Expected %i but got %i\n", expected, *actual);
}
assert(*actual == expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe needs NDEBUG?

Copy link
Contributor Author

@bwelton bwelton Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if it fails, we want the CI to trigger and fail this sample/test (this should never fail).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but unless we remove NDEBUG, this check will not be there in Release builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thats what you are asking about, changed. Thanks for the comment here!

jayhawk-commits pushed a commit that referenced this pull request Aug 8, 2025
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com

[ROCm/rdc commit: b8f0ab3]
@ammarwa
Copy link
Collaborator

ammarwa commented Aug 13, 2025

Code Coverage Report

Code Coverage Report

Tests Only

code coverage tests.png

Samples Only

code coverage samples.png

Tests + Samples

code coverage all.png

{
printf("[error] Expected %i but got %i\n", expected, *actual);
}
assert(*actual == expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but unless we remove NDEBUG, this check will not be there in Release builds.

HIP_CALL(hipSetDevice(0));
DualStreamExecutor executor(0);
*no_opt_0 = 0;
srand((unsigned int) time(NULL));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have a reproducible seed here. Otherwise it becomes impossible to debug if there is a random error in CI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done


// Add a second barrier packet for memory synchronization so that the
// block_signal completion update is visible to CP
ret.push_back(CreateBarrierPacket(nullptr, nullptr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont get this one. Why would the block signal not be visible to CP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be needed, this was left in from testing, removed.

// Finalize queue destruction
auto* controller = CHECK_NOTNULL(get_queue_controller());
controller->set_queue_state(queue_state::to_destroy, id);
controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, 0);
controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, RELEASE_BARRIER);

// Notify any waiters on the ready condition variable
{
std::lock_guard<std::mutex> lock(queue.cv_mutex);
queue.cv_ready_signal.notify_all();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anyone using this. I think we can remove.

Copilot AI review requested due to automatic review settings September 4, 2025 19:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a comprehensive dual-signal serialization mechanism for HSA kernel execution profiling to ensure accurate hardware counter measurements by enforcing strict single-kernel execution across all profiled queues. It also centralizes GPU target configuration and enhances testing infrastructure.

Key Changes:

  • Added dual-signal control mechanism (block_signal + ready_signal) for HSA profiler serialization
  • Centralized GPU targets configuration across CMake build system with ROCPROFILER_GPU_TARGETS option
  • Enhanced testing infrastructure with DualStreamExecutor class and device serialization validation

Reviewed Changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
projects/rocprofiler-sdk/tests/common/CMakeLists.txt Replaced hardcoded GPU targets with centralized ROCPROFILER_GPU_TARGETS variable
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.hpp Added to_string() method and fmt formatter for debugging, plus debug print function
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp Implemented to_string() method and debug state printing functionality
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp Added to_string() method and fmt formatter for Queue debugging
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp Implemented to_string() method with detailed signal state information
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/profile_serializer.hpp Major redesign with comprehensive documentation and new ready_signal support
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/profile_serializer.cpp Complete rewrite implementing dual-signal serialization mechanism
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/hsa_barrier.hpp Added to_string() method and fmt formatter for barrier debugging
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp Implemented to_string() method for barrier state visualization
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/evaluate_ast.cpp Enhanced error message with status code for AQL packet decoding failures
projects/rocprofiler-sdk/samples/openmp_target/CMakeLists.txt Updated to use centralized ROCPROFILER_GPU_TARGETS variable
projects/rocprofiler-sdk/samples/counter_collection/main.cpp Added progress indicators for long-running kernel launches
projects/rocprofiler-sdk/samples/counter_collection/device_serialized_main.cpp Added DualStreamExecutor class and kernel order validation testing
projects/rocprofiler-sdk/cmake/rocprofiler_options.cmake Added ROCPROFILER_GPU_TARGETS option with default GPU target list

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 419 to 426
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using const_cast to remove const qualifier is generally a code smell and should be avoided. Consider modifying the Queue::intercept_queue() method to return a non-const pointer if it's safe to do so, or verify that the HSA API function doesn't modify the queue data.

Suggested change
// Set up async signal handler for ready_signal when it equals -1
// The handler will be called when ready_signal transitions to -1
hsa_status_t status = controller.get_ext_table().hsa_amd_signal_async_handler_fn(
queue.ready_signal,
HSA_SIGNAL_CONDITION_EQ,
-1, // Trigger value
profiler_serializer_ready_signal_handler,
const_cast<hsa_queue_t*>(queue.intercept_queue()));
queue.intercept_queue());

Copilot uses AI. Check for mistakes.
Comment on lines 54 to 59
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manipulating NDEBUG to force assert behavior in release builds is problematic and can cause undefined behavior. Consider using a custom assertion macro or explicit error handling instead of redefining system macros.

Copilot uses AI. Check for mistakes.
Comment on lines 172 to 173
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using the global srand() function can affect other parts of the program. Consider using a local random number generator (e.g., std::mt19937) with explicit state management for better isolation.

Copilot uses AI. Check for mistakes.
@bwelton bwelton force-pushed the bewelton/hsa_serializer branch 4 times, most recently from 33cecf6 to cde1a90 Compare September 10, 2025 21:50
bwelton and others added 5 commits September 11, 2025 01:05
…prove testing infrastructure

This commit introduces a comprehensive dual-signal serialization mechanism for HSA kernel
execution profiling, ensuring accurate hardware counter measurements by enforcing strict
single-kernel execution across all profiled queues.

Key Changes:

HSA Profiler Serialization Enhancements:
- Add ready_signal support to track when queues reach blocking barriers
- Implement dual-signal control mechanism (block_signal + ready_signal)
- Add queue_ready() handler for barrier notification events
- Implement register_queue_ready_handler() for async signal management
- Add select_and_grant_ready_queue() to centralize queue selection logic
- Rename _dispatch_ready to _enqueued_kernels for clarity
- Add _ready_queues multiset to track barrier-ready queues
- Add _seen_queues to prevent duplicate handler registration
- Ensure only queues that have signaled ready can execute
- Add comprehensive trace logging for debugging serialization issues

Build System Improvements:
- Centralize GPU targets configuration across CMake build system
- Add ROCPROFILER_GPU_TARGETS option with default support for:
  gfx906, gfx908, gfx90a, gfx942, gfx950, gfx1100, gfx1101, gfx1102
- Simplify GPU target management for tests and samples

Testing Infrastructure:
- Add DualStreamExecutor class for multi-stream kernel testing
- Add check_order_kernel to verify serialization correctness
- Enhance device serialization tests with barrier validation
- Fix test deadlocks in hsa_code_object_app

Debugging and Observability:
- Add comprehensive to_string() methods for Queue and profiler_serializer
- Include signal values (block_signal, ready_signal) in debug output
- Add _ready_queues information to serializer state dumps
- Enhance logging with detailed state transitions
- Add trace-level logging for serialization flow analysis

Bug Fixes:
- Fix potential deadlock when no queues signal ready
- Fix race condition in ready_signal handler registration
- Ensure proper cleanup in destroy_queue with error reporting
- Fix const correctness issues in queue controller

This implementation ensures deterministic kernel execution ordering for accurate
profiling while maintaining hardware-level synchronization through HSA barrier
packets. The ready_signal mechanism provides visibility into queue readiness
state, preventing deadlocks and ensuring efficient queue scheduling.
@bwelton bwelton force-pushed the bewelton/hsa_serializer branch from df96b7c to 51444b6 Compare September 11, 2025 08:05
Copy link
Contributor

@jrmadsen jrmadsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only reviewed the cmake changes and they are invalid.


set(DEFAULT_GPU_TARGETS "gfx906" "gfx908" "gfx90a" "gfx942" "gfx950" "gfx1100" "gfx1101"
"gfx1102")
set(DEFAULT_GPU_TARGETS ${ROCPROFILER_GPU_TARGETS})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalid when samples are build against an install -- ROCPROFILER_GPU_TARGETS is not set. Furthermore, all of these targets may not be supported by OpenMP


set(OPENMP_GPU_TARGETS
"${DEFAULT_GPU_TARGETS}"
"${ROCPROFILER_GPU_TARGETS}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalid when tests are build against an install of rocprofiler-sdk -- ROCPROFILER_GPU_TARGETS is not set. Furthermore, not all these targets may be supported by OpenMP.

"gfx1100"
"gfx1101"
"gfx1102")
set(DEFAULT_GPU_TARGETS ${ROCPROFILER_GPU_TARGETS})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalid when samples are build against an install of rocprofiler-sdk -- ROCPROFILER_GPU_TARGETS is not set.

@bgopesh
Copy link
Contributor

bgopesh commented Nov 4, 2025

Not needed anymore.

@bgopesh bgopesh closed this Nov 4, 2025
ammallya pushed a commit that referenced this pull request Nov 17, 2025
…MI_STATUS_NO_PERM (#192)

* [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM

Changes:
- Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not
  returning AMDSMI_STATUS_NO_PERM
- Changed amd-smi partition --accelerator to provide user with a warning
  if users does not use sudo or root permissions.
- Updated changelog for fixes planned for 6.4.1 release

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
ammallya pushed a commit that referenced this pull request Nov 18, 2025
…MI_STATUS_NO_PERM (#192)

* [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM

Changes:
- Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not
  returning AMDSMI_STATUS_NO_PERM
- Changed amd-smi partition --accelerator to provide user with a warning
  if users does not use sudo or root permissions.
- Updated changelog for fixes planned for 6.4.1 release

Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: 0402bb4]
ammallya pushed a commit that referenced this pull request Nov 21, 2025
…MI_STATUS_NO_PERM (#192)

* [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM

Changes:
- Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not
  returning AMDSMI_STATUS_NO_PERM
- Changed amd-smi partition --accelerator to provide user with a warning
  if users does not use sudo or root permissions.
- Updated changelog for fixes planned for 6.4.1 release

Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: 0402bb4]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants