[rocprofiler-sdk] Implement HSA profiler serialization with ready_signal support and improve testing infrastructure#192
[rocprofiler-sdk] Implement HSA profiler serialization with ready_signal support and improve testing infrastructure#192
Conversation
…used with HSA calls (#192) * Force HSA_AMD_MEMORY_POOL_EXECUTABLE_FLAG value to be used with HSA calls Fix for CI * More tweaks * Increase reproducible-runtime kernel sleep granularity * Fix data race in synchronous device counter collection sample * Update device counting service - add get_active_context function --------- Co-authored-by: Benjamin Welton <bewelton@amd.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com> [ROCm/rocprofiler-sdk commit: 080b2ba]
- Fix overlapping VCN and JPEG activity values in Perfetto output. - Modify the storage of the activity values to be more efficient. [ROCm/rocprofiler-systems commit: 99a411f]
debe4a7 to
cd72398
Compare
|
Large PR because of the migration to systems, which destroyed the stacked PR. |
| { | ||
| printf("[error] Expected %i but got %i\n", expected, *actual); | ||
| } | ||
| assert(*actual == expected); |
There was a problem hiding this comment.
No, if it fails, we want the CI to trigger and fail this sample/test (this should never fail).
There was a problem hiding this comment.
Right, but unless we remove NDEBUG, this check will not be there in Release builds.
There was a problem hiding this comment.
Oh thats what you are asking about, changed. Thanks for the comment here!
projects/rocprofiler-sdk/samples/counter_collection/device_serialized_main.cpp
Outdated
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.hpp
Show resolved
Hide resolved
projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/profile_serializer.cpp
Show resolved
Hide resolved
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com [ROCm/rdc commit: b8f0ab3]
| { | ||
| printf("[error] Expected %i but got %i\n", expected, *actual); | ||
| } | ||
| assert(*actual == expected); |
There was a problem hiding this comment.
Right, but unless we remove NDEBUG, this check will not be there in Release builds.
| HIP_CALL(hipSetDevice(0)); | ||
| DualStreamExecutor executor(0); | ||
| *no_opt_0 = 0; | ||
| srand((unsigned int) time(NULL)); |
There was a problem hiding this comment.
Maybe we should have a reproducible seed here. Otherwise it becomes impossible to debug if there is a random error in CI
|
|
||
| // Add a second barrier packet for memory synchronization so that the | ||
| // block_signal completion update is visible to CP | ||
| ret.push_back(CreateBarrierPacket(nullptr, nullptr)); |
There was a problem hiding this comment.
I dont get this one. Why would the block signal not be visible to CP?
There was a problem hiding this comment.
It shouldn't be needed, this was left in from testing, removed.
| // Finalize queue destruction | ||
| auto* controller = CHECK_NOTNULL(get_queue_controller()); | ||
| controller->set_queue_state(queue_state::to_destroy, id); | ||
| controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, 0); |
There was a problem hiding this comment.
| controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, 0); | |
| controller->get_core_table().hsa_signal_store_screlease_fn(queue.block_signal, RELEASE_BARRIER); |
| // Notify any waiters on the ready condition variable | ||
| { | ||
| std::lock_guard<std::mutex> lock(queue.cv_mutex); | ||
| queue.cv_ready_signal.notify_all(); |
There was a problem hiding this comment.
I don't see anyone using this. I think we can remove.
There was a problem hiding this comment.
Pull Request Overview
This PR implements a comprehensive dual-signal serialization mechanism for HSA kernel execution profiling to ensure accurate hardware counter measurements by enforcing strict single-kernel execution across all profiled queues. It also centralizes GPU target configuration and enhances testing infrastructure.
Key Changes:
- Added dual-signal control mechanism (block_signal + ready_signal) for HSA profiler serialization
- Centralized GPU targets configuration across CMake build system with ROCPROFILER_GPU_TARGETS option
- Enhanced testing infrastructure with DualStreamExecutor class and device serialization validation
Reviewed Changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rocprofiler-sdk/tests/common/CMakeLists.txt | Replaced hardcoded GPU targets with centralized ROCPROFILER_GPU_TARGETS variable |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.hpp | Added to_string() method and fmt formatter for debugging, plus debug print function |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp | Implemented to_string() method and debug state printing functionality |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Added to_string() method and fmt formatter for Queue debugging |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Implemented to_string() method with detailed signal state information |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/profile_serializer.hpp | Major redesign with comprehensive documentation and new ready_signal support |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/profile_serializer.cpp | Complete rewrite implementing dual-signal serialization mechanism |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/hsa_barrier.hpp | Added to_string() method and fmt formatter for barrier debugging |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp | Implemented to_string() method for barrier state visualization |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/evaluate_ast.cpp | Enhanced error message with status code for AQL packet decoding failures |
| projects/rocprofiler-sdk/samples/openmp_target/CMakeLists.txt | Updated to use centralized ROCPROFILER_GPU_TARGETS variable |
| projects/rocprofiler-sdk/samples/counter_collection/main.cpp | Added progress indicators for long-running kernel launches |
| projects/rocprofiler-sdk/samples/counter_collection/device_serialized_main.cpp | Added DualStreamExecutor class and kernel order validation testing |
| projects/rocprofiler-sdk/cmake/rocprofiler_options.cmake | Added ROCPROFILER_GPU_TARGETS option with default GPU target list |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
[nitpick] Using const_cast to remove const qualifier is generally a code smell and should be avoided. Consider modifying the Queue::intercept_queue() method to return a non-const pointer if it's safe to do so, or verify that the HSA API function doesn't modify the queue data.
| // Set up async signal handler for ready_signal when it equals -1 | |
| // The handler will be called when ready_signal transitions to -1 | |
| hsa_status_t status = controller.get_ext_table().hsa_amd_signal_async_handler_fn( | |
| queue.ready_signal, | |
| HSA_SIGNAL_CONDITION_EQ, | |
| -1, // Trigger value | |
| profiler_serializer_ready_signal_handler, | |
| const_cast<hsa_queue_t*>(queue.intercept_queue())); | |
| queue.intercept_queue()); |
There was a problem hiding this comment.
Manipulating NDEBUG to force assert behavior in release builds is problematic and can cause undefined behavior. Consider using a custom assertion macro or explicit error handling instead of redefining system macros.
There was a problem hiding this comment.
[nitpick] Using the global srand() function can affect other parts of the program. Consider using a local random number generator (e.g., std::mt19937) with explicit state management for better isolation.
33cecf6 to
cde1a90
Compare
…prove testing infrastructure This commit introduces a comprehensive dual-signal serialization mechanism for HSA kernel execution profiling, ensuring accurate hardware counter measurements by enforcing strict single-kernel execution across all profiled queues. Key Changes: HSA Profiler Serialization Enhancements: - Add ready_signal support to track when queues reach blocking barriers - Implement dual-signal control mechanism (block_signal + ready_signal) - Add queue_ready() handler for barrier notification events - Implement register_queue_ready_handler() for async signal management - Add select_and_grant_ready_queue() to centralize queue selection logic - Rename _dispatch_ready to _enqueued_kernels for clarity - Add _ready_queues multiset to track barrier-ready queues - Add _seen_queues to prevent duplicate handler registration - Ensure only queues that have signaled ready can execute - Add comprehensive trace logging for debugging serialization issues Build System Improvements: - Centralize GPU targets configuration across CMake build system - Add ROCPROFILER_GPU_TARGETS option with default support for: gfx906, gfx908, gfx90a, gfx942, gfx950, gfx1100, gfx1101, gfx1102 - Simplify GPU target management for tests and samples Testing Infrastructure: - Add DualStreamExecutor class for multi-stream kernel testing - Add check_order_kernel to verify serialization correctness - Enhance device serialization tests with barrier validation - Fix test deadlocks in hsa_code_object_app Debugging and Observability: - Add comprehensive to_string() methods for Queue and profiler_serializer - Include signal values (block_signal, ready_signal) in debug output - Add _ready_queues information to serializer state dumps - Enhance logging with detailed state transitions - Add trace-level logging for serialization flow analysis Bug Fixes: - Fix potential deadlock when no queues signal ready - Fix race condition in ready_signal handler registration - Ensure proper cleanup in destroy_queue with error reporting - Fix const correctness issues in queue controller This implementation ensures deterministic kernel execution ordering for accurate profiling while maintaining hardware-level synchronization through HSA barrier packets. The ready_signal mechanism provides visibility into queue readiness state, preventing deadlocks and ensuring efficient queue scheduling.
df96b7c to
51444b6
Compare
jrmadsen
left a comment
There was a problem hiding this comment.
Only reviewed the cmake changes and they are invalid.
|
|
||
| set(DEFAULT_GPU_TARGETS "gfx906" "gfx908" "gfx90a" "gfx942" "gfx950" "gfx1100" "gfx1101" | ||
| "gfx1102") | ||
| set(DEFAULT_GPU_TARGETS ${ROCPROFILER_GPU_TARGETS}) |
There was a problem hiding this comment.
This is invalid when samples are build against an install -- ROCPROFILER_GPU_TARGETS is not set. Furthermore, all of these targets may not be supported by OpenMP
|
|
||
| set(OPENMP_GPU_TARGETS | ||
| "${DEFAULT_GPU_TARGETS}" | ||
| "${ROCPROFILER_GPU_TARGETS}" |
There was a problem hiding this comment.
This is invalid when tests are build against an install of rocprofiler-sdk -- ROCPROFILER_GPU_TARGETS is not set. Furthermore, not all these targets may be supported by OpenMP.
| "gfx1100" | ||
| "gfx1101" | ||
| "gfx1102") | ||
| set(DEFAULT_GPU_TARGETS ${ROCPROFILER_GPU_TARGETS}) |
There was a problem hiding this comment.
This is invalid when samples are build against an install of rocprofiler-sdk -- ROCPROFILER_GPU_TARGETS is not set.
|
Not needed anymore. |
…MI_STATUS_NO_PERM (#192) * [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM Changes: - Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not returning AMDSMI_STATUS_NO_PERM - Changed amd-smi partition --accelerator to provide user with a warning if users does not use sudo or root permissions. - Updated changelog for fixes planned for 6.4.1 release Signed-off-by: Charis Poag <Charis.Poag@amd.com>
…MI_STATUS_NO_PERM (#192) * [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM Changes: - Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not returning AMDSMI_STATUS_NO_PERM - Changed amd-smi partition --accelerator to provide user with a warning if users does not use sudo or root permissions. - Updated changelog for fixes planned for 6.4.1 release Signed-off-by: Charis Poag <Charis.Poag@amd.com> [ROCm/amdsmi commit: 0402bb4]
…MI_STATUS_NO_PERM (#192) * [SWDEV-513807] Fix amd-smi partition --accelerator not returning AMDSMI_STATUS_NO_PERM Changes: - Fixed amdsmi_get_gpu_accelerator_partition_profile_config() from not returning AMDSMI_STATUS_NO_PERM - Changed amd-smi partition --accelerator to provide user with a warning if users does not use sudo or root permissions. - Updated changelog for fixes planned for 6.4.1 release Signed-off-by: Charis Poag <Charis.Poag@amd.com> [ROCm/amdsmi commit: 0402bb4]



This commit introduces a comprehensive dual-signal serialization mechanism for HSA kernel execution profiling, ensuring accurate hardware counter measurements by enforcing strict single-kernel execution across all profiled queues.
Key Changes:
HSA Profiler Serialization Enhancements:
Build System Improvements:
Testing Infrastructure:
Debugging and Observability:
Bug Fixes:
This implementation ensures deterministic kernel execution ordering for accurate profiling while maintaining hardware-level synchronization through HSA barrier packets. The ready_signal mechanism provides visibility into queue readiness state, preventing deadlocks and ensuring efficient queue scheduling.