[rocprofiler-systems] AMD SMI collector using policy-based design#3703

Merged
adjordje-amd merged 26 commits into develop from
users/adjordje-amd/pmc-collector
Mar 30, 2026

Contributor

@adjordje-amd adjordje-amd commented Mar 3, 2026

Motivation

The current AMD SMI implementation faces several challenges related to its maintainability and architecture, which this refactor aims to address:

Monolithic Design: The original amd_smi.cpp file, comprising 1354 lines, merges sampling, caching, and output logic. This tight coupling makes it challenging to modify or extend individual components.

Lack of Testability: Due to the integration with AMD SMI hardware APIs, the implementation is difficult to unit test and necessitates real hardware for verification.

Limited Flexibility: Introducing support for new output formats, such as RocPD alongside Perfetto, involves substantial code duplication.

Accumulated Bugs: The implementation contains several persistent bugs.

The refactor addresses these issues by replacing the monolithic implementation with a policy-based design that emphasizes modularity, testability, and performance.

Technical Details

New architecture
[image: new architecture diagram]

JIRA ID

  • AIPROFSYST-13
  • AIPROFSYST-28
  • AIPROFSYST-29

Test Plan

Test Result

Submission Checklist

@adjordje-amd adjordje-amd requested review from a team and jrmadsen as code owners March 3, 2026 16:13
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from ddfae4b to 13392da Compare March 10, 2026 15:13
@adjordje-amd adjordje-amd changed the title [DNM] [rocprofiler-systems] AMD SMI collector using policy-based design [rocprofiler-systems] AMD SMI collector using policy-based design Mar 10, 2026
Rewrite AMD SMI and AINIC collectors using a policy-based design pattern
that enables code reuse and extensibility. Key changes:

- Add base collector template with configurable policies for caching,
  Perfetto output, and device handling
- Implement GPU collector with SDMA utilization metrics
- Implement NIC collector for AINIC RDMA metrics
- Reorganize collector files into unified hierarchy under pmc/
- Update trace cache processors for new sample types
- Add comprehensive unit tests with mock drivers
- Replace std::stringstream with fmt::format in types.hpp
- Use std::vector instead of std::unordered_map for SDMA states
- Template sample() callback to avoid std::function overhead
- Fix CMake include dirs and link GPU/NIC tests to unit test target
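
As a rough illustration of the policy-based base collector and the templated sample() callback described in this commit, here is a minimal sketch. It is not the code from this PR: every name in it (`collector`, `vector_cache_policy`, `recording_output_policy`, `sample_devices`) is hypothetical, and the real policies cover caching, Perfetto output, and device handling in far more detail.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sample record; the PR's real sample types are not shown here.
struct sample
{
    std::uint32_t device_id = 0;
    std::uint64_t timestamp = 0;
    double        value     = 0.0;
};

// Hypothetical cache policy: keeps every sample in memory.
struct vector_cache_policy
{
    std::vector<sample> stored;
    void store(const sample& s) { stored.push_back(s); }
};

// Hypothetical output policy: records the track name it would write to.
struct recording_output_policy
{
    std::vector<std::string> emitted;
    void emit(const sample& s)
    {
        emitted.push_back("device" + std::to_string(s.device_id));
    }
};

// Base collector assembled from policies chosen at compile time.
template <typename CachePolicy, typename OutputPolicy>
class collector
: private CachePolicy
, private OutputPolicy
{
public:
    // The reader callback is a template parameter, so the compiler sees the
    // concrete callable type and no std::function wrapper is involved.
    template <typename ReadFn>
    void sample_devices(std::uint32_t device_count, std::uint64_t now, ReadFn&& read)
    {
        for(std::uint32_t id = 0; id < device_count; ++id)
        {
            sample s{ id, now, read(id) };
            CachePolicy::store(s);
            OutputPolicy::emit(s);
        }
    }

    const std::vector<sample>&      cached() const { return CachePolicy::stored; }
    const std::vector<std::string>& emitted() const { return OutputPolicy::emitted; }
};
```

Swapping in a different output (for example a RocPD-style writer) would then mean supplying another policy type rather than duplicating the sampling loop, and a unit test can pass a lambda as `read` instead of touching hardware.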
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from 13392da to 3710324 Compare March 12, 2026 17:56
Remove debug printf statement left in perfetto_policy.hpp that was
printing JPEG activity status to stdout in production.

Also apply clang-format-18 to fix alignment issues in:
- sample_processor.hpp (type alias alignment)
- base/collector.hpp (comment wrapping, blank line)
- gpu/collector.hpp (LOG_ERROR formatting, blank line)
Address PR review feedback and modernize PMC collectors:

- Replace strcmp chain with hash map for O(1) lookup (NIC device)
- Move static storage into policy classes for encapsulation
- Convert all namespaces to C++17 nested syntax (18 files)
- Replace timemory::join with fmt::format for type safety

Net reduction: 163 lines. Build verified.
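
The strcmp-chain replacement and the nested-namespace conversion mentioned above might look roughly like the sketch below. The namespace `demo::pmc::nic`, the `metric` enum, and the metric names are assumptions made for illustration; the actual NIC metric names in the PR are not shown in this conversation. Note that the hash map gives O(1) lookup on average, not in the worst case.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>
#include <unordered_map>

// C++17 nested namespace syntax instead of three separate namespace blocks.
namespace demo::pmc::nic
{
// Hypothetical metric identifiers.
enum class metric : std::uint8_t
{
    rx_bytes,
    tx_bytes,
    rx_packets,
    tx_packets
};

// One table lookup replaces a chain of strcmp() calls.
// The keys are string literals, so the string_view keys never dangle.
inline std::optional<metric>
lookup_metric(std::string_view name)
{
    static const std::unordered_map<std::string_view, metric> table = {
        { "rx_bytes", metric::rx_bytes },
        { "tx_bytes", metric::tx_bytes },
        { "rx_packets", metric::rx_packets },
        { "tx_packets", metric::tx_packets },
    };

    auto it = table.find(name);
    if(it == table.end()) return std::nullopt;
    return it->second;
}
}  // namespace demo::pmc::nic
```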
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from e946934 to 6dff09c Compare March 12, 2026 19:57
Contributor

@dgaliffiAMD dgaliffiAMD left a comment

Still reviewing the new files, but here are some initial thoughts.

It looks like the AI-NIC addition is breaking the backward-compatibility configurations. It probably just needs that preprocessor version check.

Please add an update to the CHANGELOG.

I'm seeing some unit test failures locally but haven't had a chance to root-cause them yet.

1: [==========] 443 tests from 24 test suites ran. (2660 ms total)
1: [  PASSED  ] 428 tests.
1: [  FAILED  ] 15 tests, listed below:
1: [  FAILED  ] DeviceTest.device_construction_no_support
1: [  FAILED  ] DeviceTest.device_construction_partial_support
1: [  FAILED  ] DeviceTest.edge_temperature_collection
1: [  FAILED  ] DeviceTest.jpeg_activity_collection_all_xcps
1: [  FAILED  ] DeviceTest.xcp_metrics_not_collected_when_unsupported
1: [  FAILED  ] DeviceTest.mixed_vcn_jpeg_support
1: [  FAILED  ] DeviceTest.all_metrics_supported_detection
1: [  FAILED  ] DeviceTest.vcn_activity_support_detection_any_xcp
1: [  FAILED  ] DeviceTest.vcn_activity_unsupported_all_sentinels
1: [  FAILED  ] DeviceTest.jpeg_activity_support_detection_any_xcp
1: [  FAILED  ] DeviceTest.vcn_activity_top_level_field_only
1: [  FAILED  ] DeviceTest.vcn_activity_in_both_fields
1: [  FAILED  ] DeviceTest.vcn_activity_detection_should_check_both_sources
1: [  FAILED  ] DeviceTest.vcn_activity_xcp_disabled_top_level_valid
1: [  FAILED  ] DeviceTest.full_lifecycle_with_realistic_data
1: 
1: 15 FAILED TESTS

@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from c8a3965 to 27dd089 Compare March 13, 2026 11:35
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from 27dd089 to b815fa2 Compare March 13, 2026 15:31
The NIC collector was a standalone 264-line implementation that
duplicated much of the base::collector pattern. This refactoring
aligns it with the GPU collector approach for consistency and
maintainability.

Key changes:
- Create nic_traits.hpp with NIC-specific behavior (name-based
  filtering, device context caching, agent registration)
- Simplify collector.hpp to a type alias (264 → 24 lines)
- Make perfetto_policy::post_process_device() public for traits access
- Use uint64_t consistently for timestamps
- Add noexcept to simple getter methods

The traits class bridges NIC-specific requirements to base::collector:
- Name-based device filtering (vs GPU's index-based)
- Device context storage for APIs needing device_name/product_name
- Agent registration during enumeration
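
To make the traits idea concrete, the sketch below shows one way a shared collector template could delegate device filtering and enumeration hooks to a traits type, so that the NIC collector reduces to a type alias. It is a simplified assumption of the shape, not the contents of nic_traits.hpp: `device_info`, `basic_collector`, `accepts`, and `on_enumerated` are hypothetical names.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical device description.
struct device_info
{
    std::size_t index;
    std::string name;
};

// GPU-style traits: filter devices by index, nothing to do on enumeration.
struct gpu_traits
{
    std::set<std::size_t> enabled_indices;

    bool accepts(const device_info& d) const { return enabled_indices.count(d.index) > 0; }
    void on_enumerated(const device_info&) {}
};

// NIC-style traits: filter devices by name and register an agent per device.
struct nic_traits
{
    std::set<std::string>    enabled_names;
    std::vector<std::string> registered_agents;

    bool accepts(const device_info& d) const { return enabled_names.count(d.name) > 0; }
    void on_enumerated(const device_info& d) { registered_agents.push_back(d.name); }
};

// Shared collector logic; device-specific behavior comes from Traits.
template <typename Traits>
class basic_collector
{
public:
    explicit basic_collector(Traits traits)
    : m_traits(std::move(traits))
    {}

    // Keeps only the devices the traits accept; returns how many were kept.
    std::size_t enumerate(const std::vector<device_info>& all)
    {
        m_devices.clear();
        for(const auto& d : all)
        {
            if(!m_traits.accepts(d)) continue;
            m_traits.on_enumerated(d);
            m_devices.push_back(d);
        }
        return m_devices.size();
    }

    const Traits&                   traits() const { return m_traits; }
    const std::vector<device_info>& devices() const { return m_devices; }

private:
    Traits                   m_traits;
    std::vector<device_info> m_devices;
};

// With the traits in place, each concrete collector is just an alias.
using gpu_collector = basic_collector<gpu_traits>;
using nic_collector = basic_collector<nic_traits>;
```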
Add SDMA (System DMA) usage as a new GPU metric for PMC collection,
enabling monitoring of DMA engine utilization.

Key changes:
- Add sdma_usage metric (ID 14) with accessor, setter, and mask
- Add SDMA to user metric aliases and Perfetto track info
- Update metric validation pattern to accept "sdma_usage"
- Add pytest marker for SDMA tests

Additional fixes:
- Change get_use_rocpd/get_caching_perfetto to return bool by value
- Fix store_sample parameter order (enabled before supported)
- Use core/amd_smi.hpp wrapper instead of direct amdsmi.h include
- Change sample.device_id from size_t to uint32_t
- Add debug logging for perfetto processor creation
Save point before implementing device_view type erasure.

Fix DeviceTest unit tests failing due to missing mock expectation
for get_gpu_asic_info(), which is now called during device
initialization to populate vendor/product names.
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from 013b786 to cfa3744 Compare March 17, 2026 13:45
Fix issues identified during PR review:

Log level corrections:
- Revert LOG_INFO to LOG_TRACE in agent_manager (routine operation)
- Remove debug LOG_INFO messages from cache_manager
- Change LOG_INFO to LOG_TRACE in rocpd_processor (hot path)
- Change LOG_INFO to LOG_DEBUG in perfetto_policy post-processing

Dead code removal:
- Remove unused device_desc variable in nic/cache_policy
- Remove uninitialized sample_metrics method from device_slice
- Remove commented-out code and unused typeinfo include

Move semantics fix:
- Add noexcept to provider destructor
- Implement explicit move constructor/assignment to prevent
  double-shutdown of AMD SMI driver in moved-from objects
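
The double-shutdown problem and its fix can be sketched as follows. This is an illustrative stand-in, not the PR's provider class: `fake_driver` replaces the real AMD SMI init/shutdown calls with counters, and the `m_owns` flag is an assumed way of tracking ownership.

```cpp
#include <utility>

// Stand-in for the driver's init/shutdown pair; counts calls so the
// behavior can be observed without hardware. Entirely hypothetical.
struct fake_driver
{
    static inline int init_calls     = 0;
    static inline int shutdown_calls = 0;

    static void init() { ++init_calls; }
    static void shutdown() { ++shutdown_calls; }
};

class provider
{
public:
    provider()
    : m_owns(true)
    {
        fake_driver::init();
    }

    ~provider() noexcept { release(); }

    provider(const provider&)            = delete;
    provider& operator=(const provider&) = delete;

    // Moving transfers ownership; the moved-from object no longer owns the
    // driver, so its destructor will not shut it down a second time.
    provider(provider&& other) noexcept
    : m_owns(std::exchange(other.m_owns, false))
    {}

    provider& operator=(provider&& other) noexcept
    {
        if(this != &other)
        {
            release();
            m_owns = std::exchange(other.m_owns, false);
        }
        return *this;
    }

    bool owns_driver() const noexcept { return m_owns; }

private:
    void release() noexcept
    {
        if(m_owns)
        {
            fake_driver::shutdown();
            m_owns = false;
        }
    }

    bool m_owns = false;
};
```

With a defaulted move constructor, both the source and the destination would keep `m_owns == true` and each destructor would call shutdown; the explicit `std::exchange` is what prevents that.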
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from cfa3744 to ff2044c Compare March 17, 2026 13:54
The NIC/AINIC AMD SMI APIs (amdsmi_nic_asic_info_t, amdsmi_get_nic_*,
AMDSMI_PROCESSOR_TYPE_AMD_NIC, etc.) are only available in AMD SMI
>= 26.3 (ROCm 7.3+). CI tests against older ROCm versions (6.3-7.2)
were failing because these APIs were used unconditionally.

This commit adds #if defined(ROCPROFSYS_BUILD_AINIC) guards around:
- NIC API wrapper functions in driver.hpp
- AMDSMI_PROCESSOR_TYPE_AMD_NIC usage in provider.hpp

This matches the existing pattern used for SDMA support with
AMD_SMI_SDMA_SUPPORTED.
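
A guard of this kind might be shaped like the sketch below. Only the macro name `ROCPROFSYS_BUILD_AINIC` comes from the commit message; the function `supported_processor_kinds` and the string labels are hypothetical stand-ins for the guarded wrapper functions and the `AMDSMI_PROCESSOR_TYPE_AMD_NIC` usage.

```cpp
#include <string>
#include <vector>

// Returns the processor kinds this build can query. The NIC entry is only
// compiled in when the build defines ROCPROFSYS_BUILD_AINIC, so builds
// against older AMD SMI headers never reference NIC-only symbols.
inline std::vector<std::string>
supported_processor_kinds()
{
    std::vector<std::string> kinds = { "gpu" };
#if defined(ROCPROFSYS_BUILD_AINIC)
    kinds.emplace_back("nic");
#endif
    return kinds;
}
```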
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from cbaf852 to 5c708a2 Compare March 19, 2026 18:49
@adjordje-amd adjordje-amd force-pushed the users/adjordje-amd/pmc-collector branch from 5c708a2 to 5879cac Compare March 20, 2026 10:21
Contributor

@dgaliffiAMD dgaliffiAMD left a comment

I've tested on my "Navi" system. The output generated from the transpose and decode tests looks good.
I haven't yet tested it on an MI system. I suspect those CI failures are MI-specific. Is it due to the XCP metrics?

Contributor

@marantic-amd marantic-amd left a comment

Some minor nitpicks and suggestions. Looks good.

Track metadata registration was called inside the per-device loop,
causing redundant re-initialization for each device. Move it before
the loop alongside category metadata initialization, which is the
correct initialization order.
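
The change in initialization order can be pictured with the following sketch. `track_registry` and its methods are hypothetical placeholders for the Perfetto policy's metadata and per-device post-processing calls; the counters exist only to make the call counts visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical registry that counts how often each step runs.
struct track_registry
{
    int                      category_inits = 0;
    int                      track_inits    = 0;
    std::vector<std::size_t> processed;

    void init_category_metadata() { ++category_inits; }
    void init_track_metadata() { ++track_inits; }
    void post_process_device(std::size_t id) { processed.push_back(id); }
};

// Metadata is registered once, before the loop; only the per-device work
// stays inside it. Previously init_track_metadata() ran once per device.
inline void
post_process(track_registry& reg, std::size_t device_count)
{
    reg.init_category_metadata();
    reg.init_track_metadata();

    for(std::size_t id = 0; id < device_count; ++id)
        reg.post_process_device(id);
}
```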
@adjordje-amd adjordje-amd merged commit 5e28f59 into develop Mar 30, 2026
52 of 54 checks passed
@adjordje-amd adjordje-amd deleted the users/adjordje-amd/pmc-collector branch March 30, 2026 14:59
4 participants