amd_smi: AMD GPU System Management Interface via AMD SMI library. #481

Merged
djwoun merged 1 commit into icl-utk-edu:master from djwoun:feature/squashed-branch-amdsmi on Nov 4, 2025

djwoun commented Sep 17, 2025

Pull Request Description

AMD SMI PAPI Component

Overview

This component adds AMD SMI to PAPI. It discovers GPUs at runtime and exposes AMD SMI-reported metrics (temperatures, power, fans, PCIe, RAS/ECC, cache/VRAM info, utilization, counters, etc.) as PAPI native events.

Targets ROCm 6.4.0 (AMD SMI ~25.3) and is expected to work with ROCm releases from 6.4.0 through 7.0.1.
Tested on a combination of MI210 and MI300 GPUs with the following ROCm versions:

  • 6.4.0
  • 7.0.1

Usage

# point to your ROCm install (tested with 6.4.0)
export PAPI_AMDSMI_ROOT=/opt/rocm-6.4.0

# configure PAPI
./configure --prefix=${INSTDIR} --with-components="amd_smi"

How it fits together (high-level)

  • linux-amd-smi.c: implements the PAPI vector (init/start/stop/read/reset/etc.) and delegates to the internal amds_* APIs.
  • amds.c: loads the AMD SMI library with dlopen, discovers devices, builds the native event table, and wires up accessors.
  • amds_accessors.c: one function per metric; each calls the AMD SMI library to read or write a value.
  • amds_ctx.c: per-eventset lifecycle (open/close/start/stop/read/write/reset for groups of events); also enforces device usage.
  • amds_evtapi.c: native event enumeration, i.e. the code↔name↔description helpers for PAPI.
  • amds_priv.h: internal types (e.g., native_event_t), globals, and AMD SMI function-pointer declarations.
  • amds_funcs.h: list of AMD SMI API calls used (generates the function-pointer declarations/definitions).
  • htable.h: lightweight string→event lookup (for fast name→code mapping).
  • amds.h: public “component-internal” API used across the above files.
  • Rules.amd_smi: build glue to include this component in PAPI.
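The idea of a single API list that generates the function-pointer declarations can be illustrated with an X-macro. The following is only a sketch of that general pattern, not the component's actual code: every name in it (DEMO_SMI_FUNCS, demo_smi_init, demo_smi_get_temp, the fake_* stand-ins) is hypothetical, and the prototypes are placeholders rather than real AMD SMI signatures. The component itself resolves symbols with dlsym(); here a resolver callback is used instead so the sketch runs without the library installed.

```c
#include <stddef.h>
#include <string.h>

/* Generic function-pointer type used to carry resolved symbols. */
typedef void (*generic_fn)(void);
/* A resolver maps a symbol name to its address (in real use: a dlsym wrapper). */
typedef generic_fn (*resolver_fn)(const char *symbol);

/* Hypothetical API list: X(name, return type, parameter list).
 * These are placeholders, not the real AMD SMI prototypes. */
#define DEMO_SMI_FUNCS(X)                              \
    X(demo_smi_init,     int, (unsigned flags))        \
    X(demo_smi_get_temp, int, (int dev, long *out))

/* Expansion 1: declare one function pointer per entry, e.g. demo_smi_init_p. */
#define DECLARE_PTR(name, ret, args) static ret (*name##_p) args;
DEMO_SMI_FUNCS(DECLARE_PTR)
#undef DECLARE_PTR

/* Expansion 2: resolve every listed symbol by name.
 * Returns 0 on success, -1 as soon as one symbol is missing. */
static int demo_load_symbols(resolver_fn resolve)
{
#define RESOLVE(name, ret, args)                       \
    name##_p = (ret (*) args) resolve(#name);          \
    if (name##_p == NULL) return -1;
    DEMO_SMI_FUNCS(RESOLVE)
#undef RESOLVE
    return 0;
}

/* Stand-in "library" so the sketch is runnable on its own. */
static int fake_init(unsigned flags) { (void)flags; return 0; }
static int fake_get_temp(int dev, long *out) { *out = 40 + dev; return 0; }

static generic_fn fake_resolve(const char *symbol)
{
    if (strcmp(symbol, "demo_smi_init") == 0) return (generic_fn)fake_init;
    if (strcmp(symbol, "demo_smi_get_temp") == 0) return (generic_fn)fake_get_temp;
    return NULL;
}
```

Calling demo_load_symbols(fake_resolve) fills demo_smi_init_p and demo_smi_get_temp_p. Adding an API then means adding one line to the list, which keeps the declarations and the loader loop in sync.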

File-by-file (concise)

  • linux-amd-smi.c
    Declares the papi_vector_t for this component; initializes on first use; hands off work to amds_* for device/event management; implements PAPI hooks (init_component, update_control_state, start, read, stop, reset, shutdown, and native-event queries).

  • amds.c
    Dynamically loads libamd_smi.so, resolves AMD SMI symbols, discovers sockets/devices, and builds the native event table. Defines helpers to add simple and counter-based events. Manages global teardown (destroy event table, close library).

  • amds_accessors.c
    Implements the accessors that read/write individual metrics (e.g., temperatures, fans, PCIe, energy, power caps, RAS/ECC, clocks, VRAM, link topology, XGMI/PCIe metrics, firmware/board info, etc.). Each accessor maps an event’s (variant, subvariant) to the right SMI call and returns the value.

  • amds_ctx.c
    Provides the per-eventset context:

    • amds_ctx_open/close: acquire/release devices and run per-event open/close hooks.
    • amds_ctx_start/stop: start/stop counters where needed.
    • amds_ctx_read/write/reset: read current values, optionally write supported controls (e.g., power cap), and zero the software view.
  • amds_evtapi.c
    Implements native-event enumeration for PAPI (enum, code_to_name, name_to_code, code_to_descr) using the in-memory event table and a small hash map for fast lookups.

  • amds_priv.h
    Internal definitions: native_event_t (name/descr/device/mode/value + open/close/start/stop/access callbacks), global getters, and the AMD SMI function-pointer declarations (via amds_funcs.h).

  • amds_funcs.h
    Centralized macro list of AMD SMI APIs used by the component; generates function-pointer declarations/definitions so amds.c can dlsym() them at runtime. Conditional entries handle newer SMI features.

  • htable.h
    Minimal chained hash table for name→event mapping; used by amds_evtapi.c to resolve native event names quickly.

  • amds.h
    Public, component-internal API across files: init/shutdown, native-event queries, context ops, and error-string retrieval.

  • Rules.amd_smi
    Build integration for PAPI’s make system; compiles this component and sets include/library paths for AMD SMI.
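The accessor design described above (an event record carrying an access callback, with (variant, subvariant) selecting the underlying query) can be sketched as follows. This is an illustration under stated assumptions, not the component's code: demo_event_t, demo_access_temp, demo_read, and fake_smi_get_temp are hypothetical names, the struct is a reduced stand-in for native_event_t, and the fake query returns a value computed from its arguments so the dispatch is visible.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct demo_event demo_event_t;
typedef int (*demo_access_fn)(demo_event_t *ev);

/* Reduced, hypothetical stand-in for an event-table entry. */
struct demo_event {
    const char *name;       /* hypothetical event name */
    int device;             /* device index the event belongs to */
    int variant;            /* which metric the accessor should query */
    int subvariant;         /* which sensor/instance of that metric */
    int64_t value;          /* last value read */
    demo_access_fn access;  /* callback that refreshes 'value' */
};

/* Stand-in for a library query; encodes its arguments into the result. */
static int fake_smi_get_temp(int device, int sensor, int metric, int64_t *out)
{
    *out = 1000 * (int64_t)device + 100 * (int64_t)sensor + metric;
    return 0;
}

/* One accessor per metric family: maps (variant, subvariant) to the query. */
static int demo_access_temp(demo_event_t *ev)
{
    return fake_smi_get_temp(ev->device, ev->subvariant, ev->variant, &ev->value);
}

/* Read a group of events by invoking each event's accessor in turn.
 * Returns 0 on success, or the first non-zero accessor status. */
static int demo_read(demo_event_t *events, size_t n, int64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (events[i].access == NULL) return -1;
        int rc = events[i].access(&events[i]);
        if (rc != 0) return rc;
        out[i] = events[i].value;
    }
    return 0;
}
```

With this shape, the read path does not need to know which library call backs a given event; it only walks the group and calls each entry's accessor.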


Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc.
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

djwoun force-pushed the feature/squashed-branch-amdsmi branch from 02941d9 to bdcd745 on September 17, 2025 06:07
djwoun force-pushed the feature/squashed-branch-amdsmi branch 9 times, most recently from 0f4cbfb to 11c3c8e on September 25, 2025 08:56

Treece-Burgess left a comment

I have tested this PR on Odyssey at Oregon. The exact setup is:
- GPU: Quad AMD MI300A
- CPU: Quad AMD MI300A
- OS: RHEL 8.10
- ROCm version: 7.0.1

The specific functions and utilities I tested are:
- Building PAPI with ./configure --prefix=$PWD/test-install --with-components="amd_smi"
- PAPI utilities: papi_component_avail, papi_native_avail, and papi_command_line
- PAPI functions: PAPI_event_name_to_code, PAPI_event_code_to_name, PAPI_enum_cmp_event, PAPI_enum_event, PAPI_get_event_info, PAPI_get_component_info, PAPI_query_event, PAPI_query_named_event, PAPI_create_eventset, PAPI_add_event, PAPI_add_named_event, PAPI_start, PAPI_read, PAPI_stop, PAPI_reset, PAPI_num_events, PAPI_list_events, PAPI_cleanup_eventset, PAPI_destroy_eventset, and PAPI_shutdown


dbarry9 commented Oct 10, 2025

I am testing this pull request on the Frontier supercomputer.

djwoun force-pushed the feature/squashed-branch-amdsmi branch 2 times, most recently from 3e12e3c to 582731d on October 27, 2025 19:04
djwoun force-pushed the feature/squashed-branch-amdsmi branch from 11ecd68 to a17c734 on October 30, 2025 18:31

dbarry9 commented Oct 30, 2025

I have tested this PR on the Frontier supercomputer using ROCm 7.0.2. All component tests pass, with the exception of amdsmi_set_test:

WARNING: power_cap write failed: Unknown error code
Skipping fan_speed test: event unavailable
PASSED with WARNING

Note that I do not have certain permissions on this machine.

The PAPI utilities function properly, and the inclusion of the amd_smi component yields the following:

Name: amd_smi AMD GPU System Management Interface via AMD SMI library
Native: 356, Preset: 0, Counters: 0

and there are indeed 356 entries for the amd_smi component in the output of papi_native_avail.
@djwoun Should "Counters" be zero? For other components, such as rocp_sdk, "Counters" equals "Native":

Name: rocp_sdk GPU events and metrics via AMD ROCprofiler-SDK
Native: 530, Preset: 0, Counters: 530

djwoun force-pushed the feature/squashed-branch-amdsmi branch from 738f63c to c27acef on November 3, 2025 21:37
djwoun force-pushed the feature/squashed-branch-amdsmi branch from c27acef to 88fd7c9 on November 4, 2025 20:05
djwoun merged commit fa446d3 into icl-utk-edu:master on Nov 4, 2025
14 of 18 checks passed