amd_smi: add gpu_metrics accumulation and residency counters #568

Open
djwoun wants to merge 2 commits into icl-utk-edu:master from djwoun:amd-smi-gpu-metrics-events

@djwoun (Contributor) commented Feb 25, 2026

Pull Request Description

Adds nine throttle accumulation/residency counters and edits existing event descriptions. Tested the PAPI utilities and the amd_smi tests on MI300A with ROCm 7.1.1 and 6.4.4.

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess (Contributor)

I am reviewing this PR.


CHECK_EVENT_IDX(idx);
CHECK_SNPRINTF(name_buf, sizeof(name_buf), "vr_thm_residency_acc:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d accumulated voltage regulator thermal throttler residency", d);
Contributor

From papi_native_avail on Illyad, this metric outputs a newline as follows:

--------------------------------------------------------------------------------
| amd_smi:::vr_thm_residency_acc                                               |
|            Device 0 accumulated voltage regulator thermal throttler residency|
|                                                                              |
|     :device=0                                                                |
|            Mandatory device qualifier [0,1]                                  |
--------------------------------------------------------------------------------

Is this from papi_native_avail formatting or is this coming from amd_smi?

Contributor Author

Yeah, I think it comes from papi_native_avail: that line of text is exactly 66 characters, and papi_native_avail just cuts it right there.
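
The 66-character figure can be checked directly against the description string in the diff. The sketch below is only an illustration: `count_wrapped_rows` is a hypothetical stand-in for a fixed-width formatter, not papi_native_avail's actual code, and the 66-column width is inferred from the output above. Under that assumption, a description that exactly fills the row produces one extra, empty row.

```c
#include <stddef.h>
#include <string.h>

/* Description string from the diff above, with %d expanded to 0. */
static const char *VR_THM_DESCR =
    "Device 0 accumulated voltage regulator thermal throttler residency";

/* Assumed width of the description column; inferred from the output
 * above, not read from papi_native_avail's source. */
#define DESCR_COLS 66

/* Hypothetical wrapper: counts the rows a naive formatter would print if
 * it starts a new row every time the current one fills up, even when no
 * text remains to be printed. */
static size_t count_wrapped_rows(const char *text, size_t cols)
{
    size_t len = strlen(text);
    return len / cols + 1;
}
```

With these assumptions, the vr_thm_residency_acc description yields two rows (the second one empty), while a shorter description such as the num_partition one yields a single row.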


CHECK_EVENT_IDX(idx);
CHECK_SNPRINTF(name_buf, sizeof(name_buf), "num_partition:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d number of current partitions", d);
Contributor

Q: Are you able to list in the description all the devices that are present for that event? For example, num_partition shows:

--------------------------------------------------------------------------------
| amd_smi:::num_partition                                                      |
|            Device 0 number of current partitions                             |
|     :device=0                                                                |
|            Mandatory device qualifier [0,1]                                  |
--------------------------------------------------------------------------------

The description only mentions Device 0, but device 1 is supported as well. Would it be possible to use: Device 0,1 number of current partitions?

Contributor Author

This is how the generic event descriptions work for AMD SMI: the description shown is the one for device 0.
For example:
"amd_smi:::L1_dcache_size_type_0"
"Device 0 L1 data cache size (bytes)"
It may be worth in future to add capabilities to have "0,1" listings like the hasing event descriptions.
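
As a rough illustration of the "0,1" listing discussed here, the following sketch builds a description that names every device. The helper `format_multi_device_descr` is hypothetical and not part of the amd_smi component; it assumes devices are numbered contiguously from 0.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of the amd_smi component: writes
 * "Device 0,1,...,n-1 <what>" into out. Returns 0 on success and -1 if
 * the buffer is too small or snprintf fails. */
static int format_multi_device_descr(char *out, size_t out_len,
                                     int num_devices, const char *what)
{
    size_t used = 0;
    int n = snprintf(out, out_len, "Device ");
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    used = (size_t)n;

    /* Append the comma-separated device indices. */
    for (int d = 0; d < num_devices; ++d) {
        n = snprintf(out + used, out_len - used, d ? ",%d" : "%d", d);
        if (n < 0 || (size_t)n >= out_len - used)
            return -1;
        used += (size_t)n;
    }

    /* Append the metric description itself. */
    n = snprintf(out + used, out_len - used, " %s", what);
    if (n < 0 || (size_t)n >= out_len - used)
        return -1;
    return 0;
}
```

Called with two devices and the text "number of current partitions", this produces "Device 0,1 number of current partitions". Longer descriptions would also be more likely to fill the papi_native_avail description column, so the list format is a trade-off.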

CHECK_SNPRINTF(name_buf, sizeof(name_buf), "gpu_throttle_status:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf),
"Device %d throttle status", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d current throttle status bitmask", d);
Contributor

nit: A few of the existing events have been reformatted. Could you revert these changes, since this PR is for adding the new counters?

Contributor Author

Maybe I should rename the PR to mention the modified descriptions? I found better documentation on what exactly these metrics do, and they are queried by the same function (amdsmi_get_gpu_metrics_info_p) as the new counters, so I wanted to group those changes under this PR.

if (add_event(&idx, name_buf, descr_buf, d, 10, 0, PAPI_MODE_READ,
access_amdsmi_gpu_metrics) != PAPI_OK)
return PAPI_ENOMEM;

Contributor

The new events all check out on Odyssey, but on Illyad with ROCm 7.1.1, the events:

amd_smi:::accumulation_counter
amd_smi:::prochot_residency_acc
amd_smi:::ppt_residency_acc 
amd_smi:::socket_thm_residency_acc
amd_smi:::vr_thm_residency_acc
amd_smi:::hbm_thm_residency_acc

show a counter value of -1 when run with papi_command_line. Do you know why this is occurring?

Contributor Author

There is a small problem: amdsmi_get_gpu_metrics_info_p reports success as long as at least one of the 20 metrics it queries for is valid.

Maybe I should add additional checks to see whether an individual metric comes back as a sentinel value?
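
A per-metric sentinel check along these lines might look like the sketch below. It rests on an assumption that is not confirmed in this thread: that an unsupported 64-bit gpu_metrics field comes back as all ones (UINT64_MAX). On common two's-complement platforms that value reads as -1 once converted to a signed long long, which would be consistent with the -1 seen in papi_command_line. The function name and status codes are hypothetical.

```c
#include <stdint.h>

/* Hypothetical status codes for this sketch; the real component would
 * use PAPI's own error codes. */
#define METRIC_OK           0
#define METRIC_UNSUPPORTED (-1)

/* Assumption: an unsupported 64-bit gpu_metrics field is reported as
 * all ones (UINT64_MAX). Returns METRIC_UNSUPPORTED for that sentinel
 * and leaves *value untouched; otherwise stores the raw value. */
static int read_u64_metric(uint64_t raw, long long *value)
{
    if (raw == UINT64_MAX)
        return METRIC_UNSUPPORTED;
    *value = (long long)raw;
    return METRIC_OK;
}
```

With a check like this, an event whose field holds the sentinel could be reported as unsupported, or skipped at registration time, instead of returning -1 as a counter value.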

2 participants