amd_smi: add gpu_metrics accumulation and residency counters #568
djwoun wants to merge 2 commits into icl-utk-edu:master
I am reviewing this PR.

CHECK_EVENT_IDX(idx);
CHECK_SNPRINTF(name_buf, sizeof(name_buf), "vr_thm_residency_acc:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d accumulated voltage regulator thermal throttler residency", d);
From papi_native_avail on Illyad, this metric outputs a newline as follows:
5530 --------------------------------------------------------------------------------
5531 | amd_smi:::vr_thm_residency_acc |
5532 | Device 0 accumulated voltage regulator thermal throttler residency|
5533 | |
5534 | :device=0 |
5535 | Mandatory device qualifier [0,1] |
5536 --------------------------------------------------------------------------------
Is this from papi_native_avail formatting or is this coming from amd_smi?
Yeah, I think so. That line of text must be exactly 66 characters, and papi_native_avail just cuts it right there.
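The 66-character figure can be checked from the format string in the diff. This is a minimal sketch; `descr_len` is a hypothetical helper for illustration and is not part of the amd_smi component.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: length of the formatted description for device d,
 * using the description format string from this PR's diff. */
static int descr_len(int d)
{
    /* With a NULL buffer and size 0, snprintf returns the number of
     * characters that would have been written (C99 behavior). */
    int n = snprintf(NULL, 0,
        "Device %d accumulated voltage regulator thermal throttler residency", d);
    assert(n >= 0);
    return n;
}
```

For device 0 this returns 66. If the description column in papi_native_avail is 66 characters wide (an assumption based on the reply, not checked against the utility's source), the text fills the column exactly, which would explain the extra empty row.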
CHECK_EVENT_IDX(idx);
CHECK_SNPRINTF(name_buf, sizeof(name_buf), "num_partition:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d number of current partitions", d);
Q: Are you able to list in the description all the devices that are present for that event? For example, num_partition shows:
5552 --------------------------------------------------------------------------------
5553 | amd_smi:::num_partition |
5554 | Device 0 number of current partitions |
5555 | :device=0 |
5556 | Mandatory device qualifier [0,1] |
5557 --------------------------------------------------------------------------------
The description only mentions Device 0, but device 1 is supported as well. Would it be possible to write: Device 0,1 number of current partitions?
This is how the generic event descriptions work for AMD SMI: the description shown is the one for device 0.
Like:
"amd_smi:::L1_dcache_size_type_0"
"Device 0 L1 data cache size (bytes)"
It may be worth adding the capability in the future to have "0,1" listings, like the hashing event descriptions.
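A "0,1" listing of the kind discussed here could be built along the following lines. This is only a sketch: `format_multi_device_descr` is a hypothetical helper, and it assumes devices are numbered contiguously from 0, which may not match how the component tracks them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "Device 0,1,... <what>" into out for ndev
 * devices numbered 0..ndev-1. Returns 0 on success, -1 on truncation. */
static int format_multi_device_descr(char *out, size_t out_sz,
                                     int ndev, const char *what)
{
    assert(out != NULL && what != NULL);

    int n = snprintf(out, out_sz, "Device ");
    if (n < 0 || (size_t)n >= out_sz)
        return -1;
    size_t used = (size_t)n;

    /* Append the comma-separated device indices. */
    for (int d = 0; d < ndev; ++d) {
        n = snprintf(out + used, out_sz - used, d ? ",%d" : "%d", d);
        if (n < 0 || (size_t)n >= out_sz - used)
            return -1;
        used += (size_t)n;
    }

    /* Append the event-specific part of the description. */
    n = snprintf(out + used, out_sz - used, " %s", what);
    if (n < 0 || (size_t)n >= out_sz - used)
        return -1;
    return 0;
}
```

Called with `ndev = 2` and `what = "number of current partitions"`, this produces `Device 0,1 number of current partitions`. Longer device lists make the description longer, so the line-wrapping behavior noted for vr_thm_residency_acc would need to be kept in mind.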
CHECK_SNPRINTF(name_buf, sizeof(name_buf), "gpu_throttle_status:device=%d", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf),
               "Device %d throttle status", d);
CHECK_SNPRINTF(descr_buf, sizeof(descr_buf), "Device %d current throttle status bitmask", d);
nit: A few of the existing events have been reformatted. Could you revert these changes, as this PR is for adding the new counters?
Maybe I should rename the PR to mention the modified descriptions? I found better documentation on what exactly these metrics do, and they are queried by the same function (amdsmi_get_gpu_metrics_info_p) as the new counters, so I wanted to group them under this PR.
if (add_event(&idx, name_buf, descr_buf, d, 10, 0, PAPI_MODE_READ,
              access_amdsmi_gpu_metrics) != PAPI_OK)
    return PAPI_ENOMEM;
The new events all check out on Odyssey, but on Illyad with ROCm 7.1.1, the events:
amd_smi:::accumulation_counter
amd_smi:::prochot_residency_acc
amd_smi:::ppt_residency_acc
amd_smi:::socket_thm_residency_acc
amd_smi:::vr_thm_residency_acc
amd_smi:::hbm_thm_residency_acc
show a counter value of -1 when run with papi_command_line. Do you know why this is occurring?
There is a small problem: amdsmi_get_gpu_metrics_info_p will return success even if only one of the 20 metrics it queries for is valid.
I'm thinking maybe I should add additional checks to see if it returns sentinel values?
Pull Request Description
Adds nine throttle accumulation/residency counters and edits event descriptions. Tested the PAPI utilities and the AMD SMI tests on MI300A with ROCm 7.1.1 and 6.4.4.
Author Checklist
Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
Commits are self contained and only do one thing
Commits have a header of the form:
module: short description
Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
The PR needs to pass all the tests