Skip to content

Support for NVML counters: GPU energy, instant GPU power, avg memory power#525

Merged
Treece-Burgess merged 5 commits intoicl-utk-edu:masterfrom
mpatrou:energy_counters
Jan 9, 2026
Merged

Support for NVML counters: GPU energy, instant GPU power, avg memory power#525
Treece-Burgess merged 5 commits intoicl-utk-edu:masterfrom
mpatrou:energy_counters

Conversation

@mpatrou
Copy link
Contributor

@mpatrou mpatrou commented Dec 19, 2025

Pull Request Description

Sibling PR (1/2) split into #529

The code

  • adds support for 3 NVML energy counters:
    • energy_consumption
    • gpu_inst_power
    • gpu_memory_avg_power

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess
Copy link
Contributor

@mpatrou Thank you for posting this PR, I will take a look through it.

@Treece-Burgess Treece-Burgess self-requested a review December 29, 2025 18:14
nvmlReturn_t DECLDIR nvmlDeviceGetMemoryInfo(nvmlDevice_t, nvmlMemory_t *);
nvmlReturn_t DECLDIR nvmlDeviceGetPerformanceState(nvmlDevice_t, nvmlPstates_t *);
nvmlReturn_t DECLDIR nvmlDeviceGetPowerUsage(nvmlDevice_t, unsigned int *);
nvmlReturn_t DECLDIR nvmlDeviceGetTotalEnergyConsumption(nvmlDevice_t, unsigned long long *);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a note from nvml.h, this function is supported from Volta and up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I tested on Voltar at Oregon with a P100 and it didn't seem to have an issue.

verbose_fprintf(stdout, "PAPI-HL Info: The event \"%s\" will be stored as instantaneous value.\n", requested_event_names[i]);
}

// except from nvml energy_consumption delta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: An above comment mentions that all nvml events will be instantaneous values. Why for energy_consumption and gpu_inst_power would this need to be changed to delta and the new event_type average respectively?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As when I use your updated papi_hl.c the value of energy_consumption is zero. Which for the master branch does not occur and I get a non-zero value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the first question.

  • Energy consumption is a cumulative counter. When we use papi_hl API to regions of code, we would need to measure the counter before and after the code and report the difference as the energy consumed to execute that region of code; thus the delta calculation.
  • The gpu_inst_power coule be benefit from an average calculation for a region, instead of reporting one value at the end of the region to give a more detailed information.

P.S. Thank you so much for your very detailed review! I am going though your comments and making changes locally!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the second comment: Did you try to profile some code that runs over 1ms or so? to give enough time to get 2 different energy values before and after the code execution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the second comment: Did you try to profile some code that runs over 1ms or so? to give enough time to get 2 different energy values before and after the code execution.

I added a sleep of 10 seconds and did end up getting output for it. Thank you for pointing that out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the first question.

Energy consumption is a cumulative counter. When we use papi_hl API to regions of code, we would need to measure the counter before and after the code and report the difference as the energy consumed to execute that region of code; thus the delta calculation.
The gpu_inst_power coule be benefit from an average calculation for a region, instead of reporting one value at the end of the region to give a more detailed information.

I understand, a current workaround that I thought of was the following workflow:

PAPI_hl_region_begin()
PAPI_hl_read() // Get your initial energy

// Kernel launch/Work

PAPI_hl_region_end()

The issue I see with this option, is you would then have to manually or programmatically grab the appropriate values and then take the difference.

Could you split this PR into two (I also can if you are currently busy with other work just let me know)? This PR can contain the updates to the NVML component and then the new PR would be updates just for papi_hl.c.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach we have on the PR, helps to profile the energy information of applications with kokkos and kokkos-tools papi connector: https://github.com/kokkos/kokkos-tools/tree/develop/profiling/papi-connector, with no change on the kokkos side. (as a side-note, for the reason we took that route)

Sure, no problem. I can split the PR . I'll move the updates for the papi_hl.c to another one.

/* get cycles for last component */
retval = PAPI_read_ts( _local_components[i].EventSet, _local_components[i].values, &_local_cycles );
}
retval = PAPI_read( _local_components[i].EventSet, _local_components[i].values);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: In the PR description you mentioned that "internal_hl_read_counters that wasn't reading the counters for all components in the machines we tested". What exactly was your setup for this i.e. PAPI configure, system you were on, exported PAPI_EVENTS, etc.

I used the master papi_hl.c with this branch for the three new nvml events you added and all three show with values in the .json I end up creating.

@mpatrou mpatrou mentioned this pull request Jan 6, 2026
3 tasks
@mpatrou mpatrou changed the title Support for NVML counters: GPU energy, instant GPU power, avg memory power, and papi_hl region-based calculations updates Support for NVML counters: GPU energy, instant GPU power, avg memory power Jan 6, 2026
Copy link
Contributor

@Treece-Burgess Treece-Burgess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final testing was done on Methane at ICL (1 * A100) and Athena at Oregon (4 * A100).
In both testing cases ./configure --prefix=$PWD/test-install --with-components="nvml" --with-debug=yes was utilized.

Methane at ICL (1 * A100)

With Cuda Toolkit 12.9:

  • papi_component_avail - ✅ (count was updated from 27 in the master branch to 30 in this branch)
  • papi_native_avail - ✅ (showed the three new nvml native events added in this branch)
  • papi_command_line - ✅ (all three of the new nvml native events were able to be added and collect counter values for)
  • HelloWorld.cu - ✅ (for all three of the new nvml native events the test ran successfully)

Athena at Oregon (4 * A100)

Testing on this machine was only for compilation purposes to make sure that the #if defined's worked properly. To do this I used Cuda Toolkit 12.0 as NVML_POWER_SCOPE_GPU and NVML_POWER_SCOPE_MEMORY do not exist. Results are:

  • PAPI build: ✅
  • papi_native_avail - ✅ (it did not show the events for gpu_memory_avg_power and gpu_inst_power as expected)
  • papi_command_line - ✅ (ran successfully with total_energy_consumption)

@Treece-Burgess Treece-Burgess merged commit c827d2c into icl-utk-edu:master Jan 9, 2026
4 checks passed
@Treece-Burgess
Copy link
Contributor

@mpatrou Thank you for originally submitting the PR and all the changes you made! I did see you created PR #529 and will take a look at that soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants