
Conversation

@jiannanWang (Contributor) commented Dec 15, 2025

This PR adds power-aware benchmarking for GPU kernels in BackendBench.
It introduces:

  • PowerManager: Collects GPU power, temperature, and frequency data during benchmarks.
  • do_bench_power: Benchmarks kernel execution while measuring energy consumption.
  • PerformanceTestResult: Now records total energy used per test.
  • Power plots: Generates CSV and visualizations for power, temperature, and frequency.
  • Dependencies: Adds nvidia-ml-py and matplotlib for power monitoring and plotting.

Together, these enable precise measurement of energy usage for GPU kernels, supporting energy-efficient optimization.
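
A minimal sketch of the sampling-and-integration idea behind such a power manager (the class and method names here are hypothetical, not the PR's actual API; on real hardware `read_power_w` would wrap `pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0` from nvidia-ml-py):

```python
import time
import threading

class PowerSampler:
    """Polls a power-reading callable in a background thread and
    integrates the samples into total energy in joules.

    `read_power_w` is any callable returning watts; on a GPU it would
    wrap pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0.
    """
    def __init__(self, read_power_w, interval_s=0.01):
        self.read_power_w = read_power_w
        self.interval_s = interval_s
        self.samples = []          # list of (timestamp_s, watts)
        self._stop = threading.Event()
        self._thread = None

    def __enter__(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    def _loop(self):
        while not self._stop.is_set():
            self.samples.append((time.perf_counter(), self.read_power_w()))
            time.sleep(self.interval_s)

    def total_energy_j(self):
        # Trapezoidal integration of power over time -> joules
        energy = 0.0
        for (t0, p0), (t1, p1) in zip(self.samples, self.samples[1:]):
            energy += 0.5 * (p0 + p1) * (t1 - t0)
        return energy

# Try the harness with a fake constant 100 W reading (no GPU needed)
with PowerSampler(lambda: 100.0, interval_s=0.001) as s:
    time.sleep(0.05)   # stands in for the kernel under test
print(f"{s.total_energy_j():.3f} J")   # roughly 100 W * 0.05 s ~ 5 J
```

The context-manager shape makes it easy to wrap an existing benchmark loop without restructuring it; the real implementation would also record temperature and clock frequency per sample.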

Example output (note the "total_energy" entries):

  {
    "op_name": "add.Scalar",
    "args": "((T([100, 1], i64), 1,), {})",
    "speedup": 1.0037302273210138,
    "total_energy": 0.0011545763355,
    "benchmark_time_ms": 0.006022225360552894,
    "reference_time_ms": 0.006044689630126131,
    "error_msg": "",
    "successfully_ran": true,
    "test_type": "performance"
  },
  {
    "op_name": "add.Tensor",
    "args": "((T([128100, 1536], f16), T([128100, 1536], f16),), {})",
    "speedup": 0.999044778094025,
    "total_energy": 0.13822618472830014,
    "benchmark_time_ms": 0.5277284108675443,
    "reference_time_ms": 0.5272243131290782,
    "error_msg": "",
    "successfully_ran": true,
    "test_type": "performance"
  },
  {
    "op_name": "add.Tensor",
    "args": "((T([256, 1024, 1024], f16), T([256, 1024, 1024], f16),), {})",
    "speedup": 1.0008341558163996,
    "total_energy": 0.1871731542646,
    "benchmark_time_ms": 0.7144741316636404,
    "reference_time_ms": 0.7150701144162346,
    "error_msg": "",
    "successfully_ran": true,
    "test_type": "performance"
  },
  {
    "op_name": "add_.Tensor",
    "args": "((T([128, 512, 28, 28], f16), T([128, 512, 28, 28], f16),), {})",
    "speedup": 1.0001593311388208,
    "total_energy": 0.03350459256860003,
    "benchmark_time_ms": 0.1444975220835301,
    "reference_time_ms": 0.14452054503828043,
    "error_msg": "",
    "successfully_ran": true,
    "test_type": "performance"
  },

@meta-cla bot added the CLA Signed label on Dec 15, 2025
@jiannanWang jiannanWang marked this pull request as ready for review December 15, 2025 23:13
@msaroufim (Member)

So the way I'd go about this before adding a utility is to take a couple of pytorch operators on H100 and B200, plot their runtimes in a loop and track the temperature and power draw respectively and see what we can learn

@jiannanWang (Contributor, Author)

> So the way I'd go about this before adding a utility is to take a couple of pytorch operators on H100 and B200, plot their runtimes in a loop and track the temperature and power draw respectively and see what we can learn

Sounds great! I'll do the experiment.
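The suggested experiment could be sketched as a harness like the one below (all names here are hypothetical, not part of BackendBench). The stats reader is injected so the loop structure can be tried without a GPU; on real hardware it would wrap pynvml calls such as `nvmlDeviceGetPowerUsage` and `nvmlDeviceGetTemperature`:

```python
import time

def run_thermal_sweep(op, iters, read_stats):
    """Run `op` repeatedly, recording runtime plus GPU stats per iteration.

    `read_stats` is any callable returning a dict. On real hardware it
    could wrap nvidia-ml-py, e.g.:
        h = pynvml.nvmlDeviceGetHandleByIndex(0)
        {"power_w": pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0,
         "temp_c": pynvml.nvmlDeviceGetTemperature(
             h, pynvml.NVML_TEMPERATURE_GPU)}
    """
    rows = []
    for _ in range(iters):
        t0 = time.perf_counter()
        op()                       # the operator under test
        rows.append({"runtime_s": time.perf_counter() - t0, **read_stats()})
    return rows

# Stand-in workload and fake stats reader, just to exercise the loop
rows = run_thermal_sweep(lambda: sum(range(10_000)),
                         iters=5,
                         read_stats=lambda: {"power_w": 250.0, "temp_c": 60})
print(len(rows), rows[0]["power_w"])
```

Plotting `runtime_s` against `temp_c` over a long sweep (e.g. with matplotlib) would reveal whether thermal throttling inflates later iterations, which is exactly the effect the experiment is meant to surface.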
