Add energy efficiency tracking index to llama-bench tests #4297

RafaAguilar · 2023-12-02T18:14:39Z


While reporting the Apple M-series metrics (perplexity and tokens per second), I thought it would be nice to also track energy consumption, so that we can compare not only performance but also the energy cost of the performance gained as cores and memory scale.

I have done a small POC, and for a llama-bench run with three models I got the following result:

Of course, this is a very raw result, as it is only a POC. Much more information can be extracted from the data source (powermetrics); here is a sample of one datapoint:

{'is_delta': True,
 'elapsed_ns': 1003612708,
 'hw_model': 'Mac13,1',
 'kern_osversion': '23B81',
 'kern_bootargs': '',
 'kern_boottime': 1701259079,
 'timestamp': datetime.datetime(2023, 12, 2, 16, 55, 39),
 'gpu': {'freq_hz': 643.424,
  'idle_ns': 586788125,
  'idle_ratio': 0.584201,
  'dvfm_states': [{'freq': 389, 'used_ns': 7373208, 'used_ratio': 0.0073407},
   {'freq': 486, 'used_ns': 0, 'used_ratio': 0.0},
   {'freq': 648, 'used_ns': 410266541, 'used_ratio': 0.408458},
   {'freq': 778, 'used_ns': 0, 'used_ratio': 0.0},
   {'freq': 972, 'used_ns': 0, 'used_ratio': 0.0},
   {'freq': 1296, 'used_ns': 0, 'used_ratio': 0.0}],
  'sw_requested_state': [{'sw_req_state': 'P1',
    'used_ns': 0,
    'used_ratio': 0.0},
   {'sw_req_state': 'P2', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_req_state': 'P3', 'used_ns': 417857000, 'used_ratio': 1.0},
   {'sw_req_state': 'P4', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_req_state': 'P5', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_req_state': 'P6', 'used_ns': 0, 'used_ratio': 0.0}],
  'sw_state': [{'sw_state': 'SW_P1', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_state': 'SW_P2', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_state': 'SW_P3', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_state': 'SW_P4', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_state': 'SW_P5', 'used_ns': 0, 'used_ratio': 0.0},
   {'sw_state': 'SW_P6', 'used_ns': 0, 'used_ratio': 0.0}],
  'gpu_energy': 117}}
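
As a rough illustration of what can be derived from a single datapoint, the sketch below recomputes two quantities from the sample above. In this sample the reported freq_hz appears to be close to the average of the dvfm_states frequencies weighted by used_ns; whether powermetrics defines it exactly that way is an assumption. The average power calculation likewise assumes that gpu_energy is reported in millijoules. The function names are made up for this example.

```python
# One GPU sample, abridged from the datapoint above (zero-time states omitted).
sample = {
    "elapsed_ns": 1003612708,
    "gpu": {
        "freq_hz": 643.424,
        "dvfm_states": [
            {"freq": 389, "used_ns": 7373208},
            {"freq": 648, "used_ns": 410266541},
        ],
        "gpu_energy": 117,
    },
}


def active_weighted_freq(gpu):
    """Average DVFM frequency, weighted by the time spent in each state."""
    active_ns = sum(s["used_ns"] for s in gpu["dvfm_states"])
    if active_ns == 0:
        return 0.0
    weighted = sum(s["freq"] * s["used_ns"] for s in gpu["dvfm_states"])
    return weighted / active_ns


def average_power_mw(sample):
    """Mean GPU power over the sample, ASSUMING gpu_energy is in millijoules."""
    elapsed_s = sample["elapsed_ns"] / 1e9
    return sample["gpu"]["gpu_energy"] / elapsed_s  # mJ / s = mW


print(active_weighted_freq(sample["gpu"]))  # close to the reported freq_hz
print(average_power_mw(sample))
```

Dividing by elapsed_ns rather than by the nominal 1000 ms interval keeps the power figure correct when a sample runs slightly long, as this one does.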

Here is the quick-and-dirty Python code used to create the graph:

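import plistlib

import matplotlib.pyplot as plt
import pandas

# powermetrics writes one complete plist document per sample into the same file
with open("sample.out") as f:
    raw_samples = f.read().split("</plist>")

# The last chunk is whatever follows the final </plist>, so skip it
records = []
for raw_sample in raw_samples[:-1]:
    # Restore the closing tag and strip the newline/NUL separators between samples
    xml = (raw_sample + "</plist>").strip("\n\x00")
    records.append(plistlib.loads(xml.encode("utf-8")))

# Assumes gpu_energy is in mJ: dividing by 1000 gives joules per sample,
# which at the 1 s sampling interval is roughly the average power in watts
df = pandas.DataFrame(
    {
        "frequency": [r["gpu"]["freq_hz"] for r in records],
        "energy": [r["gpu"]["gpu_energy"] / 1000 for r in records],
    }
)

# Plot the first 100 samples: frequency on the left axis, energy on the right
ax1 = df[0:100].frequency.plot(color='r', marker='x')
ax2 = df[0:100].energy.plot(secondary_y=True, color='k', marker='o')

ax1.set_ylabel("Frequency (MHz)")
ax2.set_ylabel("Energy per 1 s sample (J, roughly W)")
plt.show()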
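
Beyond plotting, the same list of records could be reduced to a single number per test. The following is a minimal sketch, not part of the POC: it assumes gpu_energy is the energy used during each sample in millijoules, and the function name, the n_tokens parameter, and the demo values are all invented for illustration. Energy per token is only one candidate for the index.

```python
def energy_summary(records, n_tokens=None):
    """Reduce powermetrics GPU records to total energy and average power.

    ASSUMPTION: gpu_energy is the energy used during the sample, in mJ.
    """
    total_j = sum(r["gpu"]["gpu_energy"] for r in records) / 1000.0
    total_s = sum(r["elapsed_ns"] for r in records) / 1e9
    summary = {
        "energy_j": total_j,
        "energy_wh": total_j / 3600.0,  # 1 Wh = 3600 J
        "avg_power_w": total_j / total_s if total_s else 0.0,
    }
    if n_tokens:
        # Candidate per-test index: joules spent per processed token
        summary["joules_per_token"] = total_j / n_tokens
    return summary


# Made-up records, only to show the shape of the input.
demo = [
    {"elapsed_ns": 1_000_000_000, "gpu": {"gpu_energy": 2000}},
    {"elapsed_ns": 1_000_000_000, "gpu": {"gpu_energy": 4000}},
]
print(energy_summary(demo, n_tokens=128))
```

Note that this only counts what the gpu_power sampler reports, so it would understate the energy used by the whole system.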

And here is the powermetrics command line used to track the test:

sudo nice -n 10 powermetrics --samplers gpu_power -f plist -i 1000 > sample.out

llama-bench command used to generate this data:

./llama-bench \
  -m ../llama/llama-2-7b/ggml-model-f16.gguf \
  -m ../models/llama2/llama-2-7b.Q8_0.gguf \
  -m ../models/llama2/llama-2-7b.Q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

powermetrics offers many more metrics through its samplers. I only used gpu_power, as it is the juiciest one, but there is also thermal tracking that could be useful for benchmarking.

In order to make this work we would need to:

- start the powermetrics tool in its own process when the test starts, and stop it when the test finishes
- capture and process the data in C++ and determine what the metric/index for each test should be (watt-hours?)
- optionally dump the raw data into a file so it can be post-processed for further study.
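
The first step could be prototyped outside llama-bench before committing to a C++ implementation. Below is a minimal Python sketch of the start/stop logic only. The function name run_with_sampler is hypothetical, and the demo uses stand-in commands so that it runs on any machine. With the real tools, the script is assumed to run as root, because an unprivileged parent may not be allowed to signal a sudo child.

```python
import os
import subprocess
import sys
import tempfile


def run_with_sampler(bench_cmd, sampler_cmd, out_path):
    """Run bench_cmd while sampler_cmd writes to out_path; return bench exit code."""
    with open(out_path, "wb") as out:
        sampler = subprocess.Popen(sampler_cmd, stdout=out)
        try:
            # Discard stderr, like `2> /dev/null` in the llama-bench command
            bench = subprocess.run(bench_cmd, stderr=subprocess.DEVNULL)
        finally:
            # Stop the sampler even if the benchmark fails or is interrupted
            sampler.terminate()
            try:
                sampler.wait(timeout=5)
            except subprocess.TimeoutExpired:
                sampler.kill()
                sampler.wait()
    return bench.returncode


# Stand-in commands for the demo. On a Mac, run as root, they would be e.g.:
#   sampler_cmd = ["powermetrics", "--samplers", "gpu_power",
#                  "-f", "plist", "-i", "1000"]
#   bench_cmd = ["./llama-bench", "-m", "../models/llama2/llama-2-7b.Q4_0.gguf",
#                "-p", "512", "-n", "128", "-ngl", "99"]
with tempfile.TemporaryDirectory() as tmp:
    code = run_with_sampler(
        bench_cmd=[sys.executable, "-c", "print('benchmark done')"],
        sampler_cmd=[sys.executable, "-c", "import time; time.sleep(60)"],
        out_path=os.path.join(tmp, "sample.out"),
    )
    print("benchmark exit code:", code)
```

Stopping the sampler in a finally block means the raw data file is still closed cleanly when the benchmark crashes, which matters for the optional post-processing step.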

If there is enough interest in it, I could use some spare time to work on it.

Related projects:

https://github.com/tlkh/asitop/blob/main/asitop/parsers.py

brozkrut · 2023-12-03T14:31:57Z


M2 Max Studio, 8+4 CPU, 38 GPU, 96 GB RAM

% ./llama-bench -m models/llama-7b-v2/ggml-model-f16.gguf -m models/llama-7b-v2/ggml-model-q8_0.gguf -m models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model	size	params	backend	ngl	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	pp 512	756.64 ± 0.40
llama 7B mostly F16	12.55 GiB	6.74 B	Metal	99	tg 128	24.68 ± 0.03
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	pp 512	677.76 ± 0.15
llama 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	99	tg 128	41.86 ± 0.02
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	pp 512	671.35 ± 0.18
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	99	tg 128	65.89 ± 0.06

build: 8e672ef (1550)

% sudo nice -n 10 powermetrics --samplers gpu_power -f plist -i 1000 > sample.out


easp · 2023-12-05T03:18:00Z


This is a cool idea.

Before going too far with it, though, try looking at system power consumption during inference using https://github.com/exelban/stats. There seems to be a significant, load-dependent draw that isn't accounted for by the cpu_power or gpu_power samplers. I don't know if it's the power draw from RAM or something else. I also saw similar results using iStat Menus, an older commercial utility that Stats kind of mimics.

1 reply

AndreasKunar Apr 29, 2025

OK, this is an old but very interesting thread, and thanks for pointing out the other power-metering tools (I only knew asitop). I will try to look more deeply into this.

I also measured power draw on a Jetson Orin NX (which is built for energy efficiency, even though it's now two NVIDIA GPU generations old), and I also found that there are significant power-drawing components besides the CPU/GPU, likely RAM during tg.
