
Conversation


@michaelbenayoun michaelbenayoun commented Sep 29, 2025

What does this PR do?

This PR introduces a training metrics collection system.

  • Plugin-based architecture: Modular system with ThroughputPlugin, MFUPlugin, EfficiencyPlugin, and ComponentTimingPlugin
  • Moving window statistics: Configurable window size for real-time metrics calculation
  • Hardware detection: Automatic TRN1/TRN2 platform detection with correct peak FLOPS values
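The moving-window statistics idea can be sketched as follows (a minimal illustration; `MovingWindow` is a hypothetical name, not one of the PR's actual plugin classes):

```python
from collections import deque


class MovingWindow:
    """Keep the last `window_size` samples and report window statistics.

    Illustrative sketch only; the PR's real plugins may differ.
    """

    def __init__(self, window_size: int = 10):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window_size)

    def record(self, value: float) -> None:
        self.samples.append(value)

    def mean(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


# Example: per-step tokens/second samples averaged over a window of 3.
window = MovingWindow(window_size=3)
for tps in [1000.0, 1100.0, 1200.0, 1300.0]:
    window.record(tps)
print(window.mean())  # mean of the last 3 samples: 1200.0
```

With a configurable `window_size`, the reported value tracks recent steps rather than the whole run, which is what makes the metrics "real-time".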

Core Metrics Implementation

  • Throughput metrics: Tokens/second calculation with proper data parallel scaling
  • Model FLOPS Utilization (MFU): System-wide MFU calculation using the PaLM paper formula: 6N + 12·L·H·Q·T FLOPS per token
  • Training efficiency: Breakdown of time spent on forward/backward/optimizer vs overhead
  • Component timing: Individual timing for forward pass, backward pass, optimizer step, and total step
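The MFU computation described above can be sketched as follows. The formula is the one the PR cites from the PaLM paper; the concrete numbers below are illustrative placeholders, not measurements:

```python
def flops_per_token(N: int, L: int, H: int, Q: int, T: int) -> int:
    """PaLM-style FLOPS-per-token estimate: 6N for the dense parameters
    plus 12*L*H*Q*T for attention (layers * heads * head_dim * seq_len)."""
    return 6 * N + 12 * L * H * Q * T


def mfu(tokens_per_second: float, flops_per_tok: int, peak_flops: float) -> float:
    """Achieved FLOPS divided by the hardware's aggregate peak FLOPS."""
    return tokens_per_second * flops_per_tok / peak_flops


# Illustrative shapes for an ~8B-parameter model; peak_flops assumes a
# hypothetical 95.5 peak TFLOPS per core across 32 cores.
fpt = flops_per_token(N=8_000_000_000, L=36, H=32, Q=128, T=4096)
print(mfu(tokens_per_second=10_000, flops_per_tok=fpt, peak_flops=32 * 95.5e12))
```

The result is a ratio between 0 and 1: the fraction of the system's theoretical peak that the training loop actually sustains.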

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun michaelbenayoun marked this pull request as ready for review October 15, 2025 11:57
Comment on lines 16 to 18
MODEL_NAME="Qwen/Qwen3-8B" # Change this to the desired model name
# MODEL_NAME="Qwen/Qwen3-8B" # Change this to the desired model name
MODEL_NAME="Qwen/Qwen3-0.6B" # Change this to the desired model name
Collaborator
I think you should revert it to Qwen3-8B

Member Author

The changes to this file were a mistake, not intentional; I am reverting them.

# https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html#compute
HARDWARE_TFLOPS = {
"trn1": {
"fp32": 48 / 2,
Collaborator

why / 2?

Member Author

The performance figures are given per chip; we need them per core.

Collaborator

ok, consider adding a comment
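A comment along these lines would capture why the division is there (a sketch; the figures echo the snippet under review and should be checked against the Neuron docs linked in the code):

```python
# Peak TFLOPS figures in the AWS Neuron hardware docs are quoted PER CHIP.
# Each Trainium (trn1) chip exposes 2 NeuronCores, and the metrics collector
# works per core, hence the division.
CORES_PER_CHIP = 2

HARDWARE_TFLOPS = {
    "trn1": {
        "fp32": 48 / CORES_PER_CHIP,
        "bf16": 191 / CORES_PER_CHIP,
    },
}
```

Naming the divisor makes the intent self-documenting instead of leaving a bare `/ 2` in the table.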


N = collector.model_params
L, H, Q, T = collector.num_layers, collector.num_heads, collector.head_dim, collector.seq_length
flops_per_token = 6 * N + 12 * L * H * Q * T
Collaborator

can you explain better where this calculation comes from?

Member Author

It is from the PaLM paper; I will add the reference in the code.


N = collector.model_params
L, H, Q, T = collector.num_layers, collector.num_heads, collector.head_dim, collector.seq_length
flops_per_token = 6 * N + 12 * L * H * Q * T
Collaborator

same here (maybe you can even centralize it in a method)

Member Author

Done
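Centralizing the duplicated calculation could look like this (a sketch: the attribute names follow the snippets above, while `MetricsCollector` and the method name are hypothetical):

```python
class MetricsCollector:
    """Minimal stand-in for the PR's collector, holding model shape info."""

    def __init__(self, model_params: int, num_layers: int, num_heads: int,
                 head_dim: int, seq_length: int):
        self.model_params = model_params
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.seq_length = seq_length

    def flops_per_token(self) -> int:
        # PaLM-style estimate: 6N + 12*L*H*Q*T FLOPS per token.
        N = self.model_params
        L, H, Q, T = self.num_layers, self.num_heads, self.head_dim, self.seq_length
        return 6 * N + 12 * L * H * Q * T
```

Both call sites would then use `collector.flops_per_token()` instead of repeating the formula.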

Comment on lines +33 to +36
(8, 1, 1),
(32, 8, 1),
(32, 1, 4),
(32, 8, 4),
Collaborator

couldn't we just test the last set of params? Otherwise this can make the CI even longer (it's 1h24 already!)

Member Author

It is not that long, and I'd rather be sure that the metrics are properly computed. This test does not run any forward pass, backward pass, or optimizer step; it is a lightweight one.
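The four tuples read like (world_size, tensor_parallel_size, pipeline_parallel_size) configurations; that meaning is an assumption, since the test file defines it. A minimal sketch of the kind of invariant such a lightweight test can check without any forward/backward pass:

```python
# Assumed tuple meaning: (world_size, tp_size, pp_size). In pytest this list
# would typically feed @pytest.mark.parametrize.
CONFIGS = [
    (8, 1, 1),
    (32, 8, 1),
    (32, 1, 4),
    (32, 8, 4),
]


def data_parallel_size(world_size: int, tp_size: int, pp_size: int) -> int:
    # The data-parallel degree is what remains after tensor and pipeline parallelism.
    assert world_size % (tp_size * pp_size) == 0, "sizes must divide the world size"
    return world_size // (tp_size * pp_size)


for ws, tp, pp in CONFIGS:
    dp = data_parallel_size(ws, tp, pp)
    assert dp * tp * pp == ws
    print((ws, tp, pp), "-> dp =", dp)
```

Covering all four tuples exercises pure tensor parallelism, pure pipeline parallelism, and their combination, which is the author's rationale for keeping the full set despite CI time.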

"bf16": 191 / 2,
},
"trn2": {
"fp32": 181 / 2,
Collaborator

Trn2 chips have 8 cores per device, which by default are grouped in pairs into 4 virtual devices.

Member Author

Thanks, I have updated that!

@JingyaHuang (Collaborator) left a comment

It's huge! Before merging it, do you know why the distributed training CI is failing? Can we fix it, @michaelbenayoun?

@JingyaHuang (Collaborator) left a comment

lgtm, thanks Michael!

@michaelbenayoun michaelbenayoun merged commit 1dfe4da into main Oct 31, 2025
5 checks passed
@michaelbenayoun michaelbenayoun deleted the metrics branch October 31, 2025 14:59