Conversation

@itzhakstern itzhakstern commented Oct 16, 2025

What does this PR do?

Fixes an overflow issue that occurred when running Trainer.fit with validation enabled and a large number of epochs in combination with the ThroughputMonitor callback.

The problem was caused by incorrect computation of the training duration inside the on_validation_end method.
At the end of validation, ThroughputMonitor calculates both the validation duration and the time gap between training and validation in order to exclude it from the throughput calculation.
As part of this, it attempts to determine the total training time for the epoch by summing the values in the _time array.

However, this approach is incorrect because at each step, the _time array stores the cumulative time elapsed since t0, not incremental step durations. Therefore, summing the array results in an exaggerated total.

For example:
If t0 = 0 and an epoch has 5 steps, each taking 1 second, the _time array will look like [0, 1, 2, 3, 4, 5].
Summing this array yields 15 seconds, whereas the actual total training time is only 5 seconds.
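A minimal standalone illustration of the arithmetic (plain Python, not Lightning code):

```python
# _time stores cumulative elapsed time since t0, so summing it
# overstates the epoch's training time.
_time = [0, 1, 2, 3, 4, 5]   # cumulative seconds after each of 5 steps

wrong_total = sum(_time)     # 15 -- inflated; grows without bound over epochs
correct_total = _time[-1]    # 5  -- the actual elapsed training time
```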

Over a sufficiently large number of epochs, this summation error caused the accumulated time to grow without bound, eventually leading to a numeric overflow and runtime failure (as described in the linked issue).

The fix replaces the summation with the last element of _time, which correctly represents the total training time elapsed since t0.

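A hedged sketch of where the change lands (the attribute names mirror the description above, not the exact ThroughputMonitor internals):

```python
# Sketch only: attribute names are illustrative, not the exact internals.
def on_validation_end(self, trainer, pl_module):
    # _time holds the cumulative elapsed time since t0 at each step, so the
    # last entry already equals the epoch's total training time.
    # Before the fix: training_time = sum(self._time)   # inflated total
    training_time = self._time[-1] if self._time else 0.0
    ...
```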

In addition, ValueError: Expected the value to increase could occur in real (non-mocked) runs when validation was interleaved with training.
This happened because t0 was updated after validation, while the _time array still contained values based on the previous reference point.
This issue did not appear in the original tests, since they used mocked time.perf_counter values that always increase monotonically.

To address this, _start() now resets the internal arrays when trainer.state.fn == TrainerFn.FITTING, ensuring the throughput state is reinitialized after validation while still accumulating correctly during fitting.
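A hedged sketch of the reset behavior (the container names are assumptions for illustration; the key point is the TrainerFn.FITTING guard):

```python
# Sketch only: _time and _samples are illustrative names.
from lightning.pytorch.trainer.states import TrainerFn

def _start(self, trainer):
    if trainer.state.fn == TrainerFn.FITTING:
        # Reinitialize the throughput state so cumulative times recorded
        # before validation are not compared against the new reference
        # point t0 (which raised "Expected the value to increase").
        self._time.clear()
        self._samples.clear()
```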

Fixes #21257

Before submitting

Running trainer.fit for a large number of epochs with the ThroughputMonitor callback failed before this fix.

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Oct 16, 2025
@itzhakstern itzhakstern marked this pull request as ready for review October 16, 2025 09:03
train_samples.append(metrics["train|samples"])


def test_throughput_monitor_validation_sum_overflow_real(tmp_path):
SkafteNicki (Collaborator)

Is there a way for us to check that things work as expected? Maybe by mocking the timings to make sure the throughput is as expected.

itzhakstern (Author)

@SkafteNicki I edited the test so that it also checks the reported times, even when running a large number of epochs.
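
For reference, a minimal sketch of the mocking approach mentioned above (not the actual test from this PR):

```python
# Mock time.perf_counter so each call advances by a fixed step, making
# the expected timings and throughput deterministic.
import itertools
import time
from unittest import mock

fake_clock = itertools.count(start=0.0, step=1.0)  # 0.0, 1.0, 2.0, ...

with mock.patch("time.perf_counter", side_effect=fake_clock):
    assert time.perf_counter() == 0.0
    assert time.perf_counter() == 1.0  # each call advances exactly one second
```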


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in ThroughputMonitor.on_validation_end: using sum() instead of last value can corrupt

2 participants