Conversation

@itzhakstern itzhakstern commented Oct 16, 2025

What does this PR do?

Fixes an overflow issue that occurred when running Trainer.fit with validation enabled and a large number of epochs in combination with the ThroughputMonitor callback.

The problem was caused by incorrect computation of the training duration inside the on_validation_end method.
At the end of validation, ThroughputMonitor calculates both the validation duration and the time gap between training and validation in order to exclude it from the throughput calculation.
As part of this, it attempts to determine the total training time for the epoch by summing the values in the _time array.

However, this approach is incorrect because at each step, the _time array stores the cumulative time elapsed since t0, not incremental step durations. Therefore, summing the array results in an exaggerated total.

For example:
If t0 = 0 and an epoch has 5 steps, each taking 1 second, the _time array will look like [0, 1, 2, 3, 4, 5].
Summing this array yields 15 seconds, whereas the actual total training time is only 5 seconds.
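A minimal standalone illustration of the arithmetic (plain Python, not Lightning code):

```python
# _time stores cumulative elapsed time since t0, so summing it
# overstates the epoch's training time.
_time = [0, 1, 2, 3, 4, 5]   # cumulative seconds after each of 5 steps

wrong_total = sum(_time)     # 15 -- inflated; grows without bound over epochs
correct_total = _time[-1]    # 5  -- the actual elapsed training time
```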

Over a sufficiently large number of epochs, this summation error caused the accumulated time to grow without bound, eventually leading to a numeric overflow and runtime failure (as described in the linked issue).

The fix replaces the summation with the last element of _time, which correctly represents the total training time elapsed since t0.

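A hedged sketch of where the change lands (the attribute names mirror the description above, not the exact ThroughputMonitor internals):

```python
# Sketch only: attribute names are illustrative, not the exact internals.
def on_validation_end(self, trainer, pl_module):
    # _time holds the cumulative elapsed time since t0 at each step, so the
    # last entry already equals the epoch's total training time.
    # Before the fix: training_time = sum(self._time)   # inflated total
    training_time = self._time[-1] if self._time else 0.0
    ...
```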

In addition, ValueError: Expected the value to increase could occur in real (non-mocked) runs when validation was interleaved with training.
This happened because t0 was updated after validation, while the _time array still contained values based on the previous reference point.
This issue did not appear in the original tests, since they used mocked time.perf_counter values that always increase monotonically.

To address this, _start() now resets the internal arrays when trainer.state.fn == TrainerFn.FITTING, ensuring the throughput state is reinitialized after validation while still accumulating correctly during fitting.
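A hedged sketch of the reset behavior (the container names are assumptions for illustration; the key point is the TrainerFn.FITTING guard):

```python
# Sketch only: _time and _samples are illustrative names.
from lightning.pytorch.trainer.states import TrainerFn

def _start(self, trainer):
    if trainer.state.fn == TrainerFn.FITTING:
        # Reinitialize the throughput state so cumulative times recorded
        # before validation are not compared against the new reference
        # point t0 (which raised "Expected the value to increase").
        self._time.clear()
        self._samples.clear()
```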

Fixes #21257

Before submitting

Running trainer.fit for a large number of epochs with the ThroughputMonitor callback failed before this fix.

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Oct 16, 2025
@itzhakstern itzhakstern marked this pull request as ready for review October 16, 2025 09:03
train_samples.append(metrics["train|samples"])


def test_throughput_monitor_validation_sum_overflow_real(tmp_path):
SkafteNicki (Collaborator)

Is there a way for us to check that things work as expected? Maybe by mocking the timings to make sure the throughput is as expected.

itzhakstern (Author)

@SkafteNicki I edited the test so that it also checks the reported times, even when running a large number of epochs.
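
For reference, a minimal sketch of the mocking approach mentioned above (not the actual test from this PR):

```python
# Mock time.perf_counter so each call advances by a fixed step, making
# the expected timings and throughput deterministic.
import itertools
import time
from unittest import mock

fake_clock = itertools.count(start=0.0, step=1.0)  # 0.0, 1.0, 2.0, ...

with mock.patch("time.perf_counter", side_effect=fake_clock):
    assert time.perf_counter() == 0.0
    assert time.perf_counter() == 1.0  # each call advances exactly one second
```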


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in ThroughputMonitor.on_validation_end: using sum() instead of last value can corrupt

2 participants