Fix training time calculation: use the last elapsed time instead of summing all elapsed times #21291
What does this PR do?
Fixes an overflow issue that occurred when running `Trainer.fit` with validation enabled and a large number of epochs in combination with the `ThroughputMonitor` callback. The problem was caused by an incorrect computation of the training duration inside the `on_validation_end` method.

At the end of validation, `ThroughputMonitor` calculates both the validation duration and the time gap between training and validation in order to exclude that time from the throughput calculation.
As part of this, it attempts to determine the total training time for the epoch by summing the values in the `_time` array. However, this approach is incorrect because at each step the `_time` array stores the cumulative time elapsed since `t0`, not incremental step durations. Therefore, summing the array results in an exaggerated total.

For example: if `t0 = 0` and an epoch has 5 steps, each taking 1 second, the `_time` array will look like `[0, 1, 2, 3, 4, 5]`. Summing this array yields 15 seconds, whereas the actual total training time is only 5 seconds.

Over a sufficiently large number of epochs, this summation error caused the accumulated time to grow without bound, eventually leading to a numeric overflow and runtime failure (as described in the linked issue).
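
The arithmetic in the example can be checked with a short sketch. It does not use Lightning at all; `elapsed_readings` is a hypothetical helper that only mimics an array of cumulative "elapsed since `t0`" values, with `t0 = 0` assumed.

```python
from itertools import accumulate


def elapsed_readings(step_durations):
    """Hypothetical helper: cumulative time elapsed since t0 = 0.

    Produces one reading at t0 and one after every step, mirroring the
    shape of the array described in the example (not Lightning's code).
    """
    return list(accumulate(step_durations, initial=0.0))


# Five steps of one second each.
elapsed = elapsed_readings([1.0] * 5)

# Summing cumulative readings counts early steps many times over.
wrong_total = sum(elapsed)

# The last cumulative reading already is the total elapsed time.
right_total = elapsed[-1]

print(elapsed)      # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(wrong_total)  # 15.0
print(right_total)  # 5.0
```

With `n` equal steps the summed value grows roughly quadratically in `n`, while the last reading grows linearly, which is why the error compounds as more values are accumulated.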
The fix replaces the summation logic with use of the last element in `_time`, which correctly represents the total time elapsed since `t0`.

In addition, `ValueError: Expected the value to increase` could occur in real (non-mocked) runs when validation was interleaved with training. This happened because `t0` was updated after validation, while the `_time` array still contained values based on the previous reference point. This issue did not appear in the original tests, since they used mocked `time.perf_counter` values that always increase monotonically.

To address this, `_start()` now resets the internal arrays when `trainer.state.fn == TrainerFn.FITTING`, ensuring the throughput state is reinitialized after validation while still accumulating correctly during fitting.

Fixes #21257
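
The stale-reference-point failure can be reproduced in isolation. The following is a minimal sketch under stated assumptions: `ElapsedTracker`, its `start`/`record` methods, and `fake_clock` are hypothetical stand-ins written for illustration, not Lightning's actual classes, and the `reset` flag abstracts the `trainer.state.fn == TrainerFn.FITTING` condition described above.

```python
import time


class ElapsedTracker:
    """Hypothetical stand-in for a throughput tracker (not Lightning's class)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._t0 = clock()
        self._time = []  # cumulative time elapsed since self._t0

    def start(self, reset=True):
        # Move the reference point; optionally drop readings that were
        # measured against the previous reference point.
        self._t0 = self._clock()
        if reset:
            self._time.clear()

    def record(self):
        elapsed = self._clock() - self._t0
        if self._time and elapsed <= self._time[-1]:
            raise ValueError("Expected the value to increase")
        self._time.append(elapsed)
        return elapsed


def fake_clock(readings):
    """Deterministic clock that returns the given readings one by one."""
    it = iter(readings)
    return lambda: next(it)


def run(reset):
    # Clock readings: t0=0, two training steps, new t0=25 after validation,
    # then one more training step at 30.
    tracker = ElapsedTracker(clock=fake_clock([0.0, 10.0, 20.0, 25.0, 30.0]))
    tracker.record()            # 10.0 since the old t0
    tracker.record()            # 20.0 since the old t0
    tracker.start(reset=reset)  # reference point moves after validation
    try:
        tracker.record()        # 5.0 since the new t0
    except ValueError as err:
        return str(err)
    return list(tracker._time)


print(run(reset=False))  # Expected the value to increase
print(run(reset=True))   # [5.0]
```

Without the reset, the new reading of 5.0 is compared against the stale 20.0 and the monotonicity check fails; a clock mock that only ever increases relative to a fixed reference point would never exercise this path.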
Before submitting
Running `trainer.fit` with a large number of epochs and the `ThroughputMonitor` callback failed.
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.