fix(trainer): validate polling_interval is strictly less than timeout in ContainerBackend#412
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR addresses a missing input validation in ContainerBackend.wait_for_job_status() by adding a guard that raises ValueError when polling_interval >= timeout, preventing scenarios where the polling loop would not adequately check job status. The fix is intended to match the established pattern in the kubernetes and localprocess backends.
Changes:
- Add validation at the start of
ContainerBackend.wait_for_job_status()to check thatpolling_intervalis less thantimeout - Add a new test case to verify the validation raises
ValueErrorwhenpolling_intervalequalstimeout
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kubeflow/trainer/backends/container/backend.py | Adds input validation for polling_interval/timeout constraint |
| kubeflow/trainer/backends/container/backend_test.py | Adds test case for polling_interval=timeout scenario |
| polling_interval: int = 2, | ||
| callbacks: list[Callable[[types.TrainJob], None]] | None = None, | ||
| ) -> types.TrainJob: | ||
| if polling_interval >= timeout: |
There was a problem hiding this comment.
The validation uses polling_interval >= timeout but the kubernetes backend (line 463) and localprocess backend (line 229) both use polling_interval > timeout. This inconsistency means the container backend will reject valid inputs that the other two backends accept (when polling_interval equals timeout). Align this validation with the other backends by changing >= to >.
| if polling_interval >= timeout: | |
| if polling_interval > timeout: |
Summary
ContainerBackend.wait_for_job_statuswas missing input validation that both the kubernetes and localprocess backends already have.The Bug
When
polling_interval >= timeout, the time-based while loop exits before the firsttime.sleep()completes — job status is checked zero times — and aTimeoutErrorfires immediately with a misleading message. No error is raised for clearly invalid input.The Fix
Added a
ValueErrorguard matching the established pattern in the other two backends.Validation
polling_interval == timeoutFixes #411
Related: #402