
fix(trainer): validate polling_interval is strictly less than timeout in ContainerBackend #412

Open
Varun-39 wants to merge 1 commit into kubeflow:main from Varun-39:fix/container-polling-interval-validation

Conversation

@Varun-39

Summary

ContainerBackend.wait_for_job_status was missing input validation that both the kubernetes and localprocess backends already have.

The Bug

When polling_interval >= timeout, the time-based while loop exits before the first time.sleep() completes, so job status is checked zero times and a TimeoutError fires immediately with a misleading message. The clearly invalid input is never rejected up front with a ValueError.
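
The failure mode can be reproduced in isolation with a fake clock. The sketch below is not the ContainerBackend code: FakeClock, poll_sleep_first, and the sleep-then-check loop shape are assumptions chosen to match the behaviour described above, and the job is assumed never to reach its target status.

```python
class FakeClock:
    """Deterministic stand-in for time.time() / time.sleep() (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        # Advance the fake clock instead of blocking.
        self.now += seconds


def poll_sleep_first(clock: FakeClock, timeout: int, polling_interval: int) -> int:
    """Assumed loop shape: sleep, then check only if the deadline has not passed.

    Returns the number of status checks performed before the loop gives up.
    """
    deadline = clock.time() + timeout
    checks = 0
    while clock.time() < deadline:
        clock.sleep(polling_interval)
        if clock.time() >= deadline:
            break  # the deadline passed during the sleep, so no check happens
        checks += 1  # stands in for a real job status lookup
    return checks


checks_valid = poll_sleep_first(FakeClock(), timeout=10, polling_interval=2)
checks_equal = poll_sleep_first(FakeClock(), timeout=10, polling_interval=10)
checks_larger = poll_sleep_first(FakeClock(), timeout=10, polling_interval=30)
```

Under these assumptions, a 2-second interval with a 10-second timeout performs four checks, while an interval equal to or larger than the timeout performs none before the loop gives up.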

The Fix

Added a ValueError guard matching the established pattern in the other two backends.
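
A minimal sketch of such a guard, using the polling_interval >= timeout comparison from this PR's diff. The standalone function name validate_polling and the error message wording are illustrative assumptions, not the text used in any of the backends.

```python
def validate_polling(timeout: int, polling_interval: int) -> None:
    """Reject a polling interval that is not strictly less than the timeout.

    Hypothetical helper: in the PR the check sits at the top of
    ContainerBackend.wait_for_job_status rather than in a separate function.
    """
    if polling_interval >= timeout:
        # Message wording is an assumption made for this sketch.
        raise ValueError(
            f"polling_interval ({polling_interval}) must be less than "
            f"timeout ({timeout})"
        )


# A valid combination passes silently.
validate_polling(timeout=600, polling_interval=2)

# An interval equal to the timeout is rejected before any polling starts.
rejected_message = None
try:
    validate_polling(timeout=5, polling_interval=5)
except ValueError as exc:
    rejected_message = str(exc)
```

Failing fast with a ValueError tells the caller that the arguments are wrong, rather than surfacing later as a timeout that suggests the job itself was slow.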

Validation

  • Added unit test for polling_interval == timeout
  • 43/43 tests pass

Fixes #411
Related: #402

Copilot AI review requested due to automatic review settings March 20, 2026 14:33
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a missing input validation in ContainerBackend.wait_for_job_status() by adding a guard that raises ValueError when polling_interval >= timeout, preventing scenarios where the polling loop would not adequately check job status. The fix is intended to match the established pattern in the kubernetes and localprocess backends.

Changes:

  • Add validation at the start of ContainerBackend.wait_for_job_status() to check that polling_interval is less than timeout
  • Add a new test case to verify the validation raises ValueError when polling_interval equals timeout

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| kubeflow/trainer/backends/container/backend.py | Adds input validation for polling_interval/timeout constraint |
| kubeflow/trainer/backends/container/backend_test.py | Adds test case for polling_interval=timeout scenario |

        polling_interval: int = 2,
        callbacks: list[Callable[[types.TrainJob], None]] | None = None,
    ) -> types.TrainJob:
        if polling_interval >= timeout:

Copilot AI Mar 20, 2026


The validation uses polling_interval >= timeout but the kubernetes backend (line 463) and localprocess backend (line 229) both use polling_interval > timeout. This inconsistency means the container backend will reject valid inputs that the other two backends accept (when polling_interval equals timeout). Align this validation with the other backends by changing >= to >.

Suggested change
- if polling_interval >= timeout:
+ if polling_interval > timeout:
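
The two comparisons differ only at the boundary where polling_interval equals timeout. The sketch below makes that concrete. The helper is_rejected and the sample values are illustrative assumptions, not taken from any backend's tests.

```python
import operator
from typing import Callable

# (polling_interval, timeout) pairs: clearly valid, boundary, clearly invalid.
cases = [(2, 600), (5, 5), (10, 5)]


def is_rejected(compare: Callable[[int, int], bool], polling_interval: int, timeout: int) -> bool:
    """Return True when the guard built from `compare` would raise ValueError."""
    return compare(polling_interval, timeout)


# Guard as written in this PR: polling_interval >= timeout
with_ge = [is_rejected(operator.ge, p, t) for p, t in cases]

# Guard as suggested in the review: polling_interval > timeout
with_gt = [is_rejected(operator.gt, p, t) for p, t in cases]

# Only the boundary case (5, 5) is treated differently by the two guards.
differing = [case for case, a, b in zip(cases, with_ge, with_gt) if a != b]
```

Here with_ge is [False, True, True], with_gt is [False, False, True], and differing contains only (5, 5), so the choice of operator decides whether an interval equal to the timeout is accepted.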

Development

Successfully merging this pull request may close these issues.

bug(trainer): ContainerBackend.wait_for_job_status missing polling_interval validation

2 participants