bug(trainer): wait_for_job_status silently allows polling_interval == timeout #400

@prabindersinghh

What happened?

Bug Description

In KubernetesBackend.wait_for_job_status():

if polling_interval > timeout:
    raise ValueError(
        f"Polling interval {polling_interval} must be less than timeout: {timeout}"
    )

When polling_interval == timeout, this guard passes, but the poll count then becomes:

round(timeout / polling_interval)  # round(10/10) = 1

The job is polled exactly once, with no retry. The error message says
"must be less than", but the guard accepts equal values, so the message
and the code contradict each other.
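To make the arithmetic concrete, here is a minimal standalone sketch. It assumes the number of polling iterations is `round(timeout / polling_interval)`, as quoted above; `poll_attempts` is a hypothetical helper used only for illustration and is not part of the SDK.

```python
def poll_attempts(timeout: int, polling_interval: int) -> int:
    """Return the iteration count, assuming it is round(timeout / polling_interval)."""
    return round(timeout / polling_interval)


# Equal values pass the current `>` guard and leave a single poll.
print(poll_attempts(10, 10))  # 1
# A shorter interval gives the retries a caller would normally expect.
print(poll_attempts(10, 2))   # 5
```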

Steps to Reproduce

from kubeflow.trainer import TrainerClient

client = TrainerClient()
client.wait_for_job_status("my-job", timeout=10, polling_interval=10)
# Passes validation but only polls once: silent wrong behavior

Expected Behavior

ValueError raised when polling_interval >= timeout.

Proposed Fix

Change > to >= in the validation guard. One-line fix.
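A rough sketch of what the corrected guard could look like, pulled out into a standalone function so it can be run in isolation. The name `validate_polling` is hypothetical; in the SDK the check lives inline in `KubernetesBackend.wait_for_job_status()`.

```python
def validate_polling(timeout: int, polling_interval: int) -> None:
    """Reject any polling interval that is not strictly less than the timeout."""
    # `>=` instead of `>` so that equal values are rejected too,
    # which matches the wording of the error message.
    if polling_interval >= timeout:
        raise ValueError(
            f"Polling interval {polling_interval} must be less than timeout: {timeout}"
        )


# Equal values now fail fast instead of silently polling once.
try:
    validate_polling(timeout=10, polling_interval=10)
except ValueError as exc:
    print(f"rejected: {exc}")

# A strictly smaller interval still passes.
validate_polling(timeout=10, polling_interval=2)
```

A regression test for the real method would presumably assert the same thing: a `ValueError` for `timeout=10, polling_interval=10`, and no error for a strictly smaller interval.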

What did you expect to happen?

ValueError raised when polling_interval >= timeout, matching the documented constraint "must be less than timeout".

Environment

Kubernetes version:

$ kubectl version

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"

Kubeflow Python SDK version:

$ pip show kubeflow

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍
