@aniketpati1121
Contributor

This PR implements the behavior requested in issue #182.

Previously, trainer.get_job_logs(job_id, follow=True) exited immediately if the pod did not yet exist or was still pending. This made it difficult for users to follow logs immediately after submitting a job, because pods are usually created asynchronously.

What this PR adds

When follow=True, the backend now waits for the pod to be created and to leave the Pending state.
Adds a simple polling loop with a 120-second timeout and a 2-second poll interval.
Preserves the old behavior for follow=False: the call returns immediately if no pod exists.
No API changes; fully backward compatible.
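
The waiting behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `wait_for_pod_ready` and the `get_pod_phase` callback are hypothetical names, and only the 120-second timeout and 2-second interval come from the description. The sleep and clock functions are injectable so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional

POD_WAIT_TIMEOUT_SECONDS = 120  # timeout stated in the PR description
POD_POLL_INTERVAL_SECONDS = 2   # poll interval stated in the PR description


def wait_for_pod_ready(
    get_pod_phase: Callable[[], Optional[str]],
    timeout: float = POD_WAIT_TIMEOUT_SECONDS,
    interval: float = POD_POLL_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the pod exists and has left Pending, or the timeout expires.

    get_pod_phase is a hypothetical callback: it returns None while no pod
    exists, otherwise the pod phase string (for example "Pending" or "Running").
    Returns the first non-Pending phase, or None if the timeout is reached.
    """
    deadline = clock() + timeout
    while True:
        phase = get_pod_phase()
        if phase is not None and phase != "Pending":
            return phase
        if clock() >= deadline:
            return None
        sleep(interval)
```

A caller following logs would invoke a helper like this only when follow=True and start streaming once it returns a phase; with follow=False the wait is skipped, which keeps the old return-immediately behavior.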

Why this is needed

Users commonly want to follow logs right after submitting a TrainingJob.
With the previous behavior, they needed to implement custom waiting logic.
This PR aligns the trainer experience with typical Kubernetes log-following behavior.

Testing

All existing tests pass (162 passed).
No breaking changes.
Manually tested locally.

Fixes #182.

Signed-off-by: Aniket Patil <aniketpatil2027@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Aniket Patil <aniketpatil2027@gmail.com>
@coveralls

Pull Request Test Coverage Report for Build 19798151671

Details

  • 5 of 21 (23.81%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 66.314%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| kubeflow/trainer/backends/kubernetes/backend.py | 5 | 21 | 23.81% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| kubeflow/trainer/backends/kubernetes/backend.py | 1 | 75.19% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 19715816162: | -0.3% |
| Covered Lines: | 2506 |
| Relevant Lines: | 3779 |
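
The percentages in this report can be cross-checked from the raw line counts. The snippet below is only an illustrative check; the variable names are made up here and the numbers are copied from the report.

```python
# Line counts copied from the coverage report above.
covered_new, total_new = 5, 21            # changed or added lines in backend.py
covered_all, relevant_all = 2506, 3779    # totals for the whole build

# Coverage is covered lines divided by relevant lines, as a percentage.
new_line_coverage = round(100 * covered_new / total_new, 2)
overall_coverage = round(100 * covered_all / relevant_all, 3)

print(f"new-line coverage: {new_line_coverage}%")
print(f"overall coverage:  {overall_coverage}%")
```

Both results agree with the reported 23.81% and 66.314%.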

💛 - Coveralls

@aniketpati1121
Contributor Author

aniketpati1121 commented Dec 7, 2025

Hi @szaher @kramaranya
This PR implements waiting for the pod to be running when follow=True as discussed in #182.
Looking forward to your review.

Development

Successfully merging this pull request may close these issues.

trainer get_job_logs should wait for pods to be running when follow=True
