fix: Bigger eval_interval to avoid timeout #1167

Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:maxtext/release/pr-199
Jan 30, 2026

Conversation

@RUEI4341
Contributor

Description

Training jobs in the DAG have recently been timing out. After reviewing the logs from our jax-tpu container, we identified one main performance issue:

Periodic Slowdowns: Training speed drops significantly about every 100 training steps (e.g., around steps 100, 200, etc., up to the total of 10199 steps). Successful runs show some slowdown at these intervals too, but the delays last much longer in the runs that eventually time out.

Impact: The exacerbated periodic slowdowns are causing our training jobs to exceed their allocated time and fail with timeouts. The logs show runs timing out that would previously have completed.

This PR increases the eval period (`eval_interval`) to 5000 to prevent the periodic slowdowns during training.
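
For a rough sense of scale, here is a minimal sketch that counts how many eval pauses a run would hit for a given `eval_interval`. It rests on assumptions not stated in this PR: that an eval pass runs whenever the step number is a positive multiple of `eval_interval`, and that the previous interval was 100 (inferred only from the slowdown cadence in the logs). The helper names and the `seconds_per_eval` parameter are hypothetical and are not part of MaxText.

```python
def count_eval_pauses(total_steps: int, eval_interval: int) -> int:
    """Count steps in [1, total_steps] that are multiples of eval_interval.

    Assumption: an eval pass runs exactly at those steps. The real
    trigger condition in the training loop may differ slightly.
    """
    if eval_interval <= 0:
        # Treat a non-positive interval as "evaluation disabled".
        return 0
    return total_steps // eval_interval


def eval_overhead_seconds(total_steps: int, eval_interval: int,
                          seconds_per_eval: float) -> float:
    """Estimate total time spent in eval pauses over a whole run."""
    return count_eval_pauses(total_steps, eval_interval) * seconds_per_eval


if __name__ == "__main__":
    total_steps = 10199  # step count quoted in the description
    for interval in (100, 5000):  # 100 is an assumed previous value
        pauses = count_eval_pauses(total_steps, interval)
        print(f"eval_interval={interval}: {pauses} eval pauses")
```

Under these assumptions the run goes from 101 eval pauses to 2, so even if each pause gets slower, its contribution to the total runtime stays small.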

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@alfredyu-cienet alfredyu-cienet merged commit e5a0b4c into GoogleCloudPlatform:master Jan 30, 2026
7 checks passed
@alfredyu-cienet alfredyu-cienet deleted the maxtext/release/pr-199 branch January 30, 2026 08:30