
fix: Prevent DEADLINE_EXCEEDED errors in multi-slice TPU jobs #1166

Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:maxtext/release/pr-171
Jan 30, 2026

@AidenYu1673 (Contributor) commented Jan 30, 2026

Problem

Multi-slice TPU tasks were failing with DEADLINE_EXCEEDED and Heartbeat timeout errors. Log analysis showed a 327-second compilation window, which exceeded the default JAX RPC timeout, so nodes were dropped before the training loop started.

Solution

Raised the JAX coordination service heartbeat timeout so that tasks stay registered during heavy compilation:

JAX_COORDINATION_SERVICE_HEARTBEAT_TIMEOUT_SECONDS=1200: Prevents task termination during intensive computation.
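A minimal sketch of how the setting above could be applied and sanity-checked, not the actual change in this PR. The variable name, the value 1200, and the 327-second window come from the description; the helper names `build_worker_env` and `headroom_factor` are hypothetical, and whether a given JAX release reads this environment variable is an assumption that should be verified against the installed version.

```python
import os

# Name and value as given in this PR's description. Whether the installed
# JAX release honors this variable is an assumption to verify.
HEARTBEAT_ENV_VAR = "JAX_COORDINATION_SERVICE_HEARTBEAT_TIMEOUT_SECONDS"
HEARTBEAT_TIMEOUT_SECONDS = 1200

# Compilation window reported by the log analysis in the Problem section.
OBSERVED_COMPILE_SECONDS = 327


def build_worker_env(base_env=None):
    """Return a copy of the environment with the heartbeat timeout set.

    Hypothetical helper: the real job definition may inject the variable
    through its own container or pod configuration instead.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[HEARTBEAT_ENV_VAR] = str(HEARTBEAT_TIMEOUT_SECONDS)
    return env


def headroom_factor(timeout_seconds, compile_seconds):
    """Return how many times longer the timeout is than the compile window."""
    return timeout_seconds / compile_seconds


env = build_worker_env({})
factor = headroom_factor(HEARTBEAT_TIMEOUT_SECONDS, OBSERVED_COMPILE_SECONDS)
print(f"{HEARTBEAT_ENV_VAR}={env[HEARTBEAT_ENV_VAR]}")
print(f"timeout is {factor:.1f}x the observed compilation window")
```

The variable has to be present in each worker's environment before the process that starts the JAX distributed runtime is launched; setting it afterwards would not be expected to have any effect.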

Tests

Updated version: jax_ai_image_tpu_e2e_time10_prcache

Fixes

b/475103757

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

alfredyu-cienet merged commit 50e896e into GoogleCloudPlatform:master Jan 30, 2026
27 checks passed
alfredyu-cienet deleted the maxtext/release/pr-171 branch January 30, 2026 08:25