Fix: Unify TPU SSH mechanism to resolve race conditions #1182

Merged: alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:maxtext/release/pr-210 on Feb 11, 2026

AidenYu1673 (Contributor) commented on Feb 11, 2026

PR Description

This PR transitions the TPU SSH connection mechanism from ephemeral key injection to a persistent OS Login architecture. By leveraging long-lived SSH keys stored in the Service Account's OS Login profile, we eliminate the race conditions (409 Conflict) frequently encountered when running multiple concurrent TPU tasks in Airflow.
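The PR does not show where the 409 Conflict originates. As a toy illustration only, the sketch below assumes it comes from concurrent read-modify-write updates to a shared key store that rejects stale writes. `KeyStore`, `ConflictError`, and both helper functions are hypothetical names and do not appear in the PR or in any Google Cloud API.

```python
class ConflictError(Exception):
    """Stands in for an HTTP 409 Conflict response (hypothetical)."""


class KeyStore:
    """Hypothetical shared key store guarded by a version counter."""

    def __init__(self):
        self.version = 0
        self.keys = []

    def read(self):
        # Return the version together with a snapshot of the keys.
        return self.version, list(self.keys)

    def write(self, expected_version, keys):
        # Reject any write that was computed from a stale snapshot.
        if expected_version != self.version:
            raise ConflictError("409 Conflict: store changed since read")
        self.keys = keys
        self.version += 1


def inject_ephemeral_keys(store, task_names):
    """Simulate concurrent tasks: every task reads first, then every task writes."""
    snapshots = [(name, *store.read()) for name in task_names]
    conflicts = []
    for name, version, keys in snapshots:
        try:
            store.write(version, keys + [f"ephemeral-key-{name}"])
        except ConflictError:
            conflicts.append(name)
    return conflicts


def use_persistent_key(store, task_names):
    """Each task only reads the pre-registered key, so no write can conflict."""
    return [store.read()[1][0] for _ in task_names]
```

In this model only the first writer succeeds when several tasks inject keys at once, while tasks that share one pre-registered key never write and so never conflict.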

Key Change: Simplified Login Strategy

Previously, the code tried to detect at runtime whether to use OS Login or traditional SSH keys. That detection added unnecessary complexity and was prone to detection failures. This PR removes it and unifies the login mechanism on OS Login.

Please ensure the following Airflow Variables are configured in the production environment:

  • os-login-ssh-private-key, os-login-ssh-user, os-login-ssh-public-key
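A minimal sketch of how a task might check that all three variables are present before attempting a connection. The variable names come from the list above; `OsLoginSshConfig`, `load_os_login_ssh_config`, and the `get_variable` callable are hypothetical and are not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Variable names as listed in the PR description.
REQUIRED_VARIABLES = (
    "os-login-ssh-private-key",
    "os-login-ssh-user",
    "os-login-ssh-public-key",
)


@dataclass(frozen=True)
class OsLoginSshConfig:
    """Hypothetical container for the three OS Login settings."""

    private_key: str
    user: str
    public_key: str


def load_os_login_ssh_config(
    get_variable: Callable[[str], Optional[str]],
) -> OsLoginSshConfig:
    """Fetch all required variables and fail early if any is missing or empty."""
    values = {name: get_variable(name) for name in REQUIRED_VARIABLES}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("Missing Airflow Variables: " + ", ".join(missing))
    return OsLoginSshConfig(
        private_key=values["os-login-ssh-private-key"],
        user=values["os-login-ssh-user"],
        public_key=values["os-login-ssh-public-key"],
    )
```

Taking the getter as a parameter keeps the sketch runnable without Airflow installed. In a real DAG, one would presumably pass a small wrapper around Airflow's `Variable.get` that returns `None` for unset keys.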

Tests

jax_functional_tests2
jax_functional_tests3

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

alfredyu-cienet merged commit ba360e1 into GoogleCloudPlatform:master on Feb 11, 2026
27 checks passed
alfredyu-cienet deleted the maxtext/release/pr-210 branch on Feb 11, 2026, 10:05