Feat: add jobset_ttr_node_pool_resize DAG for v6e recovery validation #1163

Merged
alfredyu-cienet merged 2 commits into GoogleCloudPlatform:master from
CIeNET-International:tpu-obs/release/pr-194
Jan 29, 2026
@chiajunglien
Contributor

Description

This PR implements the jobset_ttr_node_pool_resize DAG. The test validates the self-healing behavior and recovery metrics of TPU JobSets when the underlying node pool undergoes a disruptive change.

Technical Implementation

  • Dynamic Provisioning: Builds node pool info from GCS configurations and provisions TPU resources based on MachineConfigMap (e.g., v6e-16).
  • JobSet Deployment: Deploys a JAX TPU benchmark workload using the JobSet operator.
  • Fault Injection: The @task update_node_pool_disk_size runs gcloud container node-pools update to change the disk size (e.g., from 100GB to 200GB), forcing a Rolling Update (Surge Upgrade) on the node pool.
  • Detection Phase: Monitors the JobSet's reaction as nodes are recreated one after another. The JobSet is expected to eventually stabilize with all pods running on the new 200GB nodes.
  • Performance Metric: Measures the Time To Recovery (TTR).
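
The fault injection and TTR steps above can be sketched as follows. This is a minimal illustration, not the DAG's actual code: build_resize_command and time_to_recovery are hypothetical helper names, the exact gcloud flags used by the task are not shown in this PR description, and TTR is assumed here to be the interval between triggering the resize and the JobSet reporting all pods ready again.

```python
from datetime import datetime, timezone


def build_resize_command(project_id, cluster_name, location, node_pool_name, disk_size_gb):
    """Return the argv list for a node pool disk resize (hypothetical helper).

    The flag set is an assumption; the task in this PR may pass different flags.
    """
    return [
        "gcloud", "container", "node-pools", "update", node_pool_name,
        f"--project={project_id}",
        f"--cluster={cluster_name}",
        f"--location={location}",
        f"--disk-size={disk_size_gb}GB",
        "--quiet",  # skip interactive confirmation prompts
    ]


def time_to_recovery(disrupted_at, recovered_at):
    """Return TTR in seconds, assuming TTR = recovered_at - disrupted_at."""
    if recovered_at < disrupted_at:
        raise ValueError("recovery timestamp precedes the disruption timestamp")
    return (recovered_at - disrupted_at).total_seconds()


# Example using the defaults listed under "Required Variables".
cmd = build_resize_command(
    project_id="cienet-cmcs",
    cluster_name="tpu-observability-automation-dev",
    location="us-central1",
    node_pool_name="jobset-ttr-node-pool-resize-v6e",
    disk_size_gb=200,
)
ttr_seconds = time_to_recovery(
    datetime(2026, 1, 29, 12, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 1, 29, 12, 1, 30, tzinfo=timezone.utc),
)
```

Building the command as an argv list rather than a shell string avoids quoting problems if a variable ever contains spaces or shell metacharacters.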

Airflow/Composer

Required Variables

  • Cluster Information (This DAG requires an existing cluster)

    • PROJECT_ID: The GCP project ID where the cluster resides. (Defaults to cienet-cmcs)
    • CLUSTER_NAME: The name of the target GKE cluster. (Defaults to tpu-observability-automation-dev)
    • LOCATION: The region of the GKE cluster. (Defaults to us-central1)
  • Node Pool Configurations

    • NODE_POOL_NAME: The base name for the new node pool. (Defaults to jobset-ttr-node-pool-resize-v6e)
    • NODE_LOCATIONS: The zone for the nodes in the normal test path. (Defaults to us-central1-b)
    • NUM_NODES: The number of nodes to create in the pool. (Defaults to 4)
    • MACHINE_TYPE: The machine type for the GKE nodes. (Defaults to ct6e-standard-4t)
    • TPU_TOPOLOGY: The TPU topology for the node pool. (Defaults to 4x4)
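
As a rough sketch of how the variables above might be resolved with their defaults, the snippet below merges overrides onto the documented values. The names DEFAULTS, resolve_config, and expected_nodes are hypothetical and do not come from the DAG; in Airflow the overrides would typically come from Variables, which is left out here so the example stays self-contained. The consistency check assumes 4 TPU chips per ct6e-standard-4t node.

```python
# Defaults copied from the variable list in this PR description.
DEFAULTS = {
    "PROJECT_ID": "cienet-cmcs",
    "CLUSTER_NAME": "tpu-observability-automation-dev",
    "LOCATION": "us-central1",
    "NODE_POOL_NAME": "jobset-ttr-node-pool-resize-v6e",
    "NODE_LOCATIONS": "us-central1-b",
    "NUM_NODES": 4,
    "MACHINE_TYPE": "ct6e-standard-4t",
    "TPU_TOPOLOGY": "4x4",
}


def resolve_config(overrides=None):
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    config = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown variable: {key}")
        config[key] = value
    # Variables often arrive as strings, so normalize the node count.
    config["NUM_NODES"] = int(config["NUM_NODES"])
    return config


def expected_nodes(topology, chips_per_node=4):
    """Node count implied by a topology such as '4x4' (assumes 4 chips per node)."""
    chips = 1
    for dim in topology.lower().split("x"):
        chips *= int(dim)
    return chips // chips_per_node


config = resolve_config({"NUM_NODES": "4"})
# A 4x4 topology has 16 chips, which at 4 chips per node matches NUM_NODES = 4.
consistent = expected_nodes(config["TPU_TOPOLOGY"]) == config["NUM_NODES"]
```

Under these assumptions the defaults are self-consistent: a 4x4 topology (16 chips, matching the v6e-16 example) needs 4 nodes.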

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

alfredyu-cienet and others added 2 commits January 29, 2026 12:03
This change implements a new DAG `jobset_ttr_node_pool_resize`. This test validates the self-healing and recovery metrics of TPU JobSets when the underlying infrastructure undergoes a disruptive change.
@alfredyu-cienet alfredyu-cienet merged commit 8bb2f2e into GoogleCloudPlatform:master Jan 29, 2026
7 checks passed
@alfredyu-cienet alfredyu-cienet deleted the tpu-obs/release/pr-194 branch January 29, 2026 08:06