feat: add jobset_ttr_kill_process DAG for Time-To-Recover (TTR) validation #1173

Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:tpu-obs/release/pr-164 on Feb 4, 2026

@chengpinglin
Contributor

Description

This PR introduces a new Airflow DAG, jobset_ttr_kill_process, designed to validate the Time-To-Recover (TTR) metrics for TPU JobSets. The DAG simulates a workload failure by injecting a fault (killing the main Python process) and monitors the system's ability to recover and log the recovery duration.

Technical Implementation

The DAG follows a structured lifecycle to ensure clean testing environments:

  • Dynamic Provisioning: Builds node pool info from GCS configurations and provisions TPU resources based on MachineConfigMap (e.g., v6e-16).

  • JobSet Deployment: Deploys a JAX TPU benchmark workload using the JobSet operator.

  • Fault Injection: Once the workload is active, the @task kill_tpu_pod_workload uses kubectl exec to run pkill -9 on the Python processes within the worker pods.

  • Metric Observation: Includes a wait task (wait_for_jobset_ttr_to_be_found) to verify that the TTR metric is successfully published to the observability backend.
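The fault-injection and observation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code merged in this PR: `build_kill_command`, `kill_workload`, and `wait_until` are hypothetical helper names, the `python` process pattern and the namespace argument are assumed, and the real DAG wraps this logic in Airflow tasks rather than bare functions.

```python
import subprocess
import time
from typing import Callable, List


def build_kill_command(pod: str, namespace: str, pattern: str = "python") -> List[str]:
    """Build a `kubectl exec` call that SIGKILLs matching processes in a pod.

    Assumption: the workload's main process name matches `pattern`.
    """
    return ["kubectl", "exec", "-n", namespace, pod, "--", "pkill", "-9", pattern]


def kill_workload(
    pods: List[str],
    namespace: str,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Inject the fault into every worker pod.

    check=False is deliberate: killing the container's main process can
    terminate the exec session itself, so a non-zero exit is expected here.
    """
    for pod in pods:
        runner(build_kill_command(pod, namespace), check=False)


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poke_interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or the timeout elapses.

    Stands in for a wait task such as wait_for_jobset_ttr_to_be_found;
    the predicate would query the observability backend for the TTR metric.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poke_interval_s)
```

Passing `runner` and `sleep` as parameters keeps the sketch testable without a cluster: a stub runner can record the commands that would be executed, and a no-op sleep makes the polling loop return immediately.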

Tests

  • GCP Composer name: tony-test (under GCP project: cloud-ml-auto-solutions)
  • GCP Composer version: 2.13.1

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@alfredyu-cienet alfredyu-cienet merged commit cab29a7 into GoogleCloudPlatform:master Feb 4, 2026
7 checks passed
@alfredyu-cienet alfredyu-cienet deleted the tpu-obs/release/pr-164 branch February 4, 2026 09:54