feat: add jobset_ttr_kill_process DAG for Time-To-Recover (TTR) validation #1173
Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master on Feb 4, 2026
This change introduces a new Airflow DAG, `jobset_ttr_kill_process`, designed to validate the Time-To-Recover (TTR) metrics for TPU JobSets. The DAG simulates a workload failure by injecting a fault (killing the main Python process) and monitors the system's ability to recover and log the recovery duration.
alfredyu-cienet approved these changes on Feb 4, 2026
Description
This PR introduces a new Airflow DAG, `jobset_ttr_kill_process`, designed to validate the Time-To-Recover (TTR) metrics for TPU JobSets. The DAG simulates a workload failure by injecting a fault (killing the main Python process) and monitors the system's ability to recover and log the recovery duration.
Technical Implementation
The DAG follows a structured lifecycle to ensure clean testing environments:
Dynamic Provisioning: Builds node pool info from GCS configurations and provisions TPU resources based on MachineConfigMap (e.g., v6e-16).
JobSet Deployment: Deploys a JAX TPU benchmark workload using the JobSet operator.
Fault Injection: Once the workload is active, the `@task` `kill_tpu_pod_workload` uses `kubectl exec` to run `pkill -9` on the Python processes within the worker pods.
Metric Observation: Includes a wait task (`wait_for_jobset_ttr_to_be_found`) to verify that the TTR metric is successfully published to the observability backend.
Tests
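For reviewers who want to reproduce the Fault Injection step from Technical Implementation by hand, here is a minimal sketch of the idea, not the code merged in this PR. The helper names `build_pkill_command` and `kill_workload`, the `default` namespace, the `python` match pattern, and the `dry_run` flag are all assumptions made for illustration; in the DAG this logic lives inside the `@task` `kill_tpu_pod_workload`.

```python
import subprocess
from typing import List


def build_pkill_command(
    pod: str, namespace: str = "default", pattern: str = "python"
) -> List[str]:
    """Build a `kubectl exec` command that sends SIGKILL to matching processes.

    The namespace and pattern defaults are illustrative assumptions.
    """
    return [
        "kubectl", "exec",
        "-n", namespace,
        pod,
        "--",                  # everything after this runs inside the pod
        "pkill", "-9", pattern,
    ]


def kill_workload(
    pods: List[str], namespace: str = "default", dry_run: bool = True
) -> List[List[str]]:
    """Return (and optionally run) one kill command per worker pod."""
    commands = [build_pkill_command(pod, namespace) for pod in pods]
    if not dry_run:
        for cmd in commands:
            # pkill exits non-zero when nothing matched, so do not raise here.
            subprocess.run(cmd, check=False)
    return commands
```

With `dry_run=True` the helper only returns the commands, which makes it easy to inspect what would be executed before pointing it at a live cluster.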
cloud-ml-auto-solutions) 2.13.1
Checklist
Before submitting this PR, please make sure (put X in square brackets):