Commit 68f4a34

Add a new DAG that validates data provided by the tpu-info CLI (#1190)
This change adds a DAG that automates cross-validation of TPU performance data. It checks that the metrics reported directly from the hardware are consistent with the data ingested into the cloud monitoring pipeline. The verification suite covers metrics such as TPU Utilization, TensorCore Activity, Memory Usage, and Latency, and is designed to extend automatically as new metric strategies are added to the validation library.
1 parent ae3b22d commit 68f4a34
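
The cross-validation described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code added by this commit: the `MetricStrategy` class, the `STRATEGIES` registry, the `verify` function, the metric keys, and the tolerance values are all hypothetical names and numbers chosen for the example. It only shows the general idea of comparing a value read from the hardware with the value seen in monitoring, one strategy per metric, so that adding a strategy adds a check.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricStrategy:
  """Hypothetical description of one metric to cross-check."""
  name: str
  rel_tolerance: float  # assumed allowed relative difference between sources


# Hypothetical registry: appending an entry here adds one more verification.
STRATEGIES = [
    MetricStrategy("tpu_utilization", 0.05),
    MetricStrategy("tensorcore_activity", 0.05),
    MetricStrategy("memory_usage", 0.02),
    MetricStrategy("latency", 0.10),
]


def verify(
    cli_values: dict[str, float],
    monitoring_values: dict[str, float],
    strategies: list[MetricStrategy] = STRATEGIES,
) -> dict[str, bool]:
  """Returns, per metric, whether the two sources agree within tolerance."""
  results: dict[str, bool] = {}
  for strategy in strategies:
    cli = cli_values.get(strategy.name)
    mon = monitoring_values.get(strategy.name)
    if cli is None or mon is None:
      # A metric missing from either source counts as a failed check.
      results[strategy.name] = False
      continue
    results[strategy.name] = math.isclose(
        cli, mon, rel_tol=strategy.rel_tolerance
    )
  return results


if __name__ == "__main__":
  from_cli = {"tpu_utilization": 80.0, "memory_usage": 10.0}
  from_monitoring = {"tpu_utilization": 82.0, "memory_usage": 10.1}
  print(verify(from_cli, from_monitoring))
```

Keeping the per-metric details in a registry rather than in the comparison loop is one way to get the "scales as strategies are added" property the commit message mentions. The actual validation library may be organized differently.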

8 files changed: +1059 -10 lines changed

dags/common/scheduling_helper/scheduling_helper.py

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ class DayOfWeek(enum.Enum):
         "tpu_sdk_monitoring_validation": dt.timedelta(minutes=30),
         "jobset_ttr_kill_process": dt.timedelta(minutes=90),
         "jobset_uptime_validation": dt.timedelta(minutes=90),
+        "tpu_info_metrics_verification": dt.timedelta(minutes=30),
     },
 }

dags/tpu_observability/interruption_validation_dag.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -162,7 +162,7 @@ def fetch_interruption_metric_records(
162162

163163
# key: resource_name, value: EventRecord
164164
event_records: dict[str, EventRecord] = {}
165-
response = gcp_util.query_time_series(
165+
response = gcp_util.list_time_series(
166166
project_id=configs.project_id,
167167
filter_str=metric_filter,
168168
start_time=time_util.TimeUtil.from_unix_seconds(time_range.start),
