fix: Reschedule TPU observability DAGs to avoid cluster resources con…#1160

Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:tpu-obs/release/pr-197
Jan 28, 2026

Conversation

@YiHao990416
Contributor

Description

Reschedule all TPU observability DAGs to the following times to avoid cluster resource conflicts:

  1. node_pool_status: UTC 18:00
  2. interruption_validation_dag: UTC 18:30
  3. jobset_ttr_pod_delete: UTC 19:00
  4. multi_host_nodepool_rollback_dag: UTC 19:30
  5. tpu_info_format_validation_dags: UTC 20:00
  6. update_node_pool_label: UTC 20:30
  7. node_pool_ttr_disk_size: UTC 21:00
  8. node_pool_ttr_update_label: UTC 21:30
  9. tpu_sdk_monitoring_validation_dag: UTC 22:00
  10. jobset_ttr_rollback: UTC 22:30
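
As a rough illustration of the schedule above, the start times could be expressed as cron strings of the kind an Airflow DAG accepts for its schedule. This is only a sketch: the `SCHEDULE_UTC` and `CRON_BY_DAG` names and the `to_cron` helper are hypothetical, and the daily cadence is an assumption, since this description lists only the time of day. The actual DAG files in this PR may define their schedules differently.

```python
from datetime import time

# Start times (UTC) copied from the list in this PR description.
SCHEDULE_UTC = {
    "node_pool_status": time(18, 0),
    "interruption_validation_dag": time(18, 30),
    "jobset_ttr_pod_delete": time(19, 0),
    "multi_host_nodepool_rollback_dag": time(19, 30),
    "tpu_info_format_validation_dags": time(20, 0),
    "update_node_pool_label": time(20, 30),
    "node_pool_ttr_disk_size": time(21, 0),
    "node_pool_ttr_update_label": time(21, 30),
    "tpu_sdk_monitoring_validation_dag": time(22, 0),
    "jobset_ttr_rollback": time(22, 30),
}


def to_cron(start: time) -> str:
    """Build a five-field cron expression: minute hour day month weekday.

    Assumption: each DAG runs once per day, so the last three fields are '*'.
    """
    return f"{start.minute} {start.hour} * * *"


# Hypothetical lookup table: DAG id -> cron expression.
CRON_BY_DAG = {dag_id: to_cron(start) for dag_id, start in SCHEDULE_UTC.items()}

if __name__ == "__main__":
    for dag_id, cron in CRON_BY_DAG.items():
        print(f"{dag_id}: {cron}")
```

Keeping the times in one table makes it easier to see at a glance that no two DAGs share a slot.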

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@google-cla

google-cla bot commented Jan 28, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

…flict (#197)

This change reschedules all TPU observability DAGs to the following times to avoid cluster resource conflicts:
1. `node_pool_status`: UTC 18:00
2. `interruption_validation_dag`: UTC 18:30
3. `jobset_ttr_pod_delete`: UTC 19:00
4. `multi_host_nodepool_rollback_dag`: UTC 19:30
5. `tpu_info_format_validation_dags`: UTC 20:00
6. `update_node_pool_label`: UTC 20:30
7. `node_pool_ttr_disk_size`: UTC 21:00
8. `node_pool_ttr_update_label`: UTC 21:30
9. `tpu_sdk_monitoring_validation_dag`: UTC 22:00
10. `jobset_ttr_rollback`: UTC 22:30
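
A small sanity check along these lines could confirm that the start times in this commit message are unique and 30 minutes apart. It is a sketch only: `STARTS_UTC`, `gaps`, and `is_evenly_staggered` are hypothetical names and not part of this change. Note that staggered starts prevent overlap only if each DAG finishes within its 30-minute window, which the commit message does not state.

```python
from datetime import datetime, timedelta

# (DAG id, UTC start time) pairs copied from the commit message, in order.
STARTS_UTC = [
    ("node_pool_status", "18:00"),
    ("interruption_validation_dag", "18:30"),
    ("jobset_ttr_pod_delete", "19:00"),
    ("multi_host_nodepool_rollback_dag", "19:30"),
    ("tpu_info_format_validation_dags", "20:00"),
    ("update_node_pool_label", "20:30"),
    ("node_pool_ttr_disk_size", "21:00"),
    ("node_pool_ttr_update_label", "21:30"),
    ("tpu_sdk_monitoring_validation_dag", "22:00"),
    ("jobset_ttr_rollback", "22:30"),
]


def gaps(starts):
    """Return the time difference between each consecutive pair of starts."""
    times = [datetime.strptime(hhmm, "%H:%M") for _, hhmm in starts]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def is_evenly_staggered(starts, step=timedelta(minutes=30)):
    """True if every consecutive pair of starts is exactly `step` apart."""
    return all(gap == step for gap in gaps(starts))


if __name__ == "__main__":
    print("evenly staggered:", is_evenly_staggered(STARTS_UTC))
```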
alfredyu-cienet merged commit ab51f36 into GoogleCloudPlatform:master on Jan 28, 2026
7 checks passed
alfredyu-cienet deleted the tpu-obs/release/pr-197 branch on January 28, 2026 09:54

2 participants