Fix: jobset based push node_pool_name to nodeselector#1187

Merged
alfredyu-cienet merged 1 commit into GoogleCloudPlatform:master from CIeNET-International:tpu-obs/release/pr-208
Feb 12, 2026

@Chrisliao0806
Contributor

Description

This PR fixes a critical scheduling issue where JobSet pods were being dispatched to arbitrary node pools instead of the intended ones, and extends the fix to support multi-node-pool environments.

Problem: Previously, the JobSet YAML template had no nodeSelector for the GKE node pool. In environments with multiple node pools, Kubernetes would schedule pods on any available nodes. This led to cases where a Rollback was performed on Node Pool A, but the JobSet pods were actually running on Node Pool B, resulting in inaccurate TTR (Time To Recovery) metrics. Additionally, for DAGs like tpu_info_format_validation_dag that create two node pools for the same workload, pinning to a single node pool name would leave the second pool unused.

Technical Implementation

Added a dynamic $node_pool_selector to the JobSet YAML template that supports scheduling across multiple node pools:

  • Uses a custom label, tpu-observability/jobset-group: <jobset_name>, which is applied to all participating node pools via gcloud --node-labels at creation time. Selecting on this label lets the Kubernetes scheduler dispatch pods across every pool in the group, and only those pools.
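
The label-based selector described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the label key and the `$node_pool_selector` placeholder come from the description, and it assumes the template is rendered with Python's `string.Template`. The trimmed template body, the `workers` job name, and the helper functions `node_pool_selector`, `node_labels_flag`, and `render_jobset` are hypothetical.

```python
from string import Template

# Label key taken from the PR description; everything else here is illustrative.
JOBSET_GROUP_LABEL = "tpu-observability/jobset-group"

# Hypothetical, heavily trimmed stand-in for the JobSet YAML template
# (containers and other required fields are omitted for brevity).
JOBSET_TEMPLATE = Template(
    """\
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: $jobset_name
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        template:
          spec:
            nodeSelector:
              $node_pool_selector
"""
)


def node_pool_selector(jobset_name: str) -> str:
    """Return the nodeSelector entry matching every node pool in the group."""
    return f"{JOBSET_GROUP_LABEL}: {jobset_name}"


def node_labels_flag(jobset_name: str) -> str:
    """Return the gcloud flag that applies the group label at pool creation."""
    return f"--node-labels={JOBSET_GROUP_LABEL}={jobset_name}"


def render_jobset(jobset_name: str) -> str:
    """Fill the template so the pods select on the shared group label."""
    return JOBSET_TEMPLATE.substitute(
        jobset_name=jobset_name,
        node_pool_selector=node_pool_selector(jobset_name),
    )


if __name__ == "__main__":
    print(node_labels_flag("example-jobset"))
    print(render_jobset("example-jobset"))
```

Because every participating pool is created with the same label value, one selector covers both the single-pool and the two-pool case, without pinning the pods to a specific node pool name.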

Test

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

…to desired NodePool

This change fixes a critical scheduling issue where JobSet pods were being dispatched to arbitrary node pools instead of the intended ones, and extends the fix to support multi-node-pool environments.

Problem: Previously, the JobSet YAML template had no nodeSelector for the GKE node pool. In environments with multiple node pools, Kubernetes would schedule pods on any available nodes. This led to cases where a Rollback was performed on Node Pool A, but the JobSet pods were actually running on Node Pool B, resulting in inaccurate TTR (Time To Recovery) metrics. Additionally, for DAGs like `tpu_info_format_validation_dag` that create two node pools for the same workload, pinning to a single node pool name would leave the second pool unused.
@alfredyu-cienet alfredyu-cienet merged commit 0ff7f02 into GoogleCloudPlatform:master Feb 12, 2026
7 checks passed
@alfredyu-cienet alfredyu-cienet deleted the tpu-obs/release/pr-208 branch February 12, 2026 10:20