
SoumyaRaikwar

What this PR does / why we need it

This PR adds the kube_deployment_spec_topology_spread_constraints metric that counts the number of topology spread constraints defined in a deployment's pod template specification.
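
As a rough illustration of what a count-style metric reports, the sketch below reads a Deployment object as a plain dictionary and counts the entries under `spec.template.spec.topologySpreadConstraints`. The helper name `count_topology_spread_constraints` and the sample deployment are assumptions made for illustration; the PR itself implements this in Go inside kube-state-metrics and may differ in detail.

```python
from typing import Any, Mapping


def count_topology_spread_constraints(deployment: Mapping[str, Any]) -> int:
    """Return the number of topology spread constraints in the pod template.

    Hypothetical helper: it mirrors the idea of the count metric, not the
    actual Go code in the PR.
    """
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    # The field is optional; a missing or null value counts as zero constraints.
    return len(pod_spec.get("topologySpreadConstraints") or [])


# Illustrative deployment with two constraints (sample data, not from the PR).
deployment = {
    "metadata": {"namespace": "default", "name": "web"},
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                    },
                    {
                        "maxSkew": 2,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "ScheduleAnyway",
                    },
                ]
            }
        },
    },
}

print(count_topology_spread_constraints(deployment))
```

A deployment without the field yields 0, which is the case the later alerting discussion relies on.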

This PR addresses the topology spread constraints part of issue #2701, which requested visibility into scheduling primitives, including "pod topology spread constraints", so that operators can monitor how workload pods are distributed.

Which issue(s) this PR fixes

Solves topology spread constraints monitoring from #2701 - Add schedule spec and status for workload

Problem Solved

Issue #2701 identified that operators need to monitor various scheduling primitives to detect when "break variation may happen because pod priority preemption or node pressure eviction."

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 10, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 10, 2025
@SoumyaRaikwar SoumyaRaikwar changed the title Add kube_deployment_spec_topology_spread_constraints metric for issue #2701 feat: Add kube_deployment_spec_topology_spread_constraints metric for issue #2701 Aug 10, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 10, 2025
@mrueg
Member

mrueg commented Aug 13, 2025

How would you use this metric for alerting or to provide info about the deployment?

@SoumyaRaikwar
Author

> How would you use this metric for alerting or to provide info about the deployment?

These topology spread constraint metrics support alerting on workload distribution policies. kube_deployment_spec_topology_spread_constraints > 0 selects deployments that define spread constraints, which is a starting point for detecting workloads that still become unevenly distributed across zones or nodes during resource pressure.

You can alert on missing distribution policies with (kube_deployment_spec_replicas > 1) and (kube_deployment_spec_topology_spread_constraints == 0) to identify multi-replica deployments that lack any spread configuration.

For dashboards, count(kube_deployment_spec_topology_spread_constraints > 0) shows cluster-wide adoption of topology spread policies, complementing the pod affinity/anti-affinity metrics I implemented in PR #2733.
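
To make the matching behaviour of the two expressions above concrete, here is a small sketch that emulates them over in-memory samples. It assumes both series carry the same `namespace` and `deployment` labels, so the `and` operator pairs them one to one; the function names and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# A series is identified here by (namespace, deployment); this label set is an
# assumption about how the two metrics line up for the `and` operator.
Key = Tuple[str, str]


def missing_spread_policy(
    replicas: Dict[Key, float], constraints: Dict[Key, float]
) -> List[Key]:
    """Emulate `(replicas > 1) and (constraint_count == 0)`.

    A deployment is reported only if it appears in both inputs, has more than
    one replica, and has a constraint count of zero.
    """
    return sorted(
        key
        for key, count in constraints.items()
        if count == 0 and replicas.get(key, 0) > 1
    )


def adoption_count(constraints: Dict[Key, float]) -> int:
    """Emulate `count(constraint_count > 0)`: deployments with any constraint."""
    return sum(1 for count in constraints.values() if count > 0)


# Illustrative samples (not real cluster data).
replicas = {("default", "web"): 3, ("default", "api"): 2, ("default", "cron"): 1}
constraints = {("default", "web"): 2, ("default", "api"): 0, ("default", "cron"): 0}

print(missing_spread_policy(replicas, constraints))
print(adoption_count(constraints))
```

Note that the single-replica deployment is not reported even though it has no constraints, which is the intent of the `replicas > 1` filter.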

During incidents, these metrics help correlate why workloads became concentrated in specific topology domains or why pods failed to schedule due to overly restrictive spread policies.

This completes the scheduling observability suite from issue #2701 - together with my pod affinity/anti-affinity metrics (PR #2733), operators now have full visibility into both co-location/separation rules AND even distribution policies across cluster topology. Thanks @mrueg!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2025
@mrueg
Member

mrueg commented Aug 14, 2025

The same comment as in the other PR (#2733) applies here: we should have explicit metrics per kube_deployment_topology_spread_constraint{} and not simply count a length.

@SoumyaRaikwar
Author

Add explicit topology spread constraint metric for deployments
Replaces kube_deployment_spec_topology_spread_constraints (count-only) with kube_deployment_spec_topology_spread_constraint that exposes detailed labels per constraint: topology_key, max_skew, when_unsatisfiable, min_domains, and label_selector.

This enables precise Prometheus queries and better observability of constraint configurations across deployments.
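
A minimal sketch of the per-constraint shape described above: one label set per entry in `topologySpreadConstraints`, carrying the five labels named in this comment. The `namespace` and `deployment` labels, the string encoding of `label_selector` (sorted `matchLabels` only, ignoring `matchExpressions`), and the empty string for an unset `min_domains` are all assumptions for illustration; the Go implementation in the PR may encode these differently.

```python
from typing import Any, Dict, List, Mapping, Optional


def format_selector(selector: Optional[Mapping[str, Any]]) -> str:
    """Flatten matchLabels into a sorted 'k=v,k=v' string (assumed encoding)."""
    match_labels = (selector or {}).get("matchLabels") or {}
    return ",".join(f"{k}={v}" for k, v in sorted(match_labels.items()))


def constraint_label_sets(deployment: Mapping[str, Any]) -> List[Dict[str, str]]:
    """Build one label set per topology spread constraint (hypothetical helper)."""
    meta = deployment.get("metadata", {})
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    label_sets = []
    for c in pod_spec.get("topologySpreadConstraints") or []:
        min_domains = c.get("minDomains")
        label_sets.append({
            "namespace": meta.get("namespace", ""),
            "deployment": meta.get("name", ""),
            "topology_key": c.get("topologyKey", ""),
            "max_skew": str(c.get("maxSkew", "")),
            "when_unsatisfiable": c.get("whenUnsatisfiable", ""),
            # minDomains is optional in the Kubernetes API; unset becomes "".
            "min_domains": "" if min_domains is None else str(min_domains),
            "label_selector": format_selector(c.get("labelSelector")),
        })
    return label_sets


# Illustrative deployment (sample data, not from the PR).
example = {
    "metadata": {"namespace": "default", "name": "web"},
    "spec": {"template": {"spec": {"topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "minDomains": 2,
            "labelSelector": {"matchLabels": {"tier": "frontend", "app": "web"}},
        },
        {
            "maxSkew": 2,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "ScheduleAnyway",
        },
    ]}}},
}

for labels in constraint_label_sets(example):
    print(labels)
```

With this shape, a query can filter on a single label such as `when_unsatisfiable`, which a length-only metric cannot express.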

Thanks @mrueg for suggesting to make this metric explicit! ✓ Tests updated ✓ Documentation updated

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 19, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Instrumentation Aug 26, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2025
@SoumyaRaikwar SoumyaRaikwar force-pushed the add-deployment-topology-spread-constraints-metric branch from a96beb0 to 679d205 Compare August 30, 2025 22:51
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 30, 2025
@rexagod
Member

rexagod commented Sep 4, 2025

/assign @mrueg
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 4, 2025
@rexagod rexagod moved this from Needs Triage to Needs Review (PR) or Response (Issue) in SIG Instrumentation Sep 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SoumyaRaikwar
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrueg. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 7, 2025
@SoumyaRaikwar
Author

Hi @mrueg,
When you have a moment, could you please review my recent PR?

- Add a new metric that counts topology spread constraints in deployment pod templates
- Include test coverage for both cases (with and without constraints)
- Follow existing patterns and stability guidelines
- Add document start markers (---) to all YAML files
- Fix indentation errors throughout the manifests
- Resolve line length violations
- Correct YAML list formatting
- Update the image name to remove the -amd64 suffix
- Align with project standards and CI expectations
- Resolve validate-manifests target failures
- Ensure consistency with the automated generation workflow
- Fix formatting issues in deployment.go
- Update deployment_test.go with topology spread constraint tests
- Add the missing deletion timestamp metric documentation
- Fix deployment status condition reason labels in tests
- Ensure all tests pass with the expected metric values
@SoumyaRaikwar SoumyaRaikwar force-pushed the add-deployment-topology-spread-constraints-metric branch from 623acad to 71b908f Compare September 19, 2025 14:07
@SoumyaRaikwar
Author

Hi @mrueg,
When you have a moment, could you please review my PR?

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Needs Review (PR) or Response (Issue)

4 participants