Conversation

SoumyaRaikwar

What this PR does / why we need it

This PR adds the kube_deployment_spec_topology_spread_constraints metric that counts the number of topology spread constraints defined in a deployment's pod template specification.

This PR addresses the topology spread constraints monitoring requirement from issue #2701, which specifically requested visibility into scheduling primitives, including "pod topology spread constraints", for monitoring workload pod distribution.

Which issue(s) this PR fixes

Addresses the topology spread constraints monitoring portion of #2701 (Add schedule spec and status for workload)

Problem Solved

Issue #2701 identified that operators need to monitor various scheduling primitives to detect when "break variation may happen because pod priority preemption or node pressure eviction."

- Adds new metric to count topology spread constraints in deployment pod templates
- Includes comprehensive test coverage for both cases (with/without constraints)
- Follows existing patterns and stability guidelines
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes and needs-triage labels Aug 10, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SoumyaRaikwar
Once this PR has been reviewed and has the lgtm label, please assign rexagod for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL label Aug 10, 2025
@SoumyaRaikwar SoumyaRaikwar changed the title from "Add kube_deployment_spec_topology_spread_constraints metric for issue #2701" to "feat: Add kube_deployment_spec_topology_spread_constraints metric for issue #2701" Aug 10, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL label and removed the size/XXL label Aug 10, 2025
@k8s-ci-robot k8s-ci-robot added the size/L label and removed the size/XL label Aug 10, 2025
@mrueg
Member

mrueg commented Aug 13, 2025

How would you use this metric for alerting or to provide info about the deployment?

@SoumyaRaikwar
Author

How would you use this metric for alerting or to provide info about the deployment?

These topology spread constraint metrics enable alerting on workload distribution policies: `kube_deployment_spec_topology_spread_constraints > 0` identifies deployments that declare spread constraints, which you can then correlate with pod distribution metrics when workloads become unevenly spread across zones/nodes during resource pressure.

You can alert on missing distribution policies with `(kube_deployment_spec_replicas > 1) and (kube_deployment_spec_topology_spread_constraints == 0)` to identify multi-replica deployments lacking a spread configuration.

For dashboards, `count(kube_deployment_spec_topology_spread_constraints > 0)` shows cluster-wide adoption of topology spread policies, complementing the pod affinity/anti-affinity metrics I implemented in PR #2733.

During incidents, these metrics help correlate why workloads became concentrated in specific topology domains or why pods failed to schedule due to overly restrictive spread policies.

This completes the scheduling observability suite from issue #2701: together with my pod affinity/anti-affinity metrics (PR #2733), operators now have full visibility into both co-location/separation rules and even-distribution policies across cluster topology. Thanks @mrueg!

@k8s-ci-robot k8s-ci-robot added the needs-rebase label Aug 14, 2025
@mrueg
Member

mrueg commented Aug 14, 2025

The same comment as in the other PR #2733 applies here: we should have explicit metrics per constraint, i.e. kube_deployment_topology_spread_constraint{}, rather than simply counting a length.

Replace kube_deployment_spec_topology_spread_constraints count metric
with kube_deployment_spec_topology_spread_constraint that exposes
detailed information for each constraint including topology_key,
max_skew, when_unsatisfiable, min_domains, and label_selector labels.

This enables more granular monitoring and alerting on topology spread
constraint configurations across deployments.

- Add explicit per-constraint metric generation
- Handle nil MinDomains with default value of 1
- Add proper LabelSelector string conversion
- Update tests with constraint details
- Mark metric as ALPHA stability
- Update documentation tables

Closes: #<issue-number-if-applicable>
… for topology spread constraints in deployments.
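To make the commit's bullet points concrete, here is a minimal sketch of how such a per-constraint family generator could look. It is written against helpers that kube-state-metrics' internal/store package already uses (wrapDeploymentFunc, the metric and metric_generator packages, and the ALPHA stability level from component-base), but the function name and HELP text are placeholders and assumptions, not the code actually merged in this PR.

```go
package store

import (
	"strconv"

	v1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	basemetrics "k8s.io/component-base/metrics"

	"k8s.io/kube-state-metrics/v2/pkg/metric"
	generator "k8s.io/kube-state-metrics/v2/pkg/metric_generator"
)

// topologySpreadConstraintFamily (hypothetical name) emits one series per
// topology spread constraint in the deployment's pod template.
func topologySpreadConstraintFamily() generator.FamilyGenerator {
	return *generator.NewFamilyGeneratorWithStability(
		"kube_deployment_spec_topology_spread_constraint",
		"Topology spread constraints defined in the deployment's pod template.",
		metric.Gauge,
		basemetrics.ALPHA,
		"",
		wrapDeploymentFunc(func(d *v1.Deployment) *metric.Family {
			ms := []*metric.Metric{}
			for _, c := range d.Spec.Template.Spec.TopologySpreadConstraints {
				// MinDomains is optional; a nil value behaves like 1, so expose that default.
				minDomains := int32(1)
				if c.MinDomains != nil {
					minDomains = *c.MinDomains
				}
				// Render the optional label selector as a string label (empty if unset).
				selector := ""
				if c.LabelSelector != nil {
					selector = metav1.FormatLabelSelector(c.LabelSelector)
				}
				ms = append(ms, &metric.Metric{
					LabelKeys: []string{"topology_key", "max_skew", "when_unsatisfiable", "min_domains", "label_selector"},
					LabelValues: []string{
						c.TopologyKey,
						strconv.FormatInt(int64(c.MaxSkew), 10),
						string(c.WhenUnsatisfiable),
						strconv.FormatInt(int64(minDomains), 10),
						selector,
					},
					Value: 1,
				})
			}
			// wrapDeploymentFunc adds the namespace and deployment labels to every series.
			return &metric.Family{Metrics: ms}
		}),
	)
}
```

Because every constraint becomes its own series with value 1, summing per deployment (for example sum by (namespace, deployment) in PromQL) recovers the old count metric while keeping the per-constraint details queryable.
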
@SoumyaRaikwar
Author

Add explicit topology spread constraint metric for deployments
Replaces kube_deployment_spec_topology_spread_constraints (count-only) with kube_deployment_spec_topology_spread_constraint that exposes detailed labels per constraint: topology_key, max_skew, when_unsatisfiable, min_domains, and label_selector.

This enables precise Prometheus queries and better observability of constraint configurations across deployments.
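
For illustration, a deployment with a single zone-spread constraint might then be exposed roughly as in the sketch below; the namespace, deployment name, HELP text, and constraint values are made-up examples, not output captured from this PR.

```go
// exampleTopologySpreadConstraintSeries holds a hypothetical sample of the new
// per-constraint exposition; namespace, deployment name, and constraint values
// are illustrative only.
const exampleTopologySpreadConstraintSeries = `# HELP kube_deployment_spec_topology_spread_constraint Topology spread constraints defined in the deployment's pod template.
# TYPE kube_deployment_spec_topology_spread_constraint gauge
kube_deployment_spec_topology_spread_constraint{namespace="default",deployment="web",topology_key="topology.kubernetes.io/zone",max_skew="1",when_unsatisfiable="DoNotSchedule",min_domains="1",label_selector="app=web"} 1`
```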

Thanks @mrueg for the suggestion to make this metric explicit! ✓ Tests updated ✓ Documentation updated

@k8s-ci-robot k8s-ci-robot added and removed the needs-rebase label Aug 19, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Aug 30, 2025
@SoumyaRaikwar SoumyaRaikwar force-pushed the add-deployment-topology-spread-constraints-metric branch from a96beb0 to 679d205 on August 30, 2025 22:51
@k8s-ci-robot k8s-ci-robot added the needs-rebase label Aug 30, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Aug 30, 2025
SoumyaRaikwar and others added 3 commits September 1, 2025 13:29
- Add document start markers (---) to all YAML files
- Fix indentation errors throughout manifests
- Resolve line length violations
- Correct YAML list formatting
- Update image name to remove -amd64 suffix
@k8s-ci-robot k8s-ci-robot added the needs-rebase and size/XL labels and removed the size/L label Sep 1, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Sep 1, 2025