Conversation

@Gabbe16 Gabbe16 commented Nov 7, 2025

Added CPU/Memory requested panels for the capacity management dashboard

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

What does this PR do / why do we need this PR?

This PR adds new panels to the CPU/Memory usage rows of the capacity management dashboard. The new panels are: requested CPU/Memory per node, average requested CPU/Memory per node group, CPU/Memory limits per node, and average CPU/Memory limits per node group. This makes it possible to compare usage, requests, and limits in one place, as requested in the issue this PR resolves. Below are two example images of how this looks on the Grafana capacity management dashboard.

New CPU dashboards
(screenshot: 2025-11-25 13-18-15)

New Memory dashboards
(screenshot: 2025-11-25 13-18-23)

Information to reviewers

The PromQL queries behind these panels are listed below, in case you would like to give feedback or test them. I'm only including the CPU queries, since the memory queries are identical except that "resource" is set to memory in each selector; an annotated sketch of the memory variant follows the CPU queries.

CPU requested

sum by (node) (
  kube_pod_container_resource_requests{cluster=~"$cluster", resource="cpu"}
  and on (pod, namespace, cluster) kube_pod_info{cluster=~"$cluster"}
  and on (pod, namespace, cluster) kube_pod_status_phase{phase="Running", cluster=~"$cluster"} == 1
)
/ sum by (node) (kube_node_status_allocatable{cluster=~"$cluster", resource="cpu"})
* on (node) group_left (label_elastisys_io_node_group)
  label_replace(kube_node_labels{label_elastisys_io_node_group=~"$NodeGroup"}, "instance", "$1", "node", "(.*)")

Average CPU requested

avg(
  sum by (node) (
    kube_pod_container_resource_requests{cluster=~"$cluster", resource="cpu"}
    and on (pod, namespace, cluster) kube_pod_info{cluster=~"$cluster"}
    and on (pod, namespace, cluster) kube_pod_status_phase{phase="Running", cluster=~"$cluster"} == 1
  )
  / sum by (node) (kube_node_status_allocatable{cluster=~"$cluster", resource="cpu"})
  * on (node) group_left (label_elastisys_io_node_group)
    label_replace(kube_node_labels{label_elastisys_io_node_group=~"$NodeGroup"}, "instance", "$1", "node", "(.*)")
)

CPU limits

sum by (node) (
  kube_pod_container_resource_limits{cluster=~"$cluster", resource="cpu"}
  and on (pod, namespace, cluster) kube_pod_info{cluster=~"$cluster"}
  and on (pod, namespace, cluster) kube_pod_status_phase{phase="Running", cluster=~"$cluster"} == 1
)
/ sum by (node) (kube_node_status_allocatable{cluster=~"$cluster", resource="cpu"})
* on (node) group_left (label_elastisys_io_node_group)
  label_replace(kube_node_labels{label_elastisys_io_node_group=~"$NodeGroup"}, "instance", "$1", "node", "(.*)")

Average CPU limits

avg(
  sum by (node) (
    kube_pod_container_resource_limits{cluster=~"$cluster", resource="cpu"}
    and on (pod, namespace, cluster) kube_pod_info{cluster=~"$cluster"}
    and on (pod, namespace, cluster) kube_pod_status_phase{phase="Running", cluster=~"$cluster"} == 1
  )
  / sum by (node) (kube_node_status_allocatable{cluster=~"$cluster", resource="cpu"})
  * on (node) group_left (label_elastisys_io_node_group)
    label_replace(kube_node_labels{label_elastisys_io_node_group=~"$NodeGroup"}, "instance", "$1", "node", "(.*)")
)
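
Memory requested (sketch)

As noted above, the memory panels use the same queries with "resource" set to memory. For reviewers' convenience, here is a sketch of the memory-requested query, derived from the CPU query by that substitution; the inline comments are annotations for this description, not part of the dashboard itself.

# Numerator: memory requests summed per node, counting only Running pods
sum by (node) (
  kube_pod_container_resource_requests{cluster=~"$cluster", resource="memory"}
  and on (pod, namespace, cluster) kube_pod_info{cluster=~"$cluster"}
  and on (pod, namespace, cluster) kube_pod_status_phase{phase="Running", cluster=~"$cluster"} == 1
)
# Denominator: allocatable memory per node, yielding a requested/allocatable ratio
/ sum by (node) (kube_node_status_allocatable{cluster=~"$cluster", resource="memory"})
# Join on node to attach the node-group label and honor the $NodeGroup variable
* on (node) group_left (label_elastisys_io_node_group)
  label_replace(kube_node_labels{label_elastisys_io_node_group=~"$NodeGroup"}, "instance", "$1", "node", "(.*)")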

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@Gabbe16 Gabbe16 requested a review from a team as a code owner November 7, 2025 10:22
@Gabbe16 Gabbe16 added the kind/improvement Improvement of existing features, e.g. code cleanup or optimizations. label Nov 7, 2025
@Zash Zash left a comment

Queries look sensible.

@Gabbe16 Gabbe16 force-pushed the gabbe16/add-requests-panels-capacity-management-dashboard branch from c519b9a to 0bc2b95 Compare November 10, 2025 12:15
@Gabbe16 Gabbe16 requested a review from Zash November 10, 2025 12:19
@Gabbe16 Gabbe16 force-pushed the gabbe16/add-requests-panels-capacity-management-dashboard branch 2 times, most recently from f19a99e to f15c184 Compare November 14, 2025 10:10

Gabbe16 commented Nov 14, 2025

The PR description has now been updated to reflect the newest implementation suggested by @viktor-f. If you already read the previous version, please reread it and give more feedback, thanks!

@Gabbe16 Gabbe16 requested a review from viktor-f November 14, 2025 10:31
…oard

fix: set panel collapse to false

fix: panels being broken and not showing up

apps: restructured panels and added new panels for cpu/memory regarding the average usage and limits

apps: added average CPU/Memory limits dashboards to complete the set + dashboard name changes
@Gabbe16 Gabbe16 force-pushed the gabbe16/add-requests-panels-capacity-management-dashboard branch from f15c184 to ebbeb0c Compare November 25, 2025 12:30

@viktor-f viktor-f left a comment

Looks good, thanks for doing the extra suggestions as well 🚀

@Gabbe16 Gabbe16 merged commit 5fa2791 into main Dec 1, 2025
13 checks passed
@Gabbe16 Gabbe16 deleted the gabbe16/add-requests-panels-capacity-management-dashboard branch December 1, 2025 08:59

Labels

kind/improvement Improvement of existing features, e.g. code cleanup or optimizations.

Development

Successfully merging this pull request may close these issues.

Add requests panels in Capacity Management Dashboard

3 participants