Description
Background
On 2025-12-04, #2834 introduced a curated set of Kubernetes dashboards from the dotdc/grafana-dashboards-kubernetes project. The Views dashboards (Global → Namespaces → Nodes → Pods) have been well received and clearly improve Welkin’s Grafana experience.
Two of the additional dashboards, however, overlap with dashboards we already ship:
- Trivy Operator dashboard
- Prometheus dashboard
During review, the question was raised whether these should co-exist with the current equivalents, or whether one version should replace the other.
Open Questions (design decision)
Trivy Operator dashboard
Behavioural difference:
- Current dashboard → sums vulnerabilities per unique image
- New dashboard → sums vulnerabilities per pod (inflates the total when the same image runs in many pods)
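To make the difference concrete, here is a minimal Python sketch of the two aggregation models. The pod names, images, and counts are hypothetical sample data, and the functions only illustrate the arithmetic; they are not the queries used by either dashboard.

```python
# Hypothetical sample data: one record per running pod, with the image it
# runs and the number of vulnerabilities reported for that image.
pod_reports = [
    {"pod": "web-1", "image": "nginx:1.25", "vulnerabilities": 10},
    {"pod": "web-2", "image": "nginx:1.25", "vulnerabilities": 10},
    {"pod": "web-3", "image": "nginx:1.25", "vulnerabilities": 10},
    {"pod": "db-1", "image": "postgres:16", "vulnerabilities": 4},
]


def total_per_pod(reports):
    """Sum over every pod: an image is counted once per pod that runs it."""
    return sum(r["vulnerabilities"] for r in reports)


def total_per_image(reports):
    """Sum over unique images: each image is counted once, however many pods run it."""
    per_image = {}
    for r in reports:
        per_image[r["image"]] = r["vulnerabilities"]
    return sum(per_image.values())


print("per pod:  ", total_per_pod(pod_reports))
print("per image:", total_per_image(pod_reports))
```

With this sample, three replicas of the same image triple its contribution to the per-pod total, while the per-image total counts it once. Which number is more useful depends on whether the dashboard is meant to show exposure (how many workloads are affected) or remediation effort (how many images need fixing).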
Decision options
- Keep both dashboards (different use-cases)
- Replace the current one
- Remove the new one
Prometheus dashboard
Functional difference:
- Current one includes time-series per job and is laser-focused on Prometheus metrics
- New one adds CPU / memory for all pods, which may not belong in a Prometheus-centric dashboard and duplicates metrics already shown elsewhere
Decision options
- Keep both (broader operator visibility)
- Replace the current one
- Remove the new one
Proposed approach
Treat both dashboards as experimental for the current sprint.
Operators/SME team to test and document:
- Monitoring coverage
- Duplication vs. complementarity
- Usability / clarity of dashboard purpose
- Dashboard performance (loading / navigation)
Based on findings, we decide in the next sprint whether to:
- retain both,
- replace existing dashboards, or
- drop the new additions.
Checklist for this issue
- Confirm scope of each Prometheus and Trivy dashboard
- Decide desired vulnerability aggregation model (image vs pod)
- Identify duplication of metrics or panels
- Document recommendation + rationale
- Update the dashboard set accordingly in a follow-up PR
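For the duplication check, one possible starting point is to compare the query expressions in the exported dashboard JSON. The sketch below assumes the usual Grafana export layout (a top-level `panels` list, optional row panels with their own nested `panels`, and `targets` entries carrying an `expr` string). The two sample dashboards are hypothetical stand-ins, not the real Welkin or dotdc definitions.

```python
def collect_exprs(panels):
    """Recursively collect query expressions from panels, including panels nested in rows."""
    exprs = set()
    for panel in panels:
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                # Collapse whitespace so formatting differences do not hide a match.
                exprs.add(" ".join(expr.split()))
        exprs |= collect_exprs(panel.get("panels", []))
    return exprs


def shared_exprs(dashboard_a, dashboard_b):
    """Return the expressions that appear verbatim in both dashboards."""
    return collect_exprs(dashboard_a.get("panels", [])) & collect_exprs(
        dashboard_b.get("panels", [])
    )


# Hypothetical stand-ins for the existing and the newly added dashboard.
current_dashboard = {
    "panels": [
        {"title": "Targets up", "targets": [{"expr": "up"}]},
        {
            "title": "Samples appended",
            "targets": [{"expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])"}],
        },
    ]
}
new_dashboard = {
    "panels": [
        {"type": "row", "panels": [{"title": "Up", "targets": [{"expr": "up"}]}]},
        {
            "title": "CPU by pod",
            "targets": [{"expr": "sum(rate(container_cpu_usage_seconds_total[5m]))  by (pod)"}],
        },
    ]
}

print(sorted(shared_exprs(current_dashboard, new_dashboard)))
```

Real exports can be loaded with `json.load` and passed to `shared_exprs` in the same way. Exact string matching only catches identical queries, so panels that show the same metric through differently written expressions still need a manual review.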
Notes for reviewers
Initial impressions from review were positive regarding design and navigation. This issue aims to ensure long-term consistency and avoid creeping dashboard duplication.
Definition of done
This issue is complete when:
- Evaluation completed
  The functional differences between the new and existing Prometheus and Trivy Operator dashboards are fully assessed and documented:
  - Purpose and intended audience of each dashboard
  - Metric query differences (e.g. vulnerabilities per image vs per pod)
  - Identified overlaps or gaps in monitoring coverage
  - Usability and performance considerations
- Decision recorded
  For each dashboard, one of the following outcomes is explicitly agreed by the team:
  - Keep both
  - Replace the existing dashboard
  - Remove the newly added dashboard
- Action plan defined
  A follow-up PR is clearly scoped, indicating:
  - Which dashboards to keep/remove
  - Any query or panel modifications required for consistency
  - Expected impact on users
- Approval
  The team confirm the decision in comments.