Evaluate Prometheus and Trivy Operator dashboards introduced in #2834 (#2891)

@chi-quita-a

Background

On 2025-12-04, #2834 introduced a curated set of Kubernetes dashboards from the dotdc/grafana-dashboards-kubernetes project. The Views dashboards (Global → Namespaces → Nodes → Pods) have been well received and clearly enhance Welkin’s Grafana experience.

Two of the additional dashboards, however, overlap with existing ones already provided in our current setup:

  • Trivy Operator dashboard
  • Prometheus dashboard

During review, the question was raised whether these should co-exist with the current equivalents, or whether one version should replace the other.

Open Questions (design decision)

Trivy Operator dashboard

Behavioural difference:

  • Current dashboard → sums vulnerabilities per unique image
  • New dashboard → sums vulnerabilities per pod (totals are inflated when the same image runs in many pods, which overstates risk)
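
To make the difference concrete, here is a minimal sketch of the two aggregation models. The pod names, images, and counts are made-up sample data, and the two functions are hypothetical illustrations, not the queries either dashboard actually runs.

```python
# Hypothetical sample data: (pod, image, vulnerability count reported for that image).
# Three pods run the same nginx image; one pod runs postgres.
workloads = [
    ("web-1", "nginx:1.25", 4),
    ("web-2", "nginx:1.25", 4),
    ("web-3", "nginx:1.25", 4),
    ("db-0", "postgres:16", 2),
]


def total_per_pod(rows):
    """Sum the count once for every pod, so shared images are counted repeatedly."""
    return sum(count for _pod, _image, count in rows)


def total_per_image(rows):
    """Sum the count once for every unique image, regardless of how many pods run it."""
    per_image = {}
    for _pod, image, count in rows:
        per_image[image] = count  # same image always carries the same count here
    return sum(per_image.values())


print("per pod:  ", total_per_pod(workloads))    # 4 + 4 + 4 + 2 = 14
print("per image:", total_per_image(workloads))  # 4 + 2 = 6
```

In this sample the per-pod total is more than double the per-image total, even though only two distinct images need patching. Which figure is more useful depends on whether the dashboard should express exposure (running instances) or remediation work (images to fix).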

Decision options

  • Keep both dashboards (different use-cases)
  • Replace the current one
  • Remove the new one

Prometheus dashboard

Functional difference:

  • The current dashboard includes per-job time series and focuses strictly on Prometheus metrics
  • The new dashboard adds CPU / memory for all pods, which may not belong in a Prometheus-centric dashboard and duplicates metrics already shown elsewhere

Decision options

  • Keep both (broader operator visibility)
  • Replace the current one
  • Remove the new one

Proposed approach

Treat both dashboards as experimental for the current sprint.
Operators/SME team to test and document:

  • Monitoring coverage
  • Duplication vs. complementarity
  • Usability / clarity of dashboard purpose
  • Dashboard performance (loading / navigation)

Based on the findings, we will decide in the next sprint whether to:

  • retain both,
  • replace existing dashboards, or
  • drop the new additions.

Checklist for this issue

  • Confirm scope of each Prometheus and Trivy dashboard
  • Decide desired vulnerability aggregation model (image vs pod)
  • Identify duplication of metrics or panels
  • Document recommendation + rationale
  • Update the dashboard set accordingly in a follow-up PR
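
For the "Identify duplication of metrics or panels" item, one possible starting point is to compare the query expressions in two exported dashboard JSON files. The sketch below assumes the usual Grafana dashboard JSON layout (a top-level `panels` list, row panels that may nest their own `panels`, and `targets` entries with an `expr` field). The helper names and the inline sample dashboards are hypothetical, and the comparison is purely textual, so semantically equivalent queries that are written differently will not be matched.

```python
def iter_panels(panels):
    """Yield every panel, including panels nested inside row panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def query_exprs(dashboard):
    """Collect the whitespace-normalised query expressions used by a dashboard."""
    exprs = set()
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                exprs.add(" ".join(expr.split()))
    return exprs


def shared_queries(dashboard_a, dashboard_b):
    """Return the expressions that appear verbatim in both dashboards."""
    return sorted(query_exprs(dashboard_a) & query_exprs(dashboard_b))


# Hypothetical, heavily trimmed dashboards used only to exercise the helpers.
current = {
    "panels": [
        {"title": "Head series",
         "targets": [{"expr": "sum(prometheus_tsdb_head_series)"}]},
    ]
}
new = {
    "panels": [
        {"type": "row", "panels": [
            {"title": "Head series",
             "targets": [{"expr": "sum(prometheus_tsdb_head_series)  "}]},
            {"title": "Memory by pod",
             "targets": [{"expr": "sum(container_memory_working_set_bytes) by (pod)"}]},
        ]},
    ]
}

print(shared_queries(current, new))
```

With real exports, the two dictionaries would come from `json.load` on the dashboard files. The output is only a list of candidates for manual review, not a verdict on which dashboard to keep.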

Notes for reviewers

Initial impressions from review were positive regarding design and navigation. This issue aims to ensure long-term consistency and avoid creeping dashboard duplication.

Definition of done

This issue is complete when:

  1. Evaluation completed
    The functional differences between the new and existing Prometheus and Trivy Operator dashboards are fully assessed and documented:

    • Purpose and intended audience of each dashboard
    • Metric query differences (e.g. vulnerabilities per image vs per pod)
    • Identified overlaps or gaps in monitoring coverage
    • Usability and performance considerations
  2. Decision recorded
    For each dashboard, one of the following outcomes is explicitly agreed by the team:

    • Keep both
    • Replace the existing dashboard
    • Remove the newly added dashboard
  3. Action plan defined
    A follow-up PR is clearly scoped, indicating:

    • Which dashboards to keep/remove
    • Any query or panel modifications required for consistency
    • Expected impact on users
  4. Approval
    The team confirm the decision in comments.
