Skip to content

Add multi-cluster dashboards: cluster variable, ad hoc filters, and cluster comparison dashboard#604

Open
rustyrazorblade wants to merge 2 commits intomainfrom
multi-cluster-dashboards
Open

Add multi-cluster dashboards: cluster variable, ad hoc filters, and cluster comparison dashboard#604
rustyrazorblade wants to merge 2 commits intomainfrom
multi-cluster-dashboards

Conversation

@rustyrazorblade
Copy link
Copy Markdown
Owner

No description provided.

…luster comparison dashboard

- Add cluster multi-select variable and adhocfilters to all 11 dashboards
- Fix adhocfilters to include applyMode and baseFilters required by Grafana 10+
- Add VictoriaMetrics httpMethod=POST for label API compatibility
- Use name-UUID format for cluster metric label to ensure uniqueness across clusters
- Add new cluster-comparison.json dashboard with 8 sections showcasing all panel types:
  Polystat, Table, Time Series, Bar Chart, Status History, Heatmap, Percentile Ladder
- Add stress job section showing client-observed throughput and latency per cluster
- Register GrafanaDashboard.CLUSTER_COMPARISON as optional enum entry
@claude
Copy link
Copy Markdown

claude bot commented Mar 30, 2026

PR Review: Multi-cluster dashboards

The bulk of this PR (~6000 lines) is JSON reformatting of existing dashboards from compact to pretty-printed format — functional no-ops. The substantive changes are: new clusterLabelName() extension, new cluster comparison dashboard, ClickHouse datasource removal, VictoriaMetrics httpMethod:POST, and cluster variable support in ClickHouse dashboards.

ClickHouse datasource removal: The GrafanaDatasource entry for ClickHouse is removed from GrafanaDatasourceConfig, but GrafanaDashboard.CLICKHOUSE and GrafanaDashboard.CLICKHOUSE_LOGS remain registered. ClickHouse dashboards will now have no provisioned datasource after grafana update-config runs. I understand the old config was broken (hardcoded db0 hostname), but removing without a replacement leaves ClickHouse users with broken dashboards. Was this intentional? If ClickHouse datasource configuration is being deferred to a separate mechanism, that should be documented.

SetupInstance cluster name change: This changes the CLUSTER_NAME env var written to stress instances from just the cluster name to {name}-{clusterId}. This makes the label consistent with what OTel collector assigns, which is the right call for multi-cluster support. Worth calling out explicitly that re-running setup-instance on an existing stress node will change its reported cluster label, which could affect Grafana filter continuity mid-run.

OpenSpec change not archived: openspec/changes/cluster-comparison-dashboard/ is present but not archived. If the implementation is complete, it should be moved to archive/ as part of this PR.

httpMethod: POST for VictoriaMetrics: Good change — avoids URL length limits on large PromQL queries.

Cluster comparison dashboard: Well-structured. histogram_quantile with sum by (le, cluster) is mathematically valid for per-cluster aggregation. The seven-section layout covering Polystat, Time Series, Bar Chart, Status History, Heatmap, and Percentile Ladder is a solid reference implementation. All panels correctly use cluster=~"$cluster" regex-match to support multi-select. The cluster2 variable for side-by-side heatmaps is a nice touch.

JSON reformatting: The pretty-printing changes inflate this diff significantly and make the real changes hard to find. Consider separating these into a dedicated commit in future PRs.

@claude
Copy link
Copy Markdown

claude bot commented Mar 30, 2026

PR Review

Overall: Good multi-cluster foundation. The Kotlin changes are minimal and clean.


OpenSpec: cluster-comparison-dashboard not archived

openspec/changes/cluster-comparison-dashboard/ is present but unarchived while multi-cluster-dashboards is archived. If implementation is complete, move it to archive/ before merging.


ClickHouse datasource removal — context clarification

The existing review comment flags this as potentially breaking, but the spec correctly justifies it: the native ClickHouse datasource was confirmed unused (all ClickHouse dashboard panels query VictoriaMetrics). Removing it is the right call. No issue here.


clusterLabelName() — potential label continuity issue

fun ClusterState.clusterLabelName(): String = "$name-$clusterId"

This is a good addition and consistent with clusterPrefix() and metricsConfigId(). However, SetupInstance uses this to set CLUSTER_NAME on stress nodes. For any existing running cluster, re-running setup-instance will silently change the label OTel reports (from bare name to name-clusterId), breaking Grafana filter continuity mid-run. Worth a note in the command output or docs for operators with existing clusters.


httpMethod: POST for VictoriaMetrics — good change

Correct fix for large PromQL queries that exceed URL length limits. No issues.


JSON reformatting

The ~8000-line diff is dominated by pretty-printing existing dashboard JSON. The actual functional changes are small. Consider separating formatting-only commits in future PRs to make functional changes easier to review.


Blocking: Archive the cluster-comparison-dashboard OpenSpec change. Everything else is non-blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant