Add multi-cluster dashboards: cluster variable, ad hoc filters, and cluster comparison dashboard#604

Open
rustyrazorblade wants to merge 5 commits into main from multi-cluster-dashboards

Conversation

@rustyrazorblade
Owner

No description provided.

…luster comparison dashboard

- Add cluster multi-select variable and adhocfilters to all 11 dashboards
- Fix adhocfilters to include applyMode and baseFilters required by Grafana 10+
- Add VictoriaMetrics httpMethod=POST for label API compatibility
- Use name-UUID format for cluster metric label to ensure uniqueness across clusters
- Add new cluster-comparison.json dashboard with 8 sections showcasing all panel types:
  Polystat, Table, Time Series, Bar Chart, Status History, Heatmap, Percentile Ladder
- Add stress job section showing client-observed throughput and latency per cluster
- Register GrafanaDashboard.CLUSTER_COMPARISON as optional enum entry
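A hedged sketch of the kind of templating entry the first two bullets describe, carrying the applyMode and baseFilters fields Grafana 10+ expects (the datasource uid and the surrounding structure are assumptions, not taken from the PR):

```json
{
  "templating": {
    "list": [
      {
        "type": "adhoc",
        "name": "adhocfilters",
        "datasource": { "type": "prometheus", "uid": "victoriametrics" },
        "applyMode": "manual",
        "baseFilters": []
      }
    ]
  }
}
```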
@claude

claude bot commented Mar 30, 2026

PR Review: Multi-cluster dashboards

The bulk of this PR (~6000 lines) is JSON reformatting of existing dashboards from compact to pretty-printed format — functional no-ops. The substantive changes are: new clusterLabelName() extension, new cluster comparison dashboard, ClickHouse datasource removal, VictoriaMetrics httpMethod:POST, and cluster variable support in ClickHouse dashboards.

ClickHouse datasource removal: The GrafanaDatasource entry for ClickHouse is removed from GrafanaDatasourceConfig, but GrafanaDashboard.CLICKHOUSE and GrafanaDashboard.CLICKHOUSE_LOGS remain registered. ClickHouse dashboards will now have no provisioned datasource after grafana update-config runs. I understand the old config was broken (hardcoded db0 hostname), but removing without a replacement leaves ClickHouse users with broken dashboards. Was this intentional? If ClickHouse datasource configuration is being deferred to a separate mechanism, that should be documented.

SetupInstance cluster name change: This changes the CLUSTER_NAME env var written to stress instances from just the cluster name to {name}-{clusterId}. This makes the label consistent with what OTel collector assigns, which is the right call for multi-cluster support. Worth calling out explicitly that re-running setup-instance on an existing stress node will change its reported cluster label, which could affect Grafana filter continuity mid-run.

OpenSpec change not archived: openspec/changes/cluster-comparison-dashboard/ is present but not archived. If the implementation is complete, it should be moved to archive/ as part of this PR.

httpMethod: POST for VictoriaMetrics: Good change — avoids URL length limits on large PromQL queries.
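For reference, a minimal sketch of how that setting lands in a Grafana datasource provisioning file (the datasource name and URL here are assumptions):

```yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://victoriametrics:8428   # assumed address
    jsonData:
      httpMethod: POST   # send queries in the request body instead of the URL
```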

Cluster comparison dashboard: Well-structured. histogram_quantile with sum by (le, cluster) is mathematically valid for per-cluster aggregation. The seven-section layout covering Polystat, Time Series, Bar Chart, Status History, Heatmap, and Percentile Ladder is a solid reference implementation. All panels correctly use cluster=~"$cluster" regex-match to support multi-select. The cluster2 variable for side-by-side heatmaps is a nice touch.
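A sketch of the per-cluster quantile shape described above (the metric name is illustrative, not taken from the dashboard):

```promql
histogram_quantile(
  0.99,
  sum by (le, cluster) (
    rate(request_latency_seconds_bucket{cluster=~"$cluster"}[$__rate_interval])
  )
)
```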

JSON reformatting: The pretty-printing changes inflate this diff significantly and make the real changes hard to find. Consider separating these into a dedicated commit in future PRs.

@claude

claude bot commented Mar 30, 2026

PR Review

Overall: Good multi-cluster foundation. The Kotlin changes are minimal and clean.


OpenSpec: cluster-comparison-dashboard not archived

openspec/changes/cluster-comparison-dashboard/ is present but unarchived while multi-cluster-dashboards is archived. If implementation is complete, move it to archive/ before merging.


ClickHouse datasource removal — context clarification

The existing review comment flags this as potentially breaking, but the spec correctly justifies it: the native ClickHouse datasource was confirmed unused (all ClickHouse dashboard panels query VictoriaMetrics). Removing it is the right call. No issue here.


clusterLabelName() — potential label continuity issue

fun ClusterState.clusterLabelName(): String = "$name-$clusterId"

This is a good addition and consistent with clusterPrefix() and metricsConfigId(). However, SetupInstance uses this to set CLUSTER_NAME on stress nodes. For any existing running cluster, re-running setup-instance will silently change the label OTel reports (from bare name to name-clusterId), breaking Grafana filter continuity mid-run. Worth a note in the command output or docs for operators with existing clusters.


httpMethod: POST for VictoriaMetrics — good change

Correct fix for large PromQL queries that exceed URL length limits. No issues.


JSON reformatting

The ~8000-line diff is dominated by pretty-printing existing dashboard JSON. The actual functional changes are small. Consider separating formatting-only commits in future PRs to make functional changes easier to review.


Blocking: Archive the cluster-comparison-dashboard OpenSpec change. Everything else is non-blocking.

@claude

claude bot commented Mar 30, 2026

Code Review

Overall this is a solid PR - the clusterLabelName() extension cleanly centralizes the cluster identifier format, the dashboard variable work is consistent, and the httpMethod: POST addition for VictoriaMetrics is a good operational improvement. A few things worth checking:


1. OTel: action: insert will not override existing cluster labels

In otel-collector-config.yaml, the new processor uses action: insert. The insert action is a no-op if the attribute already exists. MAAC (the Cassandra metrics exporter) ships its own cluster label using Cassandra's internal cluster name. MAAC metrics that already carry a cluster attribute will not get the {name}-{clusterId} value stamped on them.

The dashboard queries filter on cluster=~"$cluster". If the $cluster variable is populated from the env-var-derived cluster label but MAAC metrics carry a different internal value, those panels will show no data.

Consider: Should this be action: upsert? Or is the intent to let MAAC's internal cluster name win - in which case the dashboard variable must be populated from MAAC label values rather than the env-var-derived one?
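If upsert is the desired behavior, the change would be roughly the following (processor name taken from the comment above; the env-var source is an assumption):

```yaml
processors:
  resource/cluster:
    attributes:
      - key: cluster
        value: ${env:CLUSTER_NAME}   # the {name}-{clusterId} value
        action: upsert               # overwrite MAAC's own cluster label if present
```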


2. ClickHouse datasource removed but ClickHouse dashboards remain

GrafanaDatasourceConfig.kt removes the ClickHouse datasource provisioning entry, but clickhouse.json and clickhouse-logs.json are still present and reference grafana-clickhouse-datasource. Any user running a ClickHouse cluster will see "datasource not found" on those dashboards after this change.

Was this intentional? If ClickHouse requires a separate opt-in provision step, that should be documented.


3. metrics/otlp pipeline does not get the cluster label

The resource/cluster processor is added only to metrics/local. OTLP metrics (Spark, etc.) flow through metrics/otlp which only has batch. If multi-cluster filtering is expected to work on Spark metrics, they will be missing the cluster label entirely. Fine to defer if out of scope, but worth a follow-up ticket.
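If the follow-up lands, the pipeline wiring would look something like this (receiver and exporter names are assumptions):

```yaml
service:
  pipelines:
    metrics/otlp:
      receivers: [otlp]
      processors: [resource/cluster, batch]   # add resource/cluster ahead of batch
      exporters: [prometheusremotewrite]
```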


Minor

  • clusterLabelName() has no length guard - metricsConfigId() uses .take(Constants...) because it ends up in K8s resource names. Label values do not have that constraint, so this is fine; just noting it is intentional.
  • Comment fix in MetricsCollector.kt (cassandra-condensed.json to cassandra-overview.json) is correct and appreciated.
  • Replacing clusterState.initConfig?.name ?: "cluster" with clusterLabelName() is a good cleanup - it eliminates a silent fallback that would have made multi-cluster filtering ambiguous.

- Remove p98 from read/write latency panels, switch to $__rate_interval
- Drop "All Coordinator Latencies (p99)" panel (wrong metric, redundant)
- Drop "Nodes Status" section (inaccurate)
- Add "Per-Node Latency" section with read/write p99 per host_name
- Add Compaction Throughput panel (Bps) alongside Pending Compactions
- Fix CPU Utilization legend to include cluster name
@claude

claude bot commented Mar 31, 2026

PR Review: Multi-cluster dashboards

Overall this is a solid, well-scoped change. The OpenSpec design docs are thorough and the Kotlin changes are small and focused. A few things worth discussing:


metrics/otlp pipeline missing resource/cluster processor

The resource/cluster processor is added to metrics/local but not metrics/otlp (Spark nodes). The comment says Spark nodes "already have host.name and node_role" but doesn't address the cluster label. If Spark metrics are ever visualized in the cluster-comparison dashboard or filtered by cluster=~"$cluster", they'll silently show no data. Worth confirming whether Spark OTLP traffic already carries the cluster label from its sender, or whether this pipeline also needs the processor.


clusterLabelName() — no length guardrail

metricsConfigId() truncates via .take(Constants.S3.MAX_METRICS_CONFIG_ID_LENGTH) to satisfy S3 API limits. clusterLabelName() returns the raw "$name-$clusterId" string without any truncation. Prometheus label values have no hard length limit, so this is likely fine in practice, but it's worth noting the inconsistency — especially since clusterId is a UUID, making the resulting string predictably long.
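If a guard were ever wanted for consistency, it could mirror the .take(...) pattern — a self-contained sketch, with both the constant and the cap value invented for illustration:

```kotlin
// Hypothetical cap; Prometheus label values have no hard limit,
// so any bound here is a style choice, not a correctness requirement.
const val MAX_CLUSTER_LABEL_LENGTH = 63

fun clusterLabelName(name: String, clusterId: String): String =
    "$name-$clusterId".take(MAX_CLUSTER_LABEL_LENGTH)
```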


cassandra-condensed.json removal — silent breaking change for existing deployments

Existing deployments that previously ran grafana update-config will have a grafana-dashboard-cassandra-condensed ConfigMap and a corresponding Grafana volume mount still in place. Since the enum entry is gone, re-running grafana update-config will no longer provision or clean up that ConfigMap. The old dashboard will keep appearing in Grafana until someone manually deletes the stale ConfigMap. Is there cleanup logic elsewhere that handles removed enum entries, or does this need a migration note in the docs?


httpMethod: POST for VictoriaMetrics — good defensive fix

This is a correct addition. Instant query requests to VictoriaMetrics for complex label_values(...) variable queries can hit URI length limits on GET. POST avoids that. No concerns here.


Dashboard JSON formatting churn

The bulk of the line count in this PR (~4000+ lines) is JSON reformatting — compact inline objects expanded to multi-line. This has no functional effect but makes the diff very hard to review. If there's a canonical formatter for these files it would be worth running consistently. Not a blocker, just a note for future reviewers.


Minor: MetricsCollector.kt comment update is correct

// Cassandra queries — match Grafana cassandra-overview.json dashboard — this is the right reference now that cassandra-condensed.json is removed.
