Add multi-cluster dashboards: cluster variable, ad hoc filters, and cluster comparison dashboard #604
rustyrazorblade wants to merge 10 commits into main
PR Review: Multi-cluster dashboards

The bulk of this PR (~6000 lines) is JSON reformatting of existing dashboards from compact to pretty-printed format, which is functionally a no-op. The substantive changes are: new clusterLabelName() extension, new cluster comparison dashboard, ClickHouse datasource removal, VictoriaMetrics httpMethod: POST, and cluster variable support in ClickHouse dashboards.

ClickHouse datasource removal: The GrafanaDatasource entry for ClickHouse is removed from GrafanaDatasourceConfig, but GrafanaDashboard.CLICKHOUSE and GrafanaDashboard.CLICKHOUSE_LOGS remain registered. ClickHouse dashboards will now have no provisioned datasource after grafana update-config runs. I understand the old config was broken (hardcoded db0 hostname), but removing it without a replacement leaves ClickHouse users with broken dashboards. Was this intentional? If ClickHouse datasource configuration is being deferred to a separate mechanism, that should be documented.

SetupInstance cluster name change: This changes the CLUSTER_NAME env var written to stress instances from just the cluster name to {name}-{clusterId}. This makes the label consistent with what the OTel collector assigns, which is the right call for multi-cluster support. It is worth calling out explicitly that re-running setup-instance on an existing stress node will change its reported cluster label, which could affect Grafana filter continuity mid-run.

OpenSpec change not archived: openspec/changes/cluster-comparison-dashboard/ is present but not archived. If the implementation is complete, it should be moved to archive/ as part of this PR.

httpMethod: POST for VictoriaMetrics: Good change. It avoids URL length limits on large PromQL queries.

Cluster comparison dashboard: Well-structured. histogram_quantile with sum by (le, cluster) is mathematically valid for per-cluster aggregation. The seven-section layout covering Polystat, Time Series, Bar Chart, Status History, Heatmap, and Percentile Ladder is a solid reference implementation. All panels correctly use the cluster=~"$cluster" regex match to support multi-select. The cluster2 variable for side-by-side heatmaps is a nice touch.

JSON reformatting: The pretty-printing changes inflate this diff significantly and make the real changes hard to find. Consider separating these into a dedicated commit in future PRs.
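To make the point about histogram_quantile with sum by (le, cluster) concrete, here is a minimal Python sketch of the idea, not the dashboard's actual query engine. It assumes every host in a cluster exports the same bucket boundaries, so cumulative bucket counts can be added across hosts before the quantile is estimated. The function names, the sample tuples, and the host names are hypothetical; the interpolation is a simplified approximation of what Prometheus-style engines do.

```python
import math
from collections import defaultdict


def sum_by_le_and_cluster(samples):
    """Add cumulative bucket counts across hosts, keeping (cluster, le).

    samples: iterable of (cluster, host, le, cumulative_count) tuples.
    Returns {cluster: {le: summed_count}}.
    """
    per_cluster = defaultdict(lambda: defaultdict(float))
    for cluster, _host, le, count in samples:
        per_cluster[cluster][le] += count
    return {cluster: dict(buckets) for cluster, buckets in per_cluster.items()}


def histogram_quantile(q, buckets):
    """Estimate quantile q from {upper_bound: cumulative_count}.

    Simplified: linear interpolation inside the bucket that contains the
    target rank; the lowest bucket is assumed to start at 0.
    """
    bounds = sorted(buckets)
    if len(bounds) < 2 or bounds[-1] != math.inf:
        return math.nan
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Two hypothetical hosts in one hypothetical cluster, identical bucket layout.
samples = [
    ("perf-a", "db0", 0.1, 10), ("perf-a", "db0", 0.5, 20), ("perf-a", "db0", math.inf, 20),
    ("perf-a", "db1", 0.1, 0), ("perf-a", "db1", 0.5, 20), ("perf-a", "db1", math.inf, 20),
]
merged = sum_by_le_and_cluster(samples)
p50 = histogram_quantile(0.5, merged["perf-a"])
```

Summing the buckets first and taking the quantile second is what keeps the result meaningful: averaging per-host quantiles would not give a cluster-level percentile.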
PR Review
Overall: Good multi-cluster foundation. The Kotlin changes are minimal and clean.
OpenSpec:
Code Review
Overall this is a solid PR - the
1. OTel:
PR Review: Multi-cluster dashboards
Overall this is a solid, well-scoped change. The OpenSpec design docs are thorough and the Kotlin changes are small and focused. A few things worth discussing:
PR Review: Multi-Cluster Dashboards
Overall this is a solid feature. The cluster variable and ad hoc filter additions are consistent across all dashboards, and the
Bug:
PR Review: Multi-cluster dashboards
Good, well-scoped change. The OpenSpec-driven approach with clear design docs is solid. A few things worth looking at:
OTel: In
Potential double-stamping of
The prometheus scrape configs already set
Positive notes
…luster comparison dashboard
- Add cluster multi-select variable and adhocfilters to all 11 dashboards
- Fix adhocfilters to include applyMode and baseFilters required by Grafana 10+
- Add VictoriaMetrics httpMethod=POST for label API compatibility
- Use name-UUID format for cluster metric label to ensure uniqueness across clusters
- Add new cluster-comparison.json dashboard with 8 sections showcasing all panel types: Polystat, Table, Time Series, Bar Chart, Status History, Heatmap, Percentile Ladder
- Add stress job section showing client-observed throughput and latency per cluster
- Register GrafanaDashboard.CLUSTER_COMPARISON as optional enum entry
…on to Cassandra Cluster Comparison
- Remove p98 from read/write latency panels, switch to $__rate_interval
- Drop "All Coordinator Latencies (p99)" panel (wrong metric, redundant)
- Drop "Nodes Status" section (inaccurate)
- Add "Per-Node Latency" section with read/write p99 per host_name
- Add Compaction Throughput panel (Bps) alongside Pending Compactions
- Fix CPU Utilization legend to include cluster name

- Active Tasks converted to full-width stacked area chart (first in section)
- Reorder: Active → Pending → Blocked → Dropped Messages → Hinted Handoff
- Dropped Messages and Hinted Handoff shrunk to w=8 to match 3-panel rows
- Shift Hardware/OS and JVM sections down to avoid overlap
- Drop Disk Space Usage panel (filesystem metrics not available from OTel container)

…to stacked area
- Rename "Request Throughputs (Coordinator Perspective)" row to "Cluster Overview"
- Convert Request Throughputs panel to stacked area (fillOpacity=80, lineWidth=0)

…improvements
- Convert Error Throughputs to stacked area chart
- Remove Read/Write Distribution panel (redundant with stacked throughput)
- Move Pending Compactions into Cluster Overview section
- Move Hardware/OS section to second position for faster triage
- Filter Hardware/OS panels to db nodes only (host_name=~"db.*")

…gend order
- Move Active Tasks by Pool into Cluster Overview section
- Swap legend order to host/pool first, cluster second across all panels

…mpaction throughput
- Add SSTables per Read (p99) time series panel to Data Status section
- Fix Compaction Throughput to show per-host instead of cluster total
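The commits above combine two ideas: a name-UUID cluster label for uniqueness, and a multi-select cluster variable matched with cluster=~"$cluster". The Python sketch below illustrates how they fit together under stated assumptions. The helper names are hypothetical and are not the PR's Kotlin clusterLabelName(); the separator follows the {name}-{clusterId} format quoted in the review; and the way selected values are joined into a regex is only an approximation of what Grafana does when it interpolates a multi-value variable into a regex matcher.

```python
import re


def cluster_label(name, cluster_id):
    # Hypothetical helper: mirrors the {name}-{clusterId} format from the review.
    return f"{name}-{cluster_id}"


def multi_select_regex(selected):
    # Approximation of multi-value interpolation: escaped values joined with "|".
    return "(" + "|".join(re.escape(value) for value in selected) + ")"


def label_matches(label_value, pattern):
    # Prometheus-style regex matchers are anchored, so use a full match.
    return re.fullmatch(pattern, label_value) is not None


# Two clusters that share a name still get distinct labels (IDs are made up).
first = cluster_label("perf", "1111")
second = cluster_label("perf", "2222")

only_first = multi_select_regex([first])
both = multi_select_regex([first, second])

series = [{"cluster": first, "host_name": "db0"}, {"cluster": second, "host_name": "db0"}]
visible = [s for s in series if label_matches(s["cluster"], only_first)]
```

Because the match is anchored, selecting one cluster does not accidentally include another whose label merely starts with the same name, which is the property the name-UUID format relies on.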
Force-pushed from 7433108 to 5d22765
PR Review: Add multi-cluster dashboards
Overall this is a solid set of changes that properly wires multi-cluster awareness end-to-end through the observability stack. A few things worth discussing below.
| clusterLabelName(): no length cap | Low risk, but worth a comment |
| SetupInstance: CLUSTER_NAME format change | Needs verification; may break stress node driver connection if used for Cassandra cluster name matching |
| GrafanaUpdateConfig: removed null fallback | Low risk if ClusterState is always populated here |
| ClickHouse datasource removed while dashboards remain | Needs clarification; will cause broken datasource errors in Grafana |
| OTel insert vs upsert | Minor, low impact |
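The "OTel insert vs upsert" row refers to two attribute actions whose general semantics differ in one way: insert only sets a key that is missing, while upsert sets it whether or not it already exists. The review's detail on which one this PR uses is cut off above, so the Python sketch below only illustrates that general difference. The function name, attribute values, and the scenario of a label already set at scrape time are assumptions for illustration.

```python
def apply_attribute_action(attributes, key, value, action):
    """Return a copy of attributes with one insert or upsert action applied."""
    result = dict(attributes)
    if action == "insert":
        # Only add the key when it is absent; an existing value wins.
        result.setdefault(key, value)
    elif action == "upsert":
        # Add the key, or overwrite whatever value is already there.
        result[key] = value
    else:
        raise ValueError(f"unsupported action: {action}")
    return result


# Hypothetical case: the label was already stamped earlier in the pipeline.
already_labelled = {"cluster": "from-scrape-config"}

kept = apply_attribute_action(already_labelled, "cluster", "from-collector", "insert")
replaced = apply_attribute_action(already_labelled, "cluster", "from-collector", "upsert")
added = apply_attribute_action({}, "cluster", "from-collector", "insert")
```

The two actions only diverge when the label is already present, which is consistent with the row being rated minor.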
No description provided.