Add multi-cluster dashboards: cluster variable, ad hoc filters, and cluster comparison dashboard#604

Open
rustyrazorblade wants to merge 10 commits into main from multi-cluster-dashboards

Conversation

@rustyrazorblade
Owner

No description provided.

@claude

claude bot commented Mar 30, 2026

PR Review: Multi-cluster dashboards

The bulk of this PR (~6000 lines) is JSON reformatting of existing dashboards from compact to pretty-printed format — functional no-ops. The substantive changes are: new clusterLabelName() extension, new cluster comparison dashboard, ClickHouse datasource removal, VictoriaMetrics httpMethod:POST, and cluster variable support in ClickHouse dashboards.

ClickHouse datasource removal: The GrafanaDatasource entry for ClickHouse is removed from GrafanaDatasourceConfig, but GrafanaDashboard.CLICKHOUSE and GrafanaDashboard.CLICKHOUSE_LOGS remain registered. ClickHouse dashboards will now have no provisioned datasource after grafana update-config runs. I understand the old config was broken (hardcoded db0 hostname), but removing without a replacement leaves ClickHouse users with broken dashboards. Was this intentional? If ClickHouse datasource configuration is being deferred to a separate mechanism, that should be documented.

SetupInstance cluster name change: This changes the CLUSTER_NAME env var written to stress instances from just the cluster name to {name}-{clusterId}. This makes the label consistent with what OTel collector assigns, which is the right call for multi-cluster support. Worth calling out explicitly that re-running setup-instance on an existing stress node will change its reported cluster label, which could affect Grafana filter continuity mid-run.

OpenSpec change not archived: openspec/changes/cluster-comparison-dashboard/ is present but not archived. If the implementation is complete, it should be moved to archive/ as part of this PR.

httpMethod: POST for VictoriaMetrics: Good change — avoids URL length limits on large PromQL queries.

Cluster comparison dashboard: Well-structured. histogram_quantile with sum by (le, cluster) is mathematically valid for per-cluster aggregation. The seven-section layout covering Polystat, Time Series, Bar Chart, Status History, Heatmap, and Percentile Ladder is a solid reference implementation. All panels correctly use cluster=~"$cluster" regex-match to support multi-select. The cluster2 variable for side-by-side heatmaps is a nice touch.
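The per-cluster quantile pattern the review endorses can be sketched as follows (the metric name is a placeholder, not necessarily the series the dashboard actually uses):

```promql
# p99 latency per cluster; aggregating buckets by (le, cluster) before
# histogram_quantile keeps the math valid per cluster.
# "cassandra_read_latency_bucket" is an illustrative metric name.
histogram_quantile(
  0.99,
  sum by (le, cluster) (
    rate(cassandra_read_latency_bucket{cluster=~"$cluster"}[$__rate_interval])
  )
)
```

The regex matcher `cluster=~"$cluster"` is what allows the variable's multi-select to expand to `cluster1|cluster2|...`.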

JSON reformatting: The pretty-printing changes inflate this diff significantly and make the real changes hard to find. Consider separating these into a dedicated commit in future PRs.

@claude

claude bot commented Mar 30, 2026

PR Review

Overall: Good multi-cluster foundation. The Kotlin changes are minimal and clean.


OpenSpec: cluster-comparison-dashboard not archived

openspec/changes/cluster-comparison-dashboard/ is present but unarchived while multi-cluster-dashboards is archived. If implementation is complete, move it to archive/ before merging.


ClickHouse datasource removal — context clarification

The existing review comment flags this as potentially breaking, but the spec correctly justifies it: the native ClickHouse datasource was confirmed unused (all ClickHouse dashboard panels query VictoriaMetrics). Removing it is the right call. No issue here.


clusterLabelName() — potential label continuity issue

fun ClusterState.clusterLabelName(): String = "$name-$clusterId"

This is a good addition and consistent with clusterPrefix() and metricsConfigId(). However, SetupInstance uses this to set CLUSTER_NAME on stress nodes. For any existing running cluster, re-running setup-instance will silently change the label OTel reports (from bare name to name-clusterId), breaking Grafana filter continuity mid-run. Worth a note in the command output or docs for operators with existing clusters.


httpMethod: POST for VictoriaMetrics — good change

Correct fix for large PromQL queries that exceed URL length limits. No issues.


JSON reformatting

The ~8000-line diff is dominated by pretty-printing existing dashboard JSON. The actual functional changes are small. Consider separating formatting-only commits in future PRs to make functional changes easier to review.


Blocking: Archive the cluster-comparison-dashboard OpenSpec change. Everything else is non-blocking.

@claude

claude bot commented Mar 30, 2026

Code Review

Overall this is a solid PR - the clusterLabelName() extension cleanly centralizes the cluster identifier format, the dashboard variable work is consistent, and the httpMethod: POST addition for VictoriaMetrics is a good operational improvement. A few things worth checking:


1. OTel: action: insert will not override existing cluster labels

In otel-collector-config.yaml, the new processor uses action: insert. The insert action is a no-op if the attribute already exists. MAAC (the Cassandra metrics exporter) ships its own cluster label using Cassandra's internal cluster name. MAAC metrics that already carry a cluster attribute will not get the {name}-{clusterId} value stamped on them.

The dashboard queries filter on cluster=~"$cluster". If the $cluster variable is populated from the edl cluster label but MAAC metrics carry a different internal value, those panels will show no data.

Consider: Should this be action: upsert? Or is the intent to let MAAC's internal cluster name win - in which case the dashboard variable must be populated from MAAC label values rather than the env-var-derived one?
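For reference, the difference between the two actions in the resource processor looks like this (a sketch of the config shape; whether upsert is the right choice depends on which source should win):

```yaml
processors:
  resource/cluster:
    attributes:
      - key: cluster
        value: ${env:CLUSTER_NAME}
        # "insert" is a no-op when the attribute already exists (MAAC wins);
        # "upsert" overwrites it (the env-var-derived label wins)
        action: upsert
```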


2. ClickHouse datasource removed but ClickHouse dashboards remain

GrafanaDatasourceConfig.kt removes the ClickHouse datasource provisioning entry, but clickhouse.json and clickhouse-logs.json are still present and reference grafana-clickhouse-datasource. Any user running a ClickHouse cluster will see "datasource not found" on those dashboards after this change.

Was this intentional? If ClickHouse requires a separate opt-in provision step, that should be documented.


3. metrics/otlp pipeline does not get the cluster label

The resource/cluster processor is added only to metrics/local. OTLP metrics (Spark, etc.) flow through metrics/otlp which only has batch. If multi-cluster filtering is expected to work on Spark metrics, they will be missing the cluster label entirely. Fine to defer if out of scope, but worth a follow-up ticket.


Minor

  • clusterLabelName() has no length guard - metricsConfigId() uses .take(Constants...) because it ends up in K8s resource names. Label values do not have that constraint, so this is fine; just noting it is intentional.
  • Comment fix in MetricsCollector.kt (cassandra-condensed.json to cassandra-overview.json) is correct and appreciated.
  • Replacing clusterState.initConfig?.name ?: "cluster" with clusterLabelName() is a good cleanup - it eliminates a silent fallback that would have made multi-cluster filtering ambiguous.

@claude

claude bot commented Mar 31, 2026

PR Review: Multi-cluster dashboards

Overall this is a solid, well-scoped change. The OpenSpec design docs are thorough and the Kotlin changes are small and focused. A few things worth discussing:


metrics/otlp pipeline missing resource/cluster processor

The resource/cluster processor is added to metrics/local but not metrics/otlp (Spark nodes). The comment says Spark nodes "already have host.name and node_role" but doesn't address the cluster label. If Spark metrics are ever visualized in the cluster-comparison dashboard or filtered by cluster=~"\$cluster", they'll silently show no data. Worth confirming whether Spark OTLP traffic already carries the cluster label from its sender, or whether this pipeline also needs the processor.


clusterLabelName() — no length guardrail

metricsConfigId() truncates via .take(Constants.S3.MAX_METRICS_CONFIG_ID_LENGTH) to satisfy S3 API limits. clusterLabelName() returns the raw "$name-$clusterId" string without any truncation. Prometheus label values have no hard length limit, so this is likely fine in practice, but it's worth noting the inconsistency — especially since clusterId is a UUID, making the resulting string predictably long.
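If a cap ever becomes desirable, a truncating variant in the style of metricsConfigId() could look like this (a sketch; the constant name and value are hypothetical, since Prometheus imposes no hard limit):

```kotlin
// Hypothetical cap, purely for readability in Grafana legends;
// Prometheus label values have no hard length limit.
const val MAX_CLUSTER_LABEL_LENGTH = 63

// Free-function sketch of the extension's logic with an optional cap.
fun clusterLabelName(name: String, clusterId: String): String =
    "$name-$clusterId".take(MAX_CLUSTER_LABEL_LENGTH)

fun main() {
    // A typical name plus UUID fits well under the cap.
    println(clusterLabelName("demo", "123e4567-e89b-12d3-a456-426614174000"))
}
```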


cassandra-condensed.json removal — silent breaking change for existing deployments

Existing deployments that previously ran grafana update-config will have a grafana-dashboard-cassandra-condensed ConfigMap and a corresponding Grafana volume mount still in place. Since the enum entry is gone, re-running grafana update-config will no longer provision or clean up that ConfigMap. The old dashboard will keep appearing in Grafana until someone manually deletes the stale ConfigMap. Is there cleanup logic elsewhere that handles removed enum entries, or does this need a migration note in the docs?


httpMethod: POST for VictoriaMetrics — good defensive fix

This is a correct addition. Instant queries sent to VictoriaMetrics for complex label_values(...) variable expressions can exceed URI length limits over GET; POST avoids that. No concerns here.


Dashboard JSON formatting churn

The bulk of the line count in this PR (~4000+ lines) is JSON reformatting — compact inline objects expanded to multi-line. This has no functional effect but makes the diff very hard to review. If there's a canonical formatter for these files it would be worth running consistently. Not a blocker, just a note for future reviewers.


Minor: MetricsCollector.kt comment update is correct

// Cassandra queries — match Grafana cassandra-overview.json dashboard — this is the right reference now that cassandra-condensed.json is removed.

@claude

claude bot commented Mar 31, 2026

PR Review: Multi-Cluster Dashboards

Overall this is a solid feature. The cluster variable and ad hoc filter additions are consistent across all dashboards, and the clusterLabelName() abstraction cleanly centralizes the name-clusterId format. A few issues worth addressing before merge:


Bug: resource/cluster processor only applied to metrics/local

In otel-collector-config.yaml, the new processor is wired into metrics/local but not metrics/otlp or the log pipelines:

metrics/local:
  processors: [resource/cluster, resourcedetection, k8sattributes, batch]
metrics/otlp:
  processors: [batch]   # ← no resource/cluster
logs/local:
  processors: [resourcedetection, k8sattributes, batch]   # ← no resource/cluster
logs/otlp:
  processors: [batch]   # ← no resource/cluster

Spark/EMR metrics arriving via OTLP won't carry a cluster attribute, so the cluster variable in those dashboards will silently exclude them. If the ad hoc filter panel is expected to work across logs too, the log pipelines need the processor as well.
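If cluster filtering is expected to work everywhere, the wiring the review is asking about might look like this (a sketch; pipeline names are copied from the quoted config, and whether the log pipelines actually need the processor is the open question):

```yaml
service:
  pipelines:
    metrics/local:
      processors: [resource/cluster, resourcedetection, k8sattributes, batch]
    metrics/otlp:
      # stamp OTLP (Spark/EMR) metrics too, so cluster=~"$cluster" matches them
      processors: [resource/cluster, batch]
    logs/local:
      processors: [resource/cluster, resourcedetection, k8sattributes, batch]
    logs/otlp:
      processors: [resource/cluster, batch]
```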


TemplateService.CLUSTER_NAME not updated

TemplateService.buildContextVariables() still returns:

"CLUSTER_NAME" to (state.initConfig?.name ?: "cluster"),

This value ends up in __CLUSTER_NAME__ substitutions — notably pyroscope/config.alloy — so Pyroscope profiles will carry the old bare-name cluster label rather than name-clusterId. Profiles won't align with the cluster selector in multi-cluster views. Even if Pyroscope isn't the focus of this PR, it's worth updating for consistency.


Silent ClickHouse datasource removal

Removing the ClickHouse datasource from GrafanaDatasourceConfig means the ClickHouse dashboards will have a broken datasource on grafana update-config. Anyone running ClickHouse clusters will see red panels with no explanation. Either:

  • Document the manual steps to re-add it, or
  • Add an isOptional / conditional path so it's only provisioned when ClickHouse nodes exist in the cluster state

Minor: cassandra-condensed.json removed without a deprecation note

The enum entry and JSON file are gone cleanly, but if there are existing Grafana instances with this dashboard already provisioned (old ConfigMap still present), the volume mount disappears and Grafana logs warnings. Not a blocker, but worth a note in the changelog or upgrade guide.


Positives

  • clusterLabelName() is a clean, well-placed abstraction — calling it name-clusterId matches the existing clusterPrefix() convention.
  • Using the cluster-config ConfigMap to inject CLUSTER_NAME at runtime into both the OTel collector and the stress sidecar is the right K8s pattern.
  • httpMethod: POST on VictoriaMetrics datasource is a good default — avoids URL-length issues with complex PromQL.
  • The cluster comparison dashboard follows the same panel structure as the existing Cassandra overview, making it easy to reason about.

@claude

claude bot commented Mar 31, 2026

PR Review: Multi-cluster dashboards

Good, well-scoped change. The OpenSpec-driven approach with clear design docs is solid. A few things worth looking at:


OTel: resource/cluster only stamps local metrics

In otel-collector-config.yaml, resource/cluster is added to metrics/local but not metrics/otlp. Spark/EMR metrics sent via OTLP won't carry the cluster label. If the cluster comparison dashboard is intentionally Cassandra/host-metrics-only this is fine, but worth a comment in the YAML to make the intent explicit.


Potential double-stamping of cluster on prometheus metrics

The prometheus scrape configs already set cluster via relabeling (replacement: '${env:CLUSTER_NAME}'). The new resource/cluster processor also inserts cluster as a resource attribute, which resource_to_telemetry_conversion: enabled: true promotes to a metric label on export. For hostmetrics this is the right path (no existing label). For prometheus-scraped metrics the relabeling already handles it, making the resource processor redundant for those metrics. OTel metric attributes take precedence over resource-derived ones, so no errors expected, but worth a quick check in VictoriaMetrics to confirm no duplicate label issues.
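The two paths that can each set the label look roughly like this (a sketch; the scrape job name is an assumption):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cassandra   # hypothetical job name
          relabel_configs:
            # path 1: relabeling stamps `cluster` as a metric label at scrape time
            - target_label: cluster
              replacement: '${env:CLUSTER_NAME}'
processors:
  resource/cluster:
    attributes:
      # path 2: resource attribute, promoted to a metric label on export
      # when resource_to_telemetry_conversion is enabled
      - key: cluster
        value: ${env:CLUSTER_NAME}
        action: insert
```

Since metric attributes win over resource-derived ones on conversion, the redundancy should be harmless, but it is worth confirming in VictoriaMetrics.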


SetupInstance.kt behavioral change

clusterName changes from clusterState.name to clusterState.clusterLabelName() (name-clusterId). This is consistent with GrafanaUpdateConfig now writing the same format to the cluster-config ConfigMap, so both OTel paths agree. For existing clusters, if only one of the two commands is re-run, the cluster label will be inconsistent between metrics sources until both are run. Minor, but worth a note in upgrade docs.


Positive notes

  • clusterLabelName() is clean and well-documented, follows the existing pattern in ClusterStateExtensions.kt. The separation from metricsConfigId() (which has the edl- prefix and length cap) is intentional and correct.
  • Removing the dead ClickHouse native datasource is good cleanup.
  • jsonData = mapOf("httpMethod" to "POST") on VictoriaMetrics is needed for label_values() queries — good catch.
  • Consolidating CASSANDRA_CONDENSED into CASSANDRA_OVERVIEW reduces duplication cleanly.
  • CLUSTER_COMPARISON with optional = true is the right call.

…luster comparison dashboard

- Add cluster multi-select variable and adhocfilters to all 11 dashboards
- Fix adhocfilters to include applyMode and baseFilters required by Grafana 10+
- Add VictoriaMetrics httpMethod=POST for label API compatibility
- Use name-UUID format for cluster metric label to ensure uniqueness across clusters
- Add new cluster-comparison.json dashboard with 8 sections showcasing all panel types:
  Polystat, Table, Time Series, Bar Chart, Status History, Heatmap, Percentile Ladder
- Add stress job section showing client-observed throughput and latency per cluster
- Register GrafanaDashboard.CLUSTER_COMPARISON as optional enum entry
- Remove p98 from read/write latency panels, switch to $__rate_interval
- Drop "All Coordinator Latencies (p99)" panel (wrong metric, redundant)
- Drop "Nodes Status" section (inaccurate)
- Add "Per-Node Latency" section with read/write p99 per host_name
- Add Compaction Throughput panel (Bps) alongside Pending Compactions
- Fix CPU Utilization legend to include cluster name
- Active Tasks converted to full-width stacked area chart (first in section)
- Reorder: Active → Pending → Blocked → Dropped Messages → Hinted Handoff
- Dropped Messages and Hinted Handoff shrunk to w=8 to match 3-panel rows
- Shift Hardware/OS and JVM sections down to avoid overlap
- Drop Disk Space Usage panel (filesystem metrics not available from OTel container)
…to stacked area

- Rename "Request Throughputs (Coordinator Perspective)" row to "Cluster Overview"
- Convert Request Throughputs panel to stacked area (fillOpacity=80, lineWidth=0)
…improvements

- Convert Error Throughputs to stacked area chart
- Remove Read/Write Distribution panel (redundant with stacked throughput)
- Move Pending Compactions into Cluster Overview section
- Move Hardware/OS section to second position for faster triage
- Filter Hardware/OS panels to db nodes only (host_name=~"db.*")
…gend order

- Move Active Tasks by Pool into Cluster Overview section
- Swap legend order to host/pool first, cluster second across all panels
…mpaction throughput

- Add SSTables per Read (p99) time series panel to Data Status section
- Fix Compaction Throughput to show per-host instead of cluster total
rustyrazorblade force-pushed the multi-cluster-dashboards branch from 7433108 to 5d22765 on March 31, 2026 at 03:08
@claude

claude bot commented Mar 31, 2026

PR Review: Add multi-cluster dashboards

Overall this is a solid set of changes that properly wires multi-cluster awareness end-to-end through the observability stack. A few things worth discussing below.


ClusterStateExtensions.kt — clusterLabelName()

The new extension function is clean and follows the existing file patterns. One minor observation: metricsConfigId() trims to Constants.K8S.MAX_LABEL_LENGTH to avoid K8s label truncation issues. clusterLabelName() doesn't. Since this value ends up as a Prometheus label value (not a K8s label key/name), there's no hard limit — but if cluster names are long, the resulting label like my-very-long-cluster-name-<uuid> could be unwieldy in Grafana. Worth a comment acknowledging this isn't subject to the same constraint, or applying a reasonable cap if you expect long names.


SetupInstance.kt — CLUSTER_NAME in systemd unit

val clusterName = clusterState.clusterLabelName()  // was: clusterState.name

This writes {name}-{clusterId} into /etc/default/cassandra as CLUSTER_NAME. Stress nodes read this to set cassandra.cluster_name for the driver connection. That value is matched against what the actual Cassandra cluster reports as its name — if they diverge, the driver will refuse to connect.

Is CLUSTER_NAME in stress/Cassandra config used for Cassandra cluster name matching, or just for metric labels? If it feeds into cassandra.cluster_name driver config, this is a latent bug. Worth verifying what the stress tooling does with this variable before merging.


GrafanaUpdateConfig.kt — removed null fallback

// Before
"cluster_name" to (clusterState.initConfig?.name ?: "cluster")
// After
"cluster_name" to clusterState.clusterLabelName()

clusterLabelName() accesses name and clusterId directly on ClusterState. If either can be empty/blank at this point in the call, the label would be malformed (e.g., -<uuid> or name-). The previous code at least had a fallback. This is probably fine in practice since ClusterState should always be populated by the time grafana update-config runs, but it's worth confirming there's no path where name is blank.


GrafanaDatasourceConfig.kt — ClickHouse datasource removed

The ClickHouse datasource is completely gone. The clickhouse-logs.json and clickhouse.json dashboards still exist in the dashboards/ directory and are still registered via GrafanaDashboard (I see them listed in the enum). If those dashboards reference the ClickHouse datasource by UID, they'll show "datasource not found" errors in Grafana.

Was this intentional? If ClickHouse is being phased out, those dashboard JSON files and their enum entries should also be removed. If it's still supported but the datasource registration was accidentally dropped, it should be added back.


otel-collector-config.yaml — resource/cluster processor ordering

processors: [resource/cluster, resourcedetection, k8sattributes, batch]

Good placement — runs before resourcedetection so the cluster attribute is set before any overrides could happen. The metrics/otlp pipeline intentionally skips this (OTLP metrics from Spark already carry their labels), which is correct.

One thing: the insert action only adds the attribute if it doesn't already exist. Since this is the authoritative source for local-scraped metrics, upsert might be more appropriate to guarantee consistency even if something upstream sets it unexpectedly. Minor point.


cluster-comparison.json dashboard

The dashboard is focused and well-structured. A couple observations:

  • The $cluster template variable uses query_result with label_values — make sure the VictoriaMetrics query returns the {name}-{uuid} format consistently once this pipeline change is live. The Grafana variable display will show UUIDs which are not user-friendly. Consider using a regex or display name override on the variable to show just the name portion.
  • Fleet Snapshot table panel: thresholds look reasonable (p99 write: orange at 5ms, red at 20ms).
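One option for hiding the UUID in the dropdown is Grafana's variable regex with named capture groups, where text controls the display and value controls what the query substitutes. A sketch of the variable definition (the nested-group approach and exact semantics should be verified against Grafana's template-variable documentation before relying on it):

```json
{
  "name": "cluster",
  "regex": "/^(?<value>(?<text>.+)-[0-9a-f-]{36})$/"
}
```

Here value would keep the full {name}-{uuid} string so queries still match, while text shows only the name portion.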

Missing test coverage

No tests were added for clusterLabelName(). Given the function is simple and has no side effects, a test isn't critical, but given the OTel metric label format is now a contract (Grafana queries depend on {name}-{clusterId}), a test documenting the expected format would be valuable insurance against accidental changes.
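A minimal format-contract test along those lines might look like this (a sketch; ClusterState is a simplified stand-in for the real class, and the extension is inlined for self-containment):

```kotlin
// Simplified stand-in for the real ClusterState; field names are illustrative.
data class ClusterState(val name: String, val clusterId: String)

fun ClusterState.clusterLabelName(): String = "$name-$clusterId"

fun main() {
    val state = ClusterState(name = "perf-test", clusterId = "abc123")
    // Grafana queries depend on exactly this "{name}-{clusterId}" shape,
    // so the format is a contract worth pinning down in a test.
    check(state.clusterLabelName() == "perf-test-abc123")
    println("label format contract holds")
}
```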


Summary

| Item | Assessment |
| --- | --- |
| clusterLabelName() — no length cap | Low risk, but worth a comment |
| SetupInstance CLUSTER_NAME format change | Needs verification — may break stress node driver connection if used for Cassandra cluster name matching |
| GrafanaUpdateConfig — removed null fallback | Low risk if ClusterState is always populated here |
| ClickHouse datasource removed while dashboards remain | Needs clarification — will cause broken datasource errors in Grafana |
| OTel insert vs upsert | Minor, low impact |
