Metrics for new grafana graphs #92

Open
jsafy1 wants to merge 6 commits into main from js/new-grafana-graphs

Conversation

jsafy1 (Contributor) commented Feb 27, 2026

Why this should be merged

Support the following requested metrics:

Producer metrics in blockfetcher:
indexer_producer_messages_total{status}
indexer_producer_produce_duration_seconds
indexer_producer_errors_total{type}
Latency from getting a block to publishing its message to Kafka

Retry/failure metrics in sliding-window manager:
indexer_block_retries_total
indexer_block_failures_total{stage}

True Kafka consumer lag to topic high watermark (not just commit-window lag):
indexer_kafka_consumer_group_lag{partition}
Kafka message size 

ConsumerIndexer:
Latency from consuming a message to writing to ClickHouse

ClickHouse write metrics in consumerindexer:
indexer_clickhouse_writes_total{table,status}
indexer_clickhouse_write_duration_seconds{table}

How this works

Adds new metrics instrumentation at different stages of our pipeline. A rough sketch of the producer-side recording is shown below.
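For illustration only, a hedged sketch of what the producer-side recording could look like in a worker. The producerMetrics interface and its method names are stand-ins for whatever pkg/metrics actually exposes for the metrics listed above, not the PR's real API:

package worker

import (
	"time"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// producerMetrics stands in for the relevant slice of pkg/metrics;
// the method names here are hypothetical.
type producerMetrics interface {
	RecordProducerMessage(err error)              // indexer_producer_messages_total{status}
	ObserveProduceDuration(seconds float64)       // indexer_producer_produce_duration_seconds
	ObserveBlockToPublishLatency(seconds float64) // getting-block-to-publish latency
}

// publishBlock records the three producer metrics around a single produce call.
func publishBlock(m producerMetrics, p *kafka.Producer, msg *kafka.Message, fetchedAt time.Time) error {
	start := time.Now()
	err := p.Produce(msg, nil) // async produce; delivery reports arrive on p.Events()
	m.RecordProducerMessage(err)
	m.ObserveProduceDuration(time.Since(start).Seconds())
	m.ObserveBlockToPublishLatency(time.Since(fetchedAt).Seconds())
	return err
}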

How this was tested

Local setup

Need to be documented in RELEASES.md?

No

Copilot AI review requested due to automatic review settings February 27, 2026 17:17
@jsafy1 jsafy1 requested review from a team, allenz682 and nbrons as code owners February 27, 2026 17:17
Copilot AI left a comment

Pull request overview

This pull request adds comprehensive metrics instrumentation to support new Grafana monitoring dashboards. The changes introduce producer metrics, retry/failure metrics, true Kafka consumer lag tracking, message size tracking, and ClickHouse write metrics across the indexing pipeline.

Changes:

  • Added producer-side metrics for Kafka message publishing with error classification
  • Added retry and failure tracking metrics in the sliding window manager with stage-specific counters
  • Implemented true consumer group lag tracking using Kafka broker watermark offsets
  • Added Kafka message size histograms for both produced and consumed messages
  • Added ClickHouse write metrics with per-table duration and status tracking
  • Refactored metrics nil-checking pattern to rely on receiver nil checks rather than caller checks (see the sketch below)
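The last point is worth a sketch. A minimal example of the nil-receiver pattern, assuming Metrics is a pointer type; the field and method shapes are illustrative rather than the PR's exact definitions:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Metrics methods are safe on a nil receiver, so call sites can drop
// their own "if x.metrics != nil" guards.
type Metrics struct {
	receiptFetchDuration prometheus.Histogram
}

func (m *Metrics) RecordReceiptFetch(err error, durationSeconds float64, retries int) {
	if m == nil {
		return // no-op when metrics are disabled
	}
	m.receiptFetchDuration.Observe(durationSeconds)
}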

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Summary per file:

pkg/metrics/metrics.go: Added new metric definitions (producer, retry/failure, consumer lag, message size, ClickHouse writes), NewNoOp() helper, and error classification logic
pkg/slidingwindow/worker/subnet_evm.go: Added producer metrics, block-to-publish latency tracking, message size observation, removed nil checks
pkg/slidingwindow/worker/coreth.go: Added producer metrics, block-to-publish latency tracking, message size observation, removed nil checks, updated comments
pkg/slidingwindow/manager.go: Added stage parameter to handleFailure, integrated retry/failure metrics, removed nil checks
pkg/kafka/offset_manager.go: Added topic tracking to offsetState, implemented recordConsumerGroupLag for true lag metrics, cleanup of lag metrics on partition revocation
pkg/kafka/consumer.go: Added message size observation for consumed messages
pkg/kafka/processor/coreth.go: Added ClickHouse write metrics per table, moved processing-duration recording to measure the full pipeline, updated comments
pkg/metrics/dashboards/avalanche-indexer-metrics-template.json: Added comprehensive Grafana dashboard with panels for all new metrics


-if cw.metrics != nil {
-	cw.metrics.RecordReceiptFetch(err, receiptDuration.Seconds(), 0)
-}
+cw.metrics.RecordReceiptFetch(err, receiptDuration.Seconds(), 0)
Collaborator left a comment

Curious, why are we using seconds here instead of ms?

-if err := p.blocksRepo.WriteBlock(ctx, blockRow); err != nil {
+writeStart := time.Now()
+err = p.blocksRepo.WriteBlock(ctx, blockRow)
+p.metrics.RecordClickHouseWrite(clickHouseTableBlocks, err, time.Since(writeStart).Seconds())
Collaborator left a comment

As we grow our metrics plumbing, I am seeing us fall into the old trap we got ourselves into with analytics: it's getting confusing to figure out where metrics functions live. Ideally, we have a shared pkg/metrics that holds the most generic, reused pieces, including very common labels like chain_id or mainnetOrTestnet. For things like RecordClickHouseWrite, I think we should colocate the helper functions. One way to do this would be something like

recordClickHouseWrite(p.metrics, err, time.Since(writeStart).Seconds())

where recordClickHouseWrite() is a private function in this processor pkg.

This would also address the scattered constants we're already seeing, such as the clickHouseTable* consts above. If this breaks avalanchego's practice, I would say this is a case where avalanchego did it wrong.

Can we make this change?
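A minimal sketch of that proposal, assuming the shared package keeps a generic RecordClickHouseWrite(table, err, seconds) method as in the diff above; the import path is illustrative:

package processor

import (
	"example.com/indexer/pkg/metrics" // illustrative import path
)

// Private to this processor package: owns the table name so call sites
// never pass naked "raw_blocks" strings or touch generic plumbing directly.
const clickHouseTableBlocks = "raw_blocks"

func recordClickHouseWrite(m *metrics.Metrics, err error, durationSeconds float64) {
	m.RecordClickHouseWrite(clickHouseTableBlocks, err, durationSeconds)
}

Call sites then shrink to recordClickHouseWrite(p.metrics, err, time.Since(writeStart).Seconds()).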

// On successful processing, commits offset. On failure, publishes to DLQ (if configured) before committing.
func (c *Consumer) dispatch(ctx context.Context, msg *cKafka.Message) {
	if msg != nil {
		c.metrics.ObserveKafkaMessageSize("consumed", len(msg.Value))
Collaborator left a comment

Another example of a naked string literal here. Metrics helper functions would be cleaner.
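A sketch of named label values in pkg/metrics (these constants would also cover the "produced" literal flagged further down); the field name and helper body are illustrative:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Named label values instead of naked strings at call sites.
const (
	DirectionConsumed = "consumed"
	DirectionProduced = "produced"
)

type Metrics struct {
	kafkaMessageSize *prometheus.HistogramVec // labelled by direction
}

func (m *Metrics) ObserveKafkaMessageSize(direction string, sizeBytes int) {
	if m == nil {
		return
	}
	m.kafkaMessageSize.WithLabelValues(direction).Observe(float64(sizeBytes))
}

The call site above then becomes c.metrics.ObserveKafkaMessageSize(metrics.DirectionConsumed, len(msg.Value)).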

// When no committed offset exists (offset < 0), lag is estimated based on auto.offset.reset:
// zero for "latest", full partition range for "earliest" or unknown values.
// Skipped in dryRun mode or when metrics is nil.
func (om *OffsetManager) recordConsumerGroupLag(dryRun bool) {
Collaborator left a comment

If we define the gauge directly in this package, it'd be easier to track down metric definitions without having to jump out of this folder. This is clearly a private function whose sole purpose is to record group lag. This is quite an interesting way to implement it, by the way: typical practice is to build dashboards from metrics emitted directly by Kafka, but that makes dashboarding much more complicated, so I like that you just included it here. However, I think a 5-second interval is too aggressive and can generate a lot of unnecessary requests to the broker(s). How about every 30-60s?
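For reference, a sketch of a slower, dedicated lag loop. Assignment, Committed, and QueryWatermarkOffsets are real confluent-kafka-go calls; the consumer/gauge wiring around them is illustrative:

package kafka

import (
	"context"
	"strconv"
	"time"

	ckafka "github.com/confluentinc/confluent-kafka-go/v2/kafka"
	"github.com/prometheus/client_golang/prometheus"
)

// lagLoop asks the broker for high watermarks on a 30s cadence and sets a
// per-partition lag gauge, independent of the 5s commit ticker.
func lagLoop(ctx context.Context, consumer *ckafka.Consumer, lagGauge *prometheus.GaugeVec) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			assigned, err := consumer.Assignment()
			if err != nil {
				continue
			}
			committed, err := consumer.Committed(assigned, 5000)
			if err != nil {
				continue
			}
			for _, tp := range committed {
				if tp.Offset < 0 {
					continue // no committed offset yet; see the doc comment above
				}
				_, high, err := consumer.QueryWatermarkOffsets(*tp.Topic, tp.Partition, 5000)
				if err != nil {
					continue // skip this interval rather than blocking
				}
				lagGauge.WithLabelValues(strconv.Itoa(int(tp.Partition))).Set(float64(high - int64(tp.Offset)))
			}
		}
	}
}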

}),
producerMessages: prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: Namespace,
	Subsystem: "producer",
Collaborator left a comment

Since we use "producer" in multiple places, recommend moving it to a constant for reuse.
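A sketch of the extraction; the constant name and the Name/Help strings are inferred from the metric list in the description, not copied from the PR:

const subsystemProducer = "producer" // reused wherever Subsystem: "producer" appears today

producerMessages: prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: Namespace,
	Subsystem: subsystemProducer,
	Name:      "messages_total",
	Help:      "Messages produced to Kafka, labelled by status.",
}, []string{"status"}),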

 }
 m.log.Debugw("failed processing block height", "height", h, "error", err)
-m.handleFailure(h)
+m.handleFailure(h, "process")
Collaborator left a comment

Looks "process" is a label value. Recommended to move it to metrics.go, so we have centralized label values for metrics.

select {
case <-ticker.C:
	om.commitLatestValidOffsets(dryRun)
	om.recordConsumerGroupLag(dryRun)
Collaborator left a comment

Looks like recordConsumerGroupLag() is a blocking call after commitLatestValidOffsets(). Shall we run recordConsumerGroupLag() in a separate goroutine so it's processed async?
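A sketch of that change at the call site; in practice it would also need shutdown handling and a guard against overlapping runs if a broker query outlives the tick:

case <-ticker.C:
	om.commitLatestValidOffsets(dryRun)
	go om.recordConsumerGroupLag(dryRun) // lag query no longer blocks the commit path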

	status = StatusError
}
m.clickHouseWrites.WithLabelValues(table, status).Inc()
m.clickHouseWriteDuration.WithLabelValues(table).Observe(durationSeconds)
Collaborator left a comment

Do we also want to add a status label to clickHouseWriteDuration to indicate success/failure?
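If so, a sketch of what the status-labelled histogram could look like; the Name, Help, and field names are illustrative:

clickHouseWriteDuration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: Namespace,
	Subsystem: "clickhouse",
	Name:      "write_duration_seconds",
	Help:      "ClickHouse write latency by table and outcome.",
}, []string{"table", "status"}),

// and the recording line above becomes:
m.clickHouseWriteDuration.WithLabelValues(table, status).Observe(durationSeconds)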

)

const (
	clickHouseTableBlocks = "raw_blocks"
Collaborator left a comment

We already have "raw_blocks" hardcoded in flags.go. Recommend defining a shared constant and reusing it both in this PR and in flags.go.
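A sketch of the shared constant; the package placement and names are illustrative:

package clickhouse // illustrative location for the shared name

// TableRawBlocks replaces the "raw_blocks" literals in flags.go and in the
// processor's clickHouseTable* consts.
const TableRawBlocks = "raw_blocks"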

}

cw.log.Debugw("block serialized, producing to kafka", "height", height, "bytes", len(bytes))
cw.metrics.ObserveKafkaMessageSize("produced", len(bytes))
Collaborator left a comment

Recommend moving "produced" to a label constant in metrics.go, same as the "consumed" case above.

 }
 m.log.Warnw("failed to mark processed", "height", h, "error", err)
-m.handleFailure(h)
+m.handleFailure(h, "mark_processed")
Collaborator left a comment

Recommend moving "mark_processed" to a constant in metrics.go as well.
