Skip to content

chore(shard-manager): Add metrics to track etcd watch events#7586

Open
gazi-yestemirova wants to merge 10 commits intocadence-workflow:masterfrom
gazi-yestemirova:watch_metrics
Open

chore(shard-manager): Add metrics to track etcd watch events#7586
gazi-yestemirova wants to merge 10 commits intocadence-workflow:masterfrom
gazi-yestemirova:watch_metrics

Conversation

@gazi-yestemirova
Copy link
Contributor

@gazi-yestemirova gazi-yestemirova commented Jan 12, 2026

What changed?
This PR adds metrics to etcd watch handlers in shard distributor, to track watch event throughput, processing latency, and consumer lag.

Why?
Shard distributor uses etcd watch streams to detect changes in executor state and trigger cache refreshes or rebalancing. Previously, we had no observability on these watch events, making it difficult to:

  • Detect if watches are falling behind etcd updates,
  • Measure processing performance
  • Alert on watch-related issues before they impact the system

How did you test it?
Started local Cadence with Prometheus, triggered watch events, verified shard_distributor_watch_* metrics appear in /metrics endpoint

Potential risks
N/A

Release notes
N/A

Documentation Changes
N/A

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@gitar-bot
Copy link

gitar-bot bot commented Feb 17, 2026

🔍 CI failure analysis for dc19d97: Integration test runtime panic at etcdstore.go:283 - nil metricsClient. The test file namespaceshardcache_test.go was fixed, but etcdstore_test.go's createStore() helper still needs MetricsClient initialization.

Issue: Integration test runtime panic - nil metricsClient ✅ REQUIRES FIX

Affects: Golang integration test with etcd (63887309152)

Status: Test compilation errors in namespaceshardcache_test.go have been successfully fixed ✅ in commit dc19d97. However, a runtime panic now occurs during test execution.

Root Cause: The createStore() helper function in etcdstore_test.go doesn't initialize the MetricsClient field, leaving s.metricsClient as nil. When SubscribeToExecutorStatusChanges() is called during tests, it tries to access s.metricsClient.Scope() at line 283, causing a nil pointer dereference.

Panic from CI:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x13220d5]

goroutine 163 [running]:
github.com/uber/cadence/service/sharddistributor/store/etcd/executorstore.(*executorStoreImpl).SubscribeToExecutorStatusChanges.func1()
	/cadence/service/sharddistributor/store/etcd/executorstore/etcdstore.go:283 +0x115

FAIL	github.com/uber/cadence/service/sharddistributor/store/etcd/executorstore	0.182s

Location: Line 283 in etcdstore.go:

metricsScope := s.metricsClient.Scope(metrics.ShardDistributorWatchScope, metrics.NamespaceTag(namespace))

Test Success: The namespaceshardcache_test.go tests pass successfully:

ok  	github.com/uber/cadence/service/sharddistributor/store/etcd/executorstore/shardcache	1.468s	coverage: 81.9% of statements

This confirms that the fixes in commit dc19d97 correctly resolved the compilation errors. The remaining issue is isolated to etcdstore_test.go.

Required Fix

Update etcdstore_test.go's createStore() helper:

Add MetricsClient to the ExecutorStoreParams struct:

func createStore(t *testing.T, tc *testhelper.StoreTestCluster) store.Store {
	t.Helper()

	etcdConfig, err := NewETCDConfig(tc.LeaderCfg)
	require.NoError(t, err)

	store, err := NewStore(ExecutorStoreParams{
		Client:        tc.Client,
		ETCDConfig:    etcdConfig,
		Lifecycle:     fxtest.NewLifecycle(t),
		Logger:        testlogger.New(t),
		MetricsClient: metrics.NewNoopMetricsClient(), // ADD THIS LINE
	})
	require.NoError(t, err)
	return store
}

Progress Update

Fixed in dc19d97:

  • Test compilation errors in namespaceshardcache_test.go (5 locations)
  • Consumer lag metric calculation bug (both watch locations)
  • Tests in namespaceshardcache_test.go now pass

Still needs fix:

  • etcdstore_test.go's createStore() helper missing MetricsClient initialization
  • This is the last remaining test issue blocking integration tests
Code Review ✅ Approved 1 resolved / 1 findings

Clean metrics instrumentation PR. The previous consumer lag finding has been addressed — the metric now correctly uses Header.Revision - lastEvent.Kv.ModRevision to measure actual staleness of processed events rather than inter-response gap.

✅ 1 resolved
Bug: Consumer lag metric measures inter-response gap, not actual lag

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:302-304 📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:207-209
The consumer lag calculation was changed from the correct formula to one that measures the wrong thing.

Previous (correct) formula:

lag := watchResp.Header.Revision - lastEvent.Kv.ModRevision

This measures the true consumer lag: Header.Revision is the current etcd cluster revision, and lastEvent.Kv.ModRevision is the revision at which the most recent event was created. The gap reflects how far behind the consumer is from the current cluster state.

New (incorrect) formula:

lag := watchResp.Header.Revision - lastProcessedRevision

Where lastProcessedRevision is set to watchResp.Header.Revision from the previous response. This measures the revision distance between consecutive watch responses, not how far behind the consumer is.

Impact:

  • A fully caught-up consumer will still report non-zero lag (the number of revisions between consecutive deliveries), creating noise
  • The metric no longer tells operators how far behind the watch stream is from the current etcd state
  • Alerts based on this metric will be unreliable

Suggested fix: Restore the original lag calculation:

if len(watchResp.Events) > 0 {
    lastEvent := watchResp.Events[len(watchResp.Events)-1]
    scope.UpdateGauge(metrics.ShardDistributorWatchConsumerLag, float64(watchResp.Header.Revision - lastEvent.Kv.ModRevision))
}
Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: PR description includes all required sections with substantive content; testing section describes concrete setup, actions, and verification

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant