chore(shard-manager): Add metrics to track etcd watch events #7586
gazi-yestemirova wants to merge 10 commits into cadence-workflow:master
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
🔍 CI failure analysis for dc19d97: Integration test runtime panic at etcdstore.go:283 caused by a nil metricsClient. The test file namespaceshardcache_test.go was fixed, but the createStore() helper in etcdstore_test.go still needs MetricsClient initialization.

Issue: Integration test runtime panic, nil metricsClient ✅ REQUIRES FIX
Affects: Golang integration test with etcd (63887309152)
Location: line 283 in etcdstore.go:
	metricsScope := s.metricsClient.Scope(metrics.ShardDistributorWatchScope, metrics.NamespaceTag(namespace))

The fixes in commit dc19d97 correctly resolved the compilation errors. The remaining issue is isolated to the createStore() helper in etcdstore_test.go.

Required Fix
Update etcdstore_test.go's createStore() helper to set MetricsClient:

	func createStore(t *testing.T, tc *testhelper.StoreTestCluster) store.Store {
		t.Helper()
		etcdConfig, err := NewETCDConfig(tc.LeaderCfg)
		require.NoError(t, err)
		store, err := NewStore(ExecutorStoreParams{
			Client:        tc.Client,
			ETCDConfig:    etcdConfig,
			Lifecycle:     fxtest.NewLifecycle(t),
			Logger:        testlogger.New(t),
			MetricsClient: metrics.NewNoopMetricsClient(), // ADD THIS LINE
		})
		require.NoError(t, err)
		return store
	}

Progress Update
✅ Fixed in dc19d97: namespaceshardcache_test.go
❌ Still needs fix: createStore() in etcdstore_test.go
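The panic described in this analysis is the usual result of calling a method on a nil interface value. The following is a minimal, self-contained sketch of that failure mode and of how a no-op client avoids it. The names `metricsClient`, `noopMetricsClient`, `store`, and `onWatchEvent` are hypothetical stand-ins, not the Cadence types.

```go
package main

import "fmt"

// metricsClient is a hypothetical, minimal stand-in for a metrics client interface.
type metricsClient interface {
	IncCounter(name string)
}

// noopMetricsClient discards every metric, which makes it convenient in tests.
type noopMetricsClient struct{}

func (noopMetricsClient) IncCounter(string) {}

// store mirrors the shape of a component that emits a metric per watch event.
type store struct {
	metrics metricsClient
}

// onWatchEvent panics when s.metrics is a nil interface value.
func (s *store) onWatchEvent() {
	s.metrics.IncCounter("watch_events")
}

// panics reports whether f panicked.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	unset := &store{}                             // metrics client never initialized
	wired := &store{metrics: noopMetricsClient{}} // no-op client supplied

	fmt.Println(panics(unset.onWatchEvent)) // true
	fmt.Println(panics(wired.onWatchEvent)) // false
}
```

An alternative design would be for the constructor to reject or default a missing client; whether that suits this store is a judgment call for the PR author.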
Code Review ✅ Approved (1 resolved / 1 findings)
Clean metrics instrumentation PR. The previous consumer lag finding has been addressed.
✅ 1 resolved
✅ Bug: Consumer lag metric measures inter-response gap, not actual lag
What changed?
This PR adds metrics to the etcd watch handlers in the shard distributor to track watch event throughput, processing latency, and consumer lag.
Why?
The shard distributor uses etcd watch streams to detect changes in executor state and trigger cache refreshes or rebalancing. Previously, we had no observability into these watch events, making it difficult to:
How did you test it?
Started a local Cadence instance with Prometheus, triggered watch events, and verified that shard_distributor_watch_* metrics appear at the /metrics endpoint.
Potential risks
N/A
Release notes
N/A
Documentation Changes
N/A