
feat(shard-distributor): record a smoothed per shard load in etcd #7431

Open
Theis-Mathiassen wants to merge 143 commits into cadence-workflow:master from AndreasHolt:heartbeat-shard-statistics

Conversation


@Theis-Mathiassen Theis-Mathiassen commented Nov 11, 2025

What changed?
We added functionality to record each shard's load as a moving average, where the weight given to a new data point depends on how recently the average was last updated.

Furthermore, reads now go through a cache to reduce read pressure on etcd.

Why?
This smooths the load input to the shard distributor, which is desirable because the load can change sporadically.
Saving each shard's load in etcd is also necessary to persist it (in case the handler crashes) and to make it available to every shard distributor instance.

How did you test it?
We created unit tests and also ran the change with the canary service:
TestRecordHeartbeatUpdatesShardStatistics:
This test verifies that when an executor sends a heartbeat with ShardLoad information for a shard, the ShardStatistics for that shard are correctly updated in the store, specifically the SmoothedLoad and LastUpdateTime. It also ensures that LastMoveTime remains unchanged if not explicitly updated.

TestRecordHeartbeatSkipsShardStatisticsWithNilReport:
This test ensures that if an executor's heartbeat includes a nil ShardStatusReport for a particular shard, the existing ShardStatistics for that shard are not updated or created. It also verifies that valid shard reports are processed correctly.

These tests were run with the command: ETCD=1 go test ./service/sharddistributor/...

Potential risks
None; this change only affects the shard distributor, which is not yet used in production.

Release notes
No release notes are needed, since the shard distributor is not yet used in production.

Documentation Changes
No, but documentation should likely be added later.

AndreasHolt and others added 27 commits October 20, 2025 14:05
AndreasHolt and others added 4 commits December 18, 2025 15:37

gitar-bot bot commented Feb 16, 2026

Code Review ⚠️ Changes requested 8 resolved / 11 findings

Smoothed load feature is well-structured with good test coverage, but applyExecutorData doesn't clear stale statistics on full refresh, which can lead to stale data for removed executors.

⚠️ Bug: Stale statistics not cleared during full cache refresh

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:394

applyExecutorData clears shardToExecutor, executorState, executorRevision, and shardOwners maps on full refresh, but does NOT clear executorStatistics.stats. When an executor is removed from etcd, a full refresh (triggered by assignment/metadata changes) will rebuild all other maps from scratch, but stale statistics entries for removed executors will persist in the cache.

This means GetExecutorStatistics can return data for executors that no longer exist, potentially causing incorrect load calculations when the shard distributor uses these stale values for assignment decisions.

The deleteStatistics path only runs for watch events (individual deletes), not during full refresh. A simple fix is to clear the statistics map before re-populating it, similar to how the other maps are handled.

Suggested fix
func (n *namespaceShardToExecutor) applyExecutorData(data map[string]executorData) {
	n.Lock()
	defer n.Unlock()

	// Clear the cache
	n.shardToExecutor = make(map[string]*store.ShardOwner)
	n.executorState = make(map[*store.ShardOwner][]string)
	n.executorRevision = make(map[string]int64)
	n.shardOwners = make(map[string]*store.ShardOwner)

	// Clear statistics to remove stale entries for deleted executors
	n.executorStatistics.lock.Lock()
	n.executorStatistics.stats = make(map[string]map[string]etcdtypes.ShardStatistics)
	n.executorStatistics.lock.Unlock()

	// ... re-populate the maps from data as in the existing implementation
}

💡 Bug: RecordHeartbeat writes stats regardless of load balancing mode

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:145

RecordHeartbeat unconditionally calls calcUpdatedStatistics and applyShardStatisticsUpdates on every heartbeat, while other callers like AssignShards (line 399) and GetState (line 308) gate statistics operations behind s.cfg.GetLoadBalancingMode(namespace) == types.LoadBalancingModeGREEDY.

This means that in NAIVE mode, every heartbeat will:

  1. Read executor statistics from the cache (triggering etcd reads on cache miss)
  2. Calculate smoothed load values
  3. Write statistics back to etcd

This creates unnecessary etcd I/O for namespaces that don't use load-based balancing. Consider adding the same mode check as used in AssignShards.

💡 Edge Case: CalculateSmoothedLoad doesn't sanitize prev, allowing NaN propagation

📄 service/sharddistributor/statistics/stats.go:8

The CalculateSmoothedLoad function sanitizes current for NaN/Inf values (lines 9-11), but does not sanitize prev. If a previously stored SmoothedLoad value was corrupted to NaN (e.g., from a serialization bug or manual etcd edit), the formula (1-alpha)*prev + alpha*current would produce NaN, which would then be persisted, permanently poisoning all future calculations for that shard.

Since this is persisted data read from etcd, defensive sanitization of prev would prevent NaN from becoming a permanent, irrecoverable state.

Suggested fix: Add the same NaN/Inf check for prev:

if math.IsNaN(prev) || math.IsInf(prev, 0) {
    prev = 0
}
✅ 8 resolved
Bug: prepareShardStatisticsUpdates fails for new executors with no stats

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:890
When a shard is being assigned to a new executor that has never had statistics stored in etcd, GetExecutorStatistics returns store.ErrExecutorNotFound. This error is not handled at line 892-895 — it propagates up and causes prepareShardStatisticsUpdates to fail entirely, preventing the shard assignment.

The same issue exists at line 866-869 for the old owner's statistics lookup — if the old owner's statistics were deleted from etcd but the shard cache still references that executor as the owner, this will also fail.

Suggested fix: Handle ErrExecutorNotFound the same way calcUpdatedStatistics does — initialize with an empty map:

newOwnerStats, err = s.shardCache.GetExecutorStatistics(ctx, namespace, executorID)
if err != nil {
    if errors.Is(err, store.ErrExecutorNotFound) {
        newOwnerStats = make(map[string]etcdtypes.ShardStatistics)
    } else {
        return nil, err
    }
}

Apply the same pattern for the old owner stats lookup at line 866.

Quality: GetExecutorStatistics returns generic error instead of typed

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:185 📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:187
In GetExecutorStatistics (namespaceshardcache.go, line 185), when the executor is not found even after a refresh, the function returns:

return nil, fmt.Errorf("could not get executor statistics, even after refresh")

This is an untyped/generic error string. The refreshExecutorStatisticsCache method returns store.ErrExecutorNotFound when the executor doesn't exist in etcd, but that error gets wrapped with "error from refresh: %w" on line 177. Meanwhile, the fallthrough case on line 185 creates a brand new error string with no sentinel error, making it impossible for callers to distinguish "not found" from other failures without string matching.

Suggested fix: Return store.ErrExecutorNotFound (or wrap it) on line 185 as well, so callers can use errors.Is() consistently:

return nil, fmt.Errorf("executor statistics not found after refresh: %w", store.ErrExecutorNotFound)
Bug: calcUpdatedStatistics fails for new executors with no prior stats

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:162 📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:222
When RecordHeartbeat is called for an executor that has never had shard statistics persisted to etcd (e.g., a brand-new executor sending its first heartbeat with load data), calcUpdatedStatistics calls s.shardCache.GetExecutorStatistics(), which ultimately calls refreshExecutorStatisticsCache. Since there's no statistics key in etcd for this executor, refreshExecutorStatisticsCache returns store.ErrExecutorNotFound (line 223 of namespaceshardcache.go). This error propagates up, causing the entire RecordHeartbeat call to fail.

This means the first heartbeat with shard load data for any new executor will error out instead of initializing statistics from scratch.

Impact: New executors cannot report shard load until statistics are pre-seeded by another code path (e.g., shard assignment via prepareShardStatisticsUpdates). If heartbeats with load data arrive before shard assignment, they will fail.

Suggested fix: In calcUpdatedStatistics, handle the ErrExecutorNotFound case from GetExecutorStatistics by using an empty map[string]etcdtypes.ShardStatistics{} instead of propagating the error. This allows the smoothing to initialize correctly with no prior history:

oldStats, err := s.shardCache.GetExecutorStatistics(ctx, namespace, executorID)
if err != nil {
    if errors.Is(err, store.ErrExecutorNotFound) {
        oldStats = make(map[string]etcdtypes.ShardStatistics)
    } else {
        return nil, err
    }
}

Note: TestRecordHeartbeatSkipsShardStatisticsWithNilReport expects RecordHeartbeat to succeed for a brand-new executor with no pre-seeded stats, which contradicts the current implementation and suggests this test would fail.

Performance: refreshExecutorStatisticsCache holds write lock during etcd I/O

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:203
refreshExecutorStatisticsCache acquires n.executorStatistics.lock.Lock() (a write lock) at the very start of the method (line 203-204), then makes a network call to etcd (n.client.Get, line 211). This means the write lock is held for the entire duration of the etcd round-trip.

While the lock is held, all other goroutines trying to read statistics (via getStats which takes an RLock) or write statistics (via assignStatistics, deleteStatistics, or applyExecutorData) will be blocked. If etcd is slow or under load, this could cause significant contention and latency spikes across the system.

Suggested fix: Use a double-check locking pattern — first check under RLock, then fetch from etcd without holding any lock, then acquire the write lock only to update the map:

func (n *namespaceShardToExecutor) refreshExecutorStatisticsCache(ctx context.Context, executorID string) error {
    // Fetch from etcd without holding the lock
    statsKey := etcdkeys.BuildExecutorKey(n.etcdPrefix, n.namespace, executorID, etcdkeys.ExecutorShardStatisticsKey)
    resp, err := n.client.Get(ctx, statsKey)
    if err != nil {
        return fmt.Errorf("get executor shard statistics: %w", err)
    }

    n.executorStatistics.lock.Lock()
    defer n.executorStatistics.lock.Unlock()

    // Double-check: another goroutine may have populated it while we were fetching
    if _, ok := n.executorStatistics.stats[executorID]; ok {
        return nil
    }

    if len(resp.Kvs) > 0 {
        stats := make(map[string]etcdtypes.ShardStatistics)
        if err := common.DecompressAndUnmarshal(resp.Kvs[0].Value, &stats); err != nil {
            return fmt.Errorf("parse executor shard statistics: %w", err)
        }
        n.executorStatistics.stats[executorID] = stats
    } else {
        return store.ErrExecutorNotFound
    }
    return nil
}

This way, the lock is only held briefly for the map update, not during the network call.

Quality: Return nil explicitly instead of stale err variable

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:180
On line 180 of etcdstore.go, calcUpdatedStatistics returns the err variable from the earlier GetExecutorStatistics call:

return []shardStatisticsUpdate{statsUpdate}, err

At this point err is guaranteed to be nil (the non-nil case returns early on line 164). This is fragile — if code is later added between lines 165 and 179 that reassigns err, the wrong error could be returned silently. Return nil explicitly for clarity:

return []shardStatisticsUpdate{statsUpdate}, nil

...and 3 more resolved from earlier reviews

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: PR description includes exact test command `ETCD=1 go test ./service/sharddistributor/...`. All required sections present with substantive content explaining problem and solution.
