
feat(shard-distributor): record a smoothed per shard load in etcd #7431

Open
Theis-Mathiassen wants to merge 143 commits into cadence-workflow:master from AndreasHolt:heartbeat-shard-statistics

Conversation


@Theis-Mathiassen Theis-Mathiassen commented Nov 11, 2025

What changed?
We added functionality to record each shard's load as a moving average, where the weight given to a new data point depends on how recently the average was last updated.

Furthermore, reads now go through a cache to reduce read pressure on etcd.

Why?
This smooths the load input to the shard distributor, which is desirable because the load can change sporadically.
Saving each shard's load in etcd is also necessary to persist it (in case the handler crashes) and to make it available to every shard distributor instance.

How did you test it?
We created unit tests and also ran the change with the canary service:
TestRecordHeartbeatUpdatesShardStatistics:
This test verifies that when an executor sends a heartbeat with ShardLoad information for a shard, the ShardStatistics for that shard are correctly updated in the store, specifically the SmoothedLoad and LastUpdateTime. It also ensures that LastMoveTime remains unchanged if not explicitly updated.

TestRecordHeartbeatSkipsShardStatisticsWithNilReport:
This test ensures that if an executor's heartbeat includes a nil ShardStatusReport for a particular shard, the existing ShardStatistics for that shard are not updated or created. It also verifies that valid shard reports are processed correctly.

These tests were run with the command: ETCD=1 go test ./service/sharddistributor/...

Potential risks
None; this change only affects the shard distributor, which is not yet used in production.

Release notes
No release notes are needed, since the shard distributor is not yet used in production.

Documentation Changes
No, but documentation should likely be added later.

AndreasHolt and others added 27 commits October 20, 2025 14:05
AndreasHolt and others added 4 commits December 18, 2025 15:37

gitar-bot bot commented Feb 16, 2026

Code Review ⚠️ Changes requested 8 resolved / 11 findings

Smoothed load feature is well-structured with good test coverage, but applyExecutorData doesn't clear stale statistics on full refresh, which can lead to stale data for removed executors.

⚠️ Bug: Stale statistics not cleared during full cache refresh

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:394

applyExecutorData clears shardToExecutor, executorState, executorRevision, and shardOwners maps on full refresh, but does NOT clear executorStatistics.stats. When an executor is removed from etcd, a full refresh (triggered by assignment/metadata changes) will rebuild all other maps from scratch, but stale statistics entries for removed executors will persist in the cache.

This means GetExecutorStatistics can return data for executors that no longer exist, potentially causing incorrect load calculations when the shard distributor uses these stale values for assignment decisions.

The deleteStatistics path only runs for watch events (individual deletes), not during full refresh. A simple fix is to clear the statistics map before re-populating it, similar to how the other maps are handled.

Suggested fix
func (n *namespaceShardToExecutor) applyExecutorData(data map[string]executorData) {
	n.Lock()
	defer n.Unlock()

	// Clear the cache
	n.shardToExecutor = make(map[string]*store.ShardOwner)
	n.executorState = make(map[*store.ShardOwner][]string)
	n.executorRevision = make(map[string]int64)
	n.shardOwners = make(map[string]*store.ShardOwner)

	// Clear statistics to remove stale entries for deleted executors
	n.executorStatistics.lock.Lock()
	n.executorStatistics.stats = make(map[string]map[string]etcdtypes.ShardStatistics)
	n.executorStatistics.lock.Unlock()

	// ... re-populate the maps from data as in the existing implementation
}

💡 Bug: RecordHeartbeat writes stats regardless of load balancing mode

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:145

RecordHeartbeat unconditionally calls calcUpdatedStatistics and applyShardStatisticsUpdates on every heartbeat, while other callers like AssignShards (line 399) and GetState (line 308) gate statistics operations behind s.cfg.GetLoadBalancingMode(namespace) == types.LoadBalancingModeGREEDY.

This means that in NAIVE mode, every heartbeat will:

  1. Read executor statistics from the cache (triggering etcd reads on cache miss)
  2. Calculate smoothed load values
  3. Write statistics back to etcd

This creates unnecessary etcd I/O for namespaces that don't use load-based balancing. Consider adding the same mode check as used in AssignShards.

💡 Edge Case: CalculateSmoothedLoad doesn't sanitize prev, allowing NaN propagation

📄 service/sharddistributor/statistics/stats.go:8

The CalculateSmoothedLoad function sanitizes current for NaN/Inf values (lines 9-11), but does not sanitize prev. If a previously stored SmoothedLoad value was corrupted to NaN (e.g., from a serialization bug or manual etcd edit), the formula (1-alpha)*prev + alpha*current would produce NaN, which would then be persisted, permanently poisoning all future calculations for that shard.

Since this is persisted data read from etcd, defensive sanitization of prev would prevent NaN from becoming a permanent, irrecoverable state.

Suggested fix: Add the same NaN/Inf check for prev:

if math.IsNaN(prev) || math.IsInf(prev, 0) {
    prev = 0
}
✅ 8 resolved
Bug: prepareShardStatisticsUpdates fails for new executors with no stats

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:890
When a shard is being assigned to a new executor that has never had statistics stored in etcd, GetExecutorStatistics returns store.ErrExecutorNotFound. This error is not handled at line 892-895 — it propagates up and causes prepareShardStatisticsUpdates to fail entirely, preventing the shard assignment.

The same issue exists at line 866-869 for the old owner's statistics lookup — if the old owner's statistics were deleted from etcd but the shard cache still references that executor as the owner, this will also fail.

Suggested fix: Handle ErrExecutorNotFound the same way calcUpdatedStatistics does — initialize with an empty map:

newOwnerStats, err = s.shardCache.GetExecutorStatistics(ctx, namespace, executorID)
if err != nil {
    if errors.Is(err, store.ErrExecutorNotFound) {
        newOwnerStats = make(map[string]etcdtypes.ShardStatistics)
    } else {
        return nil, err
    }
}

Apply the same pattern for the old owner stats lookup at line 866.

Quality: GetExecutorStatistics returns generic error instead of typed

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:185 📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:187
In GetExecutorStatistics (namespaceshardcache.go, line 185), when the executor is not found even after a refresh, the function returns:

return nil, fmt.Errorf("could not get executor statistics, even after refresh")

This is an untyped/generic error string. The refreshExecutorStatisticsCache method returns store.ErrExecutorNotFound when the executor doesn't exist in etcd, but that error gets wrapped with "error from refresh: %w" on line 177. Meanwhile, the fallthrough case on line 185 creates a brand new error string with no sentinel error, making it impossible for callers to distinguish "not found" from other failures without string matching.

Suggested fix: Return store.ErrExecutorNotFound (or wrap it) on line 185 as well, so callers can use errors.Is() consistently:

return nil, fmt.Errorf("executor statistics not found after refresh: %w", store.ErrExecutorNotFound)
Bug: calcUpdatedStatistics fails for new executors with no prior stats

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:162 📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:222
When RecordHeartbeat is called for an executor that has never had shard statistics persisted to etcd (e.g., a brand-new executor sending its first heartbeat with load data), calcUpdatedStatistics calls s.shardCache.GetExecutorStatistics(), which ultimately calls refreshExecutorStatisticsCache. Since there's no statistics key in etcd for this executor, refreshExecutorStatisticsCache returns store.ErrExecutorNotFound (line 223 of namespaceshardcache.go). This error propagates up, causing the entire RecordHeartbeat call to fail.

This means the first heartbeat with shard load data for any new executor will error out instead of initializing statistics from scratch.

Impact: New executors cannot report shard load until statistics are pre-seeded by another code path (e.g., shard assignment via prepareShardStatisticsUpdates). If heartbeats with load data arrive before shard assignment, they will fail.

Suggested fix: In calcUpdatedStatistics, handle the ErrExecutorNotFound case from GetExecutorStatistics by using an empty map[string]etcdtypes.ShardStatistics{} instead of propagating the error. This allows the smoothing to initialize correctly with no prior history:

oldStats, err := s.shardCache.GetExecutorStatistics(ctx, namespace, executorID)
if err != nil {
    if errors.Is(err, store.ErrExecutorNotFound) {
        oldStats = make(map[string]etcdtypes.ShardStatistics)
    } else {
        return nil, err
    }
}

Note: TestRecordHeartbeatSkipsShardStatisticsWithNilReport expects RecordHeartbeat to succeed for a brand-new executor with no pre-seeded stats, which contradicts the current implementation and suggests this test would fail.

Performance: refreshExecutorStatisticsCache holds write lock during etcd I/O

📄 service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go:203
refreshExecutorStatisticsCache acquires n.executorStatistics.lock.Lock() (a write lock) at the very start of the method (line 203-204), then makes a network call to etcd (n.client.Get, line 211). This means the write lock is held for the entire duration of the etcd round-trip.

While the lock is held, all other goroutines trying to read statistics (via getStats which takes an RLock) or write statistics (via assignStatistics, deleteStatistics, or applyExecutorData) will be blocked. If etcd is slow or under load, this could cause significant contention and latency spikes across the system.

Suggested fix: Use a double-check locking pattern — first check under RLock, then fetch from etcd without holding any lock, then acquire the write lock only to update the map:

func (n *namespaceShardToExecutor) refreshExecutorStatisticsCache(ctx context.Context, executorID string) error {
    // Fetch from etcd without holding the lock
    statsKey := etcdkeys.BuildExecutorKey(n.etcdPrefix, n.namespace, executorID, etcdkeys.ExecutorShardStatisticsKey)
    resp, err := n.client.Get(ctx, statsKey)
    if err != nil {
        return fmt.Errorf("get executor shard statistics: %w", err)
    }

    n.executorStatistics.lock.Lock()
    defer n.executorStatistics.lock.Unlock()

    // Double-check: another goroutine may have populated it while we were fetching
    if _, ok := n.executorStatistics.stats[executorID]; ok {
        return nil
    }

    if len(resp.Kvs) > 0 {
        stats := make(map[string]etcdtypes.ShardStatistics)
        if err := common.DecompressAndUnmarshal(resp.Kvs[0].Value, &stats); err != nil {
            return fmt.Errorf("parse executor shard statistics: %w", err)
        }
        n.executorStatistics.stats[executorID] = stats
    } else {
        return store.ErrExecutorNotFound
    }
    return nil
}

This way, the lock is only held briefly for the map update, not during the network call.

Quality: Return nil explicitly instead of stale err variable

📄 service/sharddistributor/store/etcd/executorstore/etcdstore.go:180
On line 180 of etcdstore.go, calcUpdatedStatistics returns the err variable from the earlier GetExecutorStatistics call:

return []shardStatisticsUpdate{statsUpdate}, err

At this point err is guaranteed to be nil (the non-nil case returns early on line 164). This is fragile — if code is later added between lines 165 and 179 that reassigns err, the wrong error could be returned silently. Return nil explicitly for clarity:

return []shardStatisticsUpdate{statsUpdate}, nil

...and 3 more resolved from earlier reviews

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: PR description includes exact test command `ETCD=1 go test ./service/sharddistributor/...`. All required sections present with substantive content explaining problem and solution.
