feat(shard-distributor): record a smoothed per shard load in etcd #7431
Theis-Mathiassen wants to merge 143 commits into cadence-workflow:master
Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
… is being reassigned in AssignShard Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…to not overload etcd's 128 max ops per txn Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…s txn and retry monotonically Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…ents Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…shard metrics, move out to staging to separate function Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
… And more idiomatic naming of collection vs singular type Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…ook more like executor key tests Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…ey in BuildShardKey, as we don't use it Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…o "statistics" Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…ollow conventions Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…eartbeat TTL Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…o ewma) Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…t heartbeat Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
…rdStatistics Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
Signed-off-by: Theis Randeris Mathiassen <theis@tmathiassen.dk>
…tistics Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>
service/sharddistributor/store/etcd/executorstore/shardcache/namespaceshardcache.go
…cessing events Signed-off-by: Theis Randeris Mathiassen <theis@tmathiassen.dk>
…ateChanges could skip statistics update Signed-off-by: Theis Randeris Mathiassen <theis@tmathiassen.dk>
…ssage Signed-off-by: Theis Randeris Mathiassen <theis@tmathiassen.dk>
	return nil
}

func (n *namespaceShardToExecutor) applyExecutorData(data map[string]executorData) {
⚠️ Bug: Stale statistics not cleared during full cache refresh
applyExecutorData clears shardToExecutor, executorState, executorRevision, and shardOwners maps on full refresh, but does NOT clear executorStatistics.stats. When an executor is removed from etcd, a full refresh (triggered by assignment/metadata changes) will rebuild all other maps from scratch, but stale statistics entries for removed executors will persist in the cache.
This means GetExecutorStatistics can return data for executors that no longer exist, potentially causing incorrect load calculations when the shard distributor uses these stale values for assignment decisions.
The deleteStatistics path only runs for watch events (individual deletes), not during full refresh. A simple fix is to clear the statistics map before re-populating it, similar to how the other maps are handled.
Suggested fix:
func (n *namespaceShardToExecutor) applyExecutorData(data map[string]executorData) {
	n.Lock()
	defer n.Unlock()

	// Clear the cache
	n.shardToExecutor = make(map[string]*store.ShardOwner)
	n.executorState = make(map[*store.ShardOwner][]string)
	n.executorRevision = make(map[string]int64)
	n.shardOwners = make(map[string]*store.ShardOwner)

	// Clear statistics to remove stale entries for deleted executors
	n.executorStatistics.lock.Lock()
	n.executorStatistics.stats = make(map[string]map[string]etcdtypes.ShardStatistics)
	n.executorStatistics.lock.Unlock()
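The stale-entry problem described above can be reproduced in isolation. The following is a minimal sketch, not the PR's code: statsCache, refresh, and has are hypothetical names, and the load is simplified to a float64 instead of etcdtypes.ShardStatistics. It builds a fresh map from the snapshot and swaps it in under the lock, which is a variant of the clear-then-repopulate approach in the suggested fix and has the same effect of dropping executors that are no longer present.

```go
package main

import (
	"fmt"
	"sync"
)

// statsCache is a hypothetical, simplified stand-in for the statistics cache:
// executor ID -> shard ID -> smoothed load.
type statsCache struct {
	mu    sync.RWMutex
	stats map[string]map[string]float64
}

// refresh replaces the whole cache with a copy of the snapshot, so executors
// missing from the snapshot disappear instead of lingering as stale entries.
func (c *statsCache) refresh(snapshot map[string]map[string]float64) {
	fresh := make(map[string]map[string]float64, len(snapshot))
	for executorID, shards := range snapshot {
		copied := make(map[string]float64, len(shards))
		for shardID, load := range shards {
			copied[shardID] = load
		}
		fresh[executorID] = copied
	}

	c.mu.Lock()
	c.stats = fresh
	c.mu.Unlock()
}

// has reports whether the cache holds statistics for the given executor.
func (c *statsCache) has(executorID string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.stats[executorID]
	return ok
}

func main() {
	c := &statsCache{}
	c.refresh(map[string]map[string]float64{
		"exec-a": {"shard-1": 1.5},
		"exec-b": {"shard-2": 0.5},
	})
	// exec-b has been removed from the store; the next full refresh must drop it.
	c.refresh(map[string]map[string]float64{
		"exec-a": {"shard-1": 1.7},
	})
	fmt.Println(c.has("exec-a"), c.has("exec-b"))
}
```

If refresh only added or overwrote entries, the second call would leave exec-b in the cache, which is the behavior the review comment warns about.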
What changed?
We added functionality to record the load of each shard as a moving average, where the weight of a new data point depends on how recently the average was last updated.
The statistics are also now served from a cache to reduce the read load on etcd.
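To illustrate what a time-dependent weight can look like, here is a minimal sketch of a time-aware exponentially weighted moving average. It is not taken from the PR: the shardStats type, the update and runExample functions, the exponential weighting formula, and the 30-second time constant tau are all assumptions made for illustration. Only the field names SmoothedLoad and LastUpdateTime come from the test descriptions in this PR.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// shardStats is a hypothetical stand-in for the PR's ShardStatistics type.
type shardStats struct {
	SmoothedLoad   float64
	LastUpdateTime time.Time
}

// tau is an assumed smoothing time constant; the PR's actual value is not shown.
const tau = 30 * time.Second

// update folds a new load sample into the moving average. The longer the
// gap since the last update, the more weight the new sample receives.
func update(prev shardStats, load float64, now time.Time) shardStats {
	if prev.LastUpdateTime.IsZero() {
		// First sample: there is no history to smooth against.
		return shardStats{SmoothedLoad: load, LastUpdateTime: now}
	}
	dt := now.Sub(prev.LastUpdateTime)
	if dt <= 0 {
		// No time has passed (or the clock went backwards): keep the old value.
		return prev
	}
	alpha := 1 - math.Exp(-dt.Seconds()/tau.Seconds())
	return shardStats{
		SmoothedLoad:   prev.SmoothedLoad + alpha*(load-prev.SmoothedLoad),
		LastUpdateTime: now,
	}
}

// runExample applies the same new sample after a short and a long gap.
func runExample() (first, afterShort, afterLong float64) {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	s := update(shardStats{}, 10, t0)
	short := update(s, 20, t0.Add(time.Second))
	long := update(s, 20, t0.Add(tau))
	return s.SmoothedLoad, short.SmoothedLoad, long.SmoothedLoad
}

func main() {
	first, afterShort, afterLong := runExample()
	fmt.Printf("first=%.2f afterShort=%.2f afterLong=%.2f\n", first, afterShort, afterLong)
}
```

In this sketch a sample arriving one second after the previous update barely moves the average, while the same sample arriving after a full time constant moves it most of the way toward the new value.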
Why?
This smooths the load input to the shard distributor, which is desirable because the load can change sporadically.
The load of each shard also has to be stored in etcd, both to persist it (in case the handler crashes) and to make it available to every shard distributor instance.
How did you test it?
We added unit tests and also ran the change with the canary service:
TestRecordHeartbeatUpdatesShardStatistics:
This test verifies that when an executor sends a heartbeat with ShardLoad information for a shard, the ShardStatistics for that shard are correctly updated in the store, specifically the SmoothedLoad and LastUpdateTime. It also ensures that LastMoveTime remains unchanged if not explicitly updated.
TestRecordHeartbeatSkipsShardStatisticsWithNilReport:
This test ensures that if an executor's heartbeat includes a nil ShardStatusReport for a particular shard, the existing ShardStatistics for that shard are not updated or created. It also verifies that valid shard reports are processed correctly.
These tests were run with the command:
ETCD=1 go test ./service/sharddistributor/...
Potential risks
None, since it is for the shard distributor, which is not utilized in production yet.
Release notes
Not needed, since it is for the shard distributor, which is not utilized in production yet.
Documentation Changes
None, though some documentation should probably be written later.