kvserver: allow logs from callbacks up to 15 replicas per updateReplicationGauges
Previously, logs from the decommission nudger were not gated by a vmodule and
could become spammy when many replicas were decommissioned at a low nudger
frequency. This commit introduces a per-store budget, allowing logs from
callbacks for up to 15 replicas per updateReplicationGauges call.
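The budget described above can be sketched roughly as follows. This is a minimal illustration, not the actual kvserver code: the type logBudget and its methods are hypothetical names, and range IDs are modeled as plain int64 values.

```go
package main

import "fmt"

// logBudget is a hypothetical per-call budget (not the real kvserver type).
// It grants logging permission to at most a fixed number of replicas.
type logBudget struct {
	remaining int            // permissions still available in this call
	permitted map[int64]bool // range IDs already allowed to log
}

func newLogBudget(n int) *logBudget {
	return &logBudget{remaining: n, permitted: make(map[int64]bool)}
}

// allow reports whether callbacks for rangeID may log. A replica that was
// already granted permission keeps it; a new replica consumes one unit of
// the budget if any is left.
func (b *logBudget) allow(rangeID int64) bool {
	if b.permitted[rangeID] {
		return true
	}
	if b.remaining == 0 {
		return false
	}
	b.remaining--
	b.permitted[rangeID] = true
	return true
}

func main() {
	b := newLogBudget(15)
	allowed := 0
	for id := int64(1); id <= 20; id++ {
		if b.allow(id) {
			allowed++
		}
	}
	fmt.Println(allowed) // prints 15
}
```

In this sketch a permitted replica is never restricted again within the same budget, which mirrors the second drawback listed below.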
Drawbacks of this approach:
- Replicas are not visited in a sorted order, so each call may open the
floodgates for a different set of 15 replicas.
- Once a replica is permitted to log, its future logs from callbacks are not
restricted.
- If EnqueueProblemRangeInReplicateQueueInterval is set too low, the first two
drawbacks may become worse.
For the first drawback, we could consider visiting the replicas with WithReplicasInOrder. I'm not
sure about the overhead here since updateReplicationGauges is called
periodically when collecting metrics.
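To illustrate why ordering matters, here is a rough sketch of choosing the replicas that receive the budget after sorting by range ID. It does not use WithReplicasInOrder or any kvserver API; pickLoggable is a hypothetical helper and range IDs are modeled as int64 values. The sort is also where the extra overhead mentioned above would come from.

```go
package main

import (
	"fmt"
	"sort"
)

// pickLoggable is a hypothetical helper: it returns up to budget range IDs,
// lowest first, so the same replicas are chosen regardless of visit order.
func pickLoggable(rangeIDs []int64, budget int) []int64 {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]int64(nil), rangeIDs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	if len(sorted) > budget {
		sorted = sorted[:budget]
	}
	return sorted
}

func main() {
	// Two different visit orders yield the same selection.
	fmt.Println(pickLoggable([]int64{42, 7, 19, 3}, 2)) // prints [3 7]
	fmt.Println(pickLoggable([]int64{3, 19, 42, 7}, 2)) // prints [3 7]
}
```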
Here are the reasons that I think this approach is acceptable for now:
- onEnqueueResult is unlikely to be reinvoked for replicas already in the queue
unless they are processing or in purgatory (both are short-lived states we want
visibility into). Once processed, replicas are removed from the set.
onProcessResult should be called at most twice. For replicas merely waiting in
the queue, the callback is not invoked, since their priority should not be
actively updated.
- We could cap logging per maybeEnqueueProblemRange, but granting full logging
permission for each replica simplifies reasoning and gives complete visibility
for specific replicas.
- In practice, escalations show that slow decommissioning usually involves <15
ranges, and EnqueueProblemRangeInReplicateQueueInterval is typically large
(~15 minutes).