|
| 1 | +# Allocator |
| 2 | + |
| 3 | +### Replica Scanner And Base Queue |
| 4 | + |
| 5 | +- Every store has a replica scanner that periodically scans and iterates over every replica of stores. It calls every base queues’s [`MaybeAddAsync`](https://github.com/cockroachdb/cockroach/blob/308665f4fb5176f8f7dcefdbfe8b9eee669e9f5a/pkg/kv/kvserver/queue.go#L635) with every replica. Every base queue’s `MaybeAdd` then calls into base queue’s `replicaCanBeProcessed`, shouldQueue. If `shouldQueue` returns true, it [adds](https://github.com/cockroachdb/cockroach/blob/308665f4fb5176f8f7dcefdbfe8b9eee669e9f5a/pkg/kv/kvserver/queue.go#L763) the replica to the queue with the priority given by `shouldQueue`. Skips if `shouldQueue` returns false. |
| 6 | +- Every base queue implements `bq.processLoop` which pops items added by the replica scanner and calls into `bq.processOneAsyncAndReleaseSem`. `bq.processOneAsyncAndReleaseSem` then calls `bq.processReplica` async with a time out. `bq.processReplica` then calls `bq.process`. If replicas [fail](https://github.com/cockroachdb/cockroach/blob/308665f4fb5176f8f7dcefdbfe8b9eee669e9f5a/pkg/kv/kvserver/queue.go#L1141-L1143) to be processed, they will be sent to the purgatory queue. |
| 7 | +- Base queue includes: consistency queue, lease queue, merge queue, mvcc gc queue, raft log queue, raft snapshot queue, replica gc queue, replicate queue, split queue, timeseries maintenance queue. |
| 8 | + |
| 9 | + |
| 10 | +## Replicate Queue |
| 11 | +### High Level |
| 12 | +1. `ReplicateQueue.shouldQueue` calls into `ReplicaPlanner.ShouldPlanChange` which decides whether the replica scanner adds the replica to the queue. |
| 13 | +2. Replica scanner adds replicas that need to be queued to the base queue. |
| 14 | +3. Inside `replicateQueue.process` with one replica, a) grab the range token b) calls into `rq.processOneChange`. |
| 15 | +4. `processOneChange` calls into `ReplicaPlanner.PlanOneChange` which plans a replicate change or an error. `PlanOneChange` calls into `allocator.ComputeAction` which returns an enum action. This enum action tells replicate queue to consider add / remove / replace replicas. `ComputeAction` gives an action that should be done. |
| 16 | + The separation is here because `planner.ShouldPlanChange` calls into `ComputeAction` as well, and we want a cheap check for the queue. Note that actions returned from `ComputeAction` in `ShouldPlanChange` v.s. `PlanOneChange` may be different. |
| 17 | +5. Replica planner then switches and computes what replicate changes need to be done based on the action. |
| 18 | + |
| 19 | +### shouldQueue: whether queue replicas to replicate queue |
| 20 | +1. `shouldQueue` takes a replica and calls into [`ReplicaPlanner.ShouldPlanChange`](https://github.com/cockroachdb/cockroach/blob/5c2a22f84f0581df39239bc890b9d7d08c6e020a/pkg/kv/kvserver/allocator/plan/replicate.go#L145) which then calls into the allocator via `rp.allocator.ComputeAction`. |
| 21 | +2. [`ComputeAction`](https://github.com/cockroachdb/cockroach/blob/ad4d1478461f2edfa49d1cadd34cad910f5f5b10/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L913) computes an action that needs to be done on this replica. |
| 22 | ++ It first checks if a repair action is needed: under-replicated, quorum, dead/decommissioning replicas, over-replicated. Note that this does not include constraint repair. |
| 23 | ++ If no repair needs to be done on this range, it would fall back and return `AllocatorConsiderRebalance` by default. |
| 24 | +3. After ComputeAction gives `ReplicaPlanner.ShouldPlanChange` the action, it would return true if a repair action is needed. Otherwise, it checks if a rebalance target can be found via [Allocator.RebalanceTarget](https://github.com/cockroachdb/cockroach/blob/ad4d1478461f2edfa49d1cadd34cad910f5f5b10/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L1784). We will go into the details of `RebalanceTarget` below. |
| 25 | + |
| 26 | +### processOneChange: plan and process a change on the replica |
| 27 | +1. [`replicateQueue.processOneChange`](https://github.com/cockroachdb/cockroach/blob/a96cb747d2043d2959c1d44e29064c57d7aacde8/pkg/kv/kvserver/replicate_queue.go#L941) happens when the base queue pops the replica off the queue and actually start processing the replica. |
| 28 | +2. `replicateQueue.processOneChange` calls into `ReplicaPlanner.PlanOneChange` which then calls into `ComputeAction` as well. Based on the action, it would call into ReplicaPlanner's helper functions (such as [`addOrReplaceVoters`](https://github.com/cockroachdb/cockroach/blob/5c2a22f84f0581df39239bc890b9d7d08c6e020a/pkg/kv/kvserver/allocator/plan/replicate.go#L355)) to compute the exact allocation operation out (such as a lease transfer, change of replicas). |
| 29 | +3. For now, lets focus on how [`considerRebalance`](https://github.com/cockroachdb/cockroach/blob/5c2a22f84f0581df39239bc890b9d7d08c6e020a/pkg/kv/kvserver/allocator/plan/replicate.go#L773). As a reminder, this action is returned by default for every replica when no other repair action applies. Note that this action does help repair zone constraints. |
| 30 | + |
| 31 | +### considerRebalance: find one rebalancing target for any of the existing replica |
| 32 | +1. `ReplicaPlanner` first [picks](https://github.com/cockroachdb/cockroach/blob/5c2a22f84f0581df39239bc890b9d7d08c6e020a/pkg/kv/kvserver/allocator/plan/replicate.go#L789) which scorer options to be used for the allocator decision. It uses the `RangeCountScorerOptions` by default and `ScatterScorerOptions` if this action originates from a scatter operation. [ScorerOptions](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L266) define the heuristics used to evaluate candidate replicas, helping the allocator guide decisions toward specific goals such as range count convergence. |
| 33 | +2. `ReplicaPlanner` then calls into the allocator to find the best rebalancing target via [`Allocator.RebalanceTarget`](https://github.com/cockroachdb/cockroach/blob/ad4d1478461f2edfa49d1cadd34cad910f5f5b10/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L1784). |
| 34 | +3. `Allocator.RebalanceTarget` first [builds](https://github.com/cockroachdb/cockroach/blob/ad4d1478461f2edfa49d1cadd34cad910f5f5b10/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L1818-L1820) constraint checkers, which are passed to rankedCandidateListForRebalancing to determine which existing replicas / stores are valid or required based on the range’s constraints. |
| 35 | +4. `Allocator.RebalanceTarget` then calls into [`rankedCandidateListForRebalancing`](https://github.com/cockroachdb/cockroach/blob/ad4d1478461f2edfa49d1cadd34cad910f5f5b10/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L1848) to find a list of rebalance options. Each rebalance option consists of an existing replica and a set of candidate stores that can replace it. |
| 36 | +- `rankedCandidateListForRebalancing` first [builds](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1513) a map of existing store ids to their corresponding candidate entries - annotated with attributes such as valid, necessary, disk fullness, I/O overload, and diversity score. |
| 37 | +- For each existing replica store, it then constructs an [`equivalence class`](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1579), containing the store itself along with a list of comparable candidates. |
| 38 | + - This is how each equivalence class is constraucted: for every existing replica, all other stores are considered as potential candidates. During this iteration, some [filtering](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1650) is applied based on the state of other existing replicas. If a candidate store is [not worse](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1671) than the existing store, it is added to the list of comparable candidates. From all considered candidates, the best set is selected, and an equivalance class is constructed using existing replica and these best candidates. Note that at this stage, only attributes such as valid, necessary (constraint conformance), diversity score, and disk fullness are considered. At this point, range count convergence has not been considered yet. |
| 39 | + |
| 40 | + - (TODO(wenyihu6 during review): why ioOverloaded is always false here there is a comment https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1550-L1553 but idu). |
| 41 | + |
| 42 | + - (TODO(wenyihu6 during review) looks like at this stage we do not exclude other existing replicas stores - we filter it out below when iterating over comparable candidates but why) |
| 43 | + |
| 44 | +- At this point, we have an equivalence class for every existing replica. We then examine candidates’ range count statistics within each class to help break ties. |
| 45 | + - For each equivalence class, ScorerOptions [populate](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1775) the convergence score, balance score, and range count for existing store and potential candidates. The candidate set is then further filtered to include only those [better](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1792) than the existing store. |
| 46 | +5. Using the results from rankedCandidateListForRebalancing, `Allocator.RebalanceTarget` iterates over all rebalanceOptions to [identify](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1826) the option that offers the largest improvement. As a reminder, each `rebalanceOption` consists of an existing replica and a list of candidate stores. For each option, it selects the best candidate set, randomly [chooses](https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1821) one candidate, and then determines the overall best rebalance option by comparing all rebalanceOptions. |
0 commit comments