# Allocation

### Core Components

The `DesiredBalanceShardsAllocator` runs shard allocation decisions. It uses the `DesiredBalanceComputer` to produce `DesiredBalance`
instances for the cluster based on the latest cluster changes (nodes added or removed, indices created or removed, load, etc). The
`DesiredBalanceReconciler` is then invoked to choose the next steps to take to move the cluster from the current shard allocation to the
latest computed `DesiredBalance` shard allocation. The `DesiredBalanceReconciler` applies its changes to a copy of the `RoutingNodes`,
which is then published in a cluster state update that reaches the data nodes and starts the individual shard creation, recovery, and move
work.
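
To make the division of labor concrete, here is a minimal sketch of the compute-then-reconcile loop. All names and types below are
illustrative stand-ins, not the real Elasticsearch APIs, which operate on `RoutingNodes`, `ClusterState`, deciders, and weights rather
than bare shard-to-node maps.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in types for a shard-to-node assignment table.
record ShardAssignments(Map<String, String> shardToNode) {}

record DesiredBalance(Map<String, String> shardToNode) {}

class AllocationFlowSketch {

    // Stand-in for DesiredBalanceComputer: computes the full target allocation.
    // (Here it trivially keeps the current assignments; the real computer
    // simulates allocation rounds using deciders and node weights.)
    DesiredBalance computeDesiredBalance(ShardAssignments current) {
        return new DesiredBalance(Map.copyOf(current.shardToNode()));
    }

    // Stand-in for DesiredBalanceReconciler: applies at most one change per
    // call, mimicking the incremental steps published as cluster state updates.
    ShardAssignments reconcileOneStep(ShardAssignments current, DesiredBalance desired) {
        for (Map.Entry<String, String> entry : desired.shardToNode().entrySet()) {
            String shard = entry.getKey();
            String targetNode = entry.getValue();
            if (!targetNode.equals(current.shardToNode().get(shard))) {
                Map<String, String> next = new HashMap<>(current.shardToNode());
                next.put(shard, targetNode);        // start one shard move
                return new ShardAssignments(next);  // becomes a cluster state update
            }
        }
        return current;  // already at the desired balance
    }
}
```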
The `DesiredBalanceReconciler` is throttled by cluster settings, such as the maximum numbers of concurrent shard moves and recoveries per
cluster and per node: this is why it makes incremental changes to the cluster's shard allocation, publishing them via cluster state
updates. The `DesiredBalanceShardsAllocator` is the endpoint for reroute requests, which trigger immediate reconciliation but only
asynchronous requests to the `DesiredBalanceComputer`, via the `ContinuousComputation` component. Cluster state changes that affect shard
balancing (for example, index deletion) all call a reroute method that reaches the `DesiredBalanceShardsAllocator`, which runs
reconciliation and queues a request for the `DesiredBalanceComputer`, eventually leading to a new desired balance computation and further
reconciliation actions. Asynchronous completion of a new `DesiredBalance` also invokes a reconciliation action, as do cluster state updates
that complete shard moves/recoveries (unthrottling the next shard move/recovery).
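
As an illustration of why reconciliation is incremental, the hypothetical helper below caps how many shard movements may start in each
reconciliation round, in the spirit of real throttling settings such as `cluster.routing.allocation.cluster_concurrent_rebalance` and
`cluster.routing.allocation.node_concurrent_recoveries`; the class and method here are invented for the example.

```java
import java.util.List;

// Hypothetical helper showing the shape of reconciliation throttling: only a
// bounded number of new shard movements may start in each round.
class MoveThrottleSketch {
    private final int maxConcurrentMoves;  // akin to cluster_concurrent_rebalance

    MoveThrottleSketch(int maxConcurrentMoves) {
        this.maxConcurrentMoves = maxConcurrentMoves;
    }

    // Given the moves the desired balance calls for and the number already in
    // flight, return the batch allowed to start now. The rest wait for a later
    // reconciliation round, triggered e.g. when a cluster state update reports
    // a completed move or recovery.
    List<String> nextBatch(List<String> desiredMoves, int movesInFlight) {
        int budget = Math.max(0, maxConcurrentMoves - movesInFlight);
        return desiredMoves.subList(0, Math.min(budget, desiredMoves.size()));
    }
}
```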
The `ContinuousComputation` component maintains a queue of desired balance computation requests, each holding the cluster information as of
the time of the request, and a thread that runs the `DesiredBalanceComputer`. The `ContinuousComputation` thread takes the latest request,
with the freshest cluster information, feeds it into the `DesiredBalanceComputer`, and publishes the resulting `DesiredBalance` back to the
`DesiredBalanceShardsAllocator` to use for reconciliation actions. Sometimes the thread's desired balance computation is signalled to end
early so that a `DesiredBalance` is published more quickly, either because newer cluster state changes are significant or to get unassigned
shards started as soon as possible.
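
A minimal sketch of this "latest input wins" pattern follows, assuming (as a simplification of the real `ContinuousComputation`) a single
pending slot rather than an actual queue: a newer request simply replaces an unprocessed one, so the worker always computes from the
freshest cluster information.

```java
import java.util.function.Consumer;

// Simplified, hypothetical version of ContinuousComputation: a dedicated worker
// thread repeatedly processes the most recently submitted input, discarding any
// stale inputs that were submitted while it was busy.
class LatestWinsComputation<T> {
    private T pending;  // guarded by `this`

    LatestWinsComputation(Consumer<T> computation) {
        Thread worker = new Thread(() -> run(computation), "desired-balance-sketch");
        worker.setDaemon(true);
        worker.start();
    }

    // Called on cluster state changes: a newer snapshot replaces an unprocessed one.
    synchronized void submit(T input) {
        pending = input;
        notifyAll();
    }

    private void run(Consumer<T> computation) {
        while (true) {
            T input;
            synchronized (this) {
                while (pending == null) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                input = pending;
                pending = null;
            }
            // e.g. run the DesiredBalanceComputer and publish the DesiredBalance
            computation.accept(input);
        }
    }
}
```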
### Rebalancing Process

There are different priorities in shard allocation, reflected in which moves the `DesiredBalanceReconciler` selects first, given that it
may only move, recover, or remove a limited number of shards at once. The first priority is assigning unassigned shards, primaries being
more important than replicas. The second is moving shards that violate node resource limits or other hard constraints. The
`AllocationDeciders` class holds a group of `AllocationDecider` implementations that place hard constraints on shard allocation: one
decider (`DiskThresholdDecider`) enforces disk usage thresholds, refusing further shard assignment to a node or requiring shards to be
moved off of it; another (`FilterAllocationDecider`) excludes a configurable list of indices from certain nodes; another
(`MaxRetryAllocationDecider`) declines to recover a shard on a node after too many failed retries.
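
The decider pattern can be sketched with a simplified, hypothetical interface, shown below. The real `AllocationDecider` subclasses work
against `ShardRouting`, `RoutingNode`, and `RoutingAllocation`, and the real `Decision` type also has a `THROTTLE` outcome in addition to
yes/no.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the decider pattern: each decider can veto placing
// a shard on a node, and the composite applies them all (any NO is a veto).
enum DecisionSketch { YES, NO }

record ClusterInfoSketch(Map<String, Double> diskUsedFractionByNode) {}

interface DeciderSketch {
    DecisionSketch canAllocate(String shardId, String nodeId, ClusterInfoSketch info);
}

// In the spirit of the disk threshold decider: refuse nodes over a usage limit.
class DiskThresholdSketch implements DeciderSketch {
    private final double highWatermark;

    DiskThresholdSketch(double highWatermark) {
        this.highWatermark = highWatermark;
    }

    @Override
    public DecisionSketch canAllocate(String shardId, String nodeId, ClusterInfoSketch info) {
        double used = info.diskUsedFractionByNode().getOrDefault(nodeId, 0.0);
        return used < highWatermark ? DecisionSketch.YES : DecisionSketch.NO;
    }
}

// Composite like AllocationDeciders: every decider must agree to a placement.
class DecidersSketch {
    private final List<DeciderSketch> deciders;

    DecidersSketch(List<DeciderSketch> deciders) {
        this.deciders = deciders;
    }

    DecisionSketch canAllocate(String shardId, String nodeId, ClusterInfoSketch info) {
        for (DeciderSketch decider : deciders) {
            if (decider.canAllocate(shardId, nodeId, info) == DecisionSketch.NO) {
                return DecisionSketch.NO;  // hard constraint: a single NO vetoes
            }
        }
        return DecisionSketch.YES;
    }
}
```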
The third priority is rebalancing shards to even out the relative weight of shards on each node: the intention is to avoid, or ease, future
hot-spotting on data nodes caused by too many shards being placed on the same data node. A node's shard weight is based on a sum of
factors: disk usage, projected shard write load, total number of shards, and an incentive to spread shards of the same index across
different nodes. See the `WeightFunction` and `NodeAllocationStatsAndWeightsCalculator` classes for more details on the weight calculations
that support the `DesiredBalanceComputer` decisions.
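
As a rough illustration only, the sketch below combines those factors into a single per-node weight as a weighted sum. The `theta`
parameters are hypothetical scaling factors standing in for the balancer settings, and the real `WeightFunction` also works with each
node's deviation from the cluster-wide average rather than raw per-node totals.

```java
// Hypothetical per-node weight in the spirit of WeightFunction: a weighted sum
// of the factors the balancer tries to even out across data nodes.
class NodeWeightSketch {
    static double weight(
        int totalShardsOnNode,
        int shardsOfIndexOnNode,  // incentive to spread one index across nodes
        double writeLoadOnNode,   // projected write load of the node's shards
        double diskUsageOnNode,   // disk used by the node's shards
        double thetaShard,
        double thetaIndex,
        double thetaWriteLoad,
        double thetaDisk
    ) {
        return thetaShard * totalShardsOnNode
            + thetaIndex * shardsOfIndexOnNode
            + thetaWriteLoad * writeLoadOnNode
            + thetaDisk * diskUsageOnNode;
    }
}
```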
# Autoscaling