Commit 2a951eb

review fixes and some self-review
1 parent a6bc4d9 commit 2a951eb

File tree

1 file changed: +29 −27 lines changed

docs/internal/DistributedArchitectureGuide.md

### Core Components

The `DesiredBalanceShardsAllocator` is what runs shard allocation decisions. It leverages the `DesiredBalanceComputer` to produce
`DesiredBalance` instances for the cluster based on the latest cluster changes (add/remove nodes, create/remove indices, load, etc.). Then
the `DesiredBalanceReconciler` is invoked to choose the next steps to take to move the cluster from the current shard allocation to the
latest computed `DesiredBalance` shard allocation. The `DesiredBalanceReconciler` will apply changes to a copy of the `RoutingNodes`, which
is then published in a cluster state update that will reach the data nodes to start the individual shard recovery/deletion/move work.
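
Conceptually, the cycle looks something like the following minimal sketch. All type and method names here are illustrative placeholders,
not the real APIs; the actual classes live in the allocator package and have much richer signatures:

```java
// Minimal sketch of the compute-then-reconcile cycle described above. All names here
// are illustrative placeholders; the real classes have much richer signatures.
class AllocationCycleSketch {
    record ClusterSnapshot(long version) {}        // stands in for the latest cluster state
    record DesiredBalance(long basedOnVersion) {}  // stands in for the computed target allocation

    DesiredBalance computeDesiredBalance(ClusterSnapshot snapshot) {
        // DesiredBalanceComputer: derive the target shard allocation from the snapshot.
        return new DesiredBalance(snapshot.version());
    }

    void reconcile(DesiredBalance target, ClusterSnapshot current) {
        // DesiredBalanceReconciler: apply an incremental, throttled batch of shard
        // moves/recoveries to a copy of the RoutingNodes, then publish that copy as a
        // cluster state update for the data nodes to act on.
    }

    void onClusterChange(ClusterSnapshot snapshot) {
        reconcile(computeDesiredBalance(snapshot), snapshot);
    }
}
```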

The `DesiredBalanceReconciler` is throttled by cluster settings, like the max number of concurrent shard moves and recoveries per cluster
and node: this is why the `DesiredBalanceReconciler` will make, and publish via cluster state updates, incremental changes to the cluster
shard allocation. The `DesiredBalanceShardsAllocator` is the endpoint for reroute requests, which may trigger immediate requests to the
`DesiredBalanceReconciler`, but asynchronous requests to the `DesiredBalanceComputer` via the `ContinuousComputation` component. Cluster
state changes that affect shard balancing (for example index deletion) all call some reroute method interface that reaches the
`DesiredBalanceShardsAllocator` to run reconciliation and queue a request for the `DesiredBalanceComputer`, leading to desired balance
computation and reconciliation actions. Asynchronous completion of a new `DesiredBalance` will also invoke a reconciliation action, as will
cluster state updates completing shard moves/recoveries (unthrottling the next shard move/recovery).
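
The throttling mentioned here is governed by cluster settings such as the concurrent rebalance and recovery limits. A sketch of setting
them programmatically follows; the setting keys are real, but the values are illustrative, not necessarily the defaults for your version:

```java
import org.elasticsearch.common.settings.Settings;

// Sketch of the kind of cluster settings that throttle reconciliation. The values are
// illustrative, not necessarily the defaults for every version.
class ReconcilerThrottlingExample {
    static Settings throttlingSettings() {
        return Settings.builder()
            .put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
            .put("cluster.routing.allocation.node_concurrent_incoming_recoveries", 2)
            .put("cluster.routing.allocation.node_concurrent_outgoing_recoveries", 2)
            .put("cluster.routing.allocation.node_initial_primaries_recoveries", 4)
            .build();
    }
}
```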

The `ContinuousComputation` saves the latest desired balance computation request, which holds the cluster information at the time of that
request, and manages a thread that runs the `DesiredBalanceComputer`. The `ContinuousComputation` thread takes the latest request, with the
associated cluster information, feeds it into the `DesiredBalanceComputer` and publishes a `DesiredBalance` back to the
`DesiredBalanceShardsAllocator` to use for reconciliation actions. Sometimes the `ContinuousComputation` thread's desired balance
computation will be signalled to exit early and publish the initial `DesiredBalance` improvements it has made, when newer rebalancing
requests (due to cluster state changes) have arrived, or in order to begin recovery of unassigned shards as quickly as possible.
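
The "latest request wins" behaviour can be pictured with a small, self-contained sketch (hypothetical names, not the real
`ContinuousComputation` API): a producer overwrites any pending input, and a single worker thread repeatedly drains whatever is newest,
so intermediate cluster states are skipped when the computation cannot keep up.

```java
import java.util.function.Consumer;

// Hypothetical sketch of ContinuousComputation's "latest request wins" behaviour; the
// real class's API differs. Only the newest unprocessed input is kept.
class LatestInputComputation<T> {
    private T latestInput; // guarded by "this"

    synchronized void onNewInput(T input) {
        latestInput = input; // supersede any input that has not been processed yet
        notifyAll();
    }

    void runWorker(Consumer<T> computation) throws InterruptedException {
        while (true) {
            T input;
            synchronized (this) {
                while (latestInput == null) {
                    wait(); // park until a new input arrives
                }
                input = latestInput;
                latestInput = null;
            }
            computation.accept(input); // e.g. run the DesiredBalanceComputer on it
        }
    }
}
```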

### Rebalancing Process

There are different priorities in shard allocation, reflected in which moves the `DesiredBalanceReconciler` selects to do first given that
it can only move, recover, or remove a limited number of shards at once. The first priority is assigning unassigned shards, primaries being
more important than replicas. The second is to move shards that violate any rule (such as node resource limits) as defined by an
`AllocationDecider`. The `AllocationDeciders` class holds a group of `AllocationDecider` implementations that place hard constraints on
shard allocation. There is a decider, `DiskThresholdDecider`, that manages disk usage thresholds, such that further shards may not be
allowed assignment to a node, or shards may be required to move off because they grew to exceed the disk space; or another,
`FilterAllocationDecider`, that excludes a configurable list of indices from certain nodes; or `MaxRetryAllocationDecider`, which will not
attempt to recover a shard on a certain node after so many failed retries. The third priority is to rebalance shards to even out the
relative weight of shards on each node: the intention is to avoid, or ease, future hot-spotting on data nodes due to too many shards being
placed on the same data node. Node shard weight is based on a sum of factors: disk usage, projected shard write load, total number of
shards, and an incentive to distribute shards within the same index across different nodes. See the `WeightFunction` and
`NodeAllocationStatsAndWeightsCalculator` classes for more details on the weight calculations that support the `DesiredBalanceComputer`
decisions.
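
As a rough illustration of the weighting idea, a hypothetical sketch only: the real `WeightFunction` normalizes each term and combines
them with tuned coefficients derived from the balancer settings, but the "sum of factors" shape is the same.

```java
// Hypothetical sketch of node weight as a sum of factors, as described above. Names
// and formula shape here are illustrative only, not the real WeightFunction.
class NodeWeightSketch {
    record NodeStats(int totalShards, int shardsOfThisIndex, double diskUsage, double writeLoad) {}

    static double nodeWeight(NodeStats n, double shardFactor, double indexFactor,
                             double diskFactor, double writeLoadFactor) {
        return shardFactor * n.totalShards()
            + indexFactor * n.shardsOfThisIndex() // incentive to spread an index across nodes
            + diskFactor * n.diskUsage()
            + writeLoadFactor * n.writeLoad();
    }
}
```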

# Autoscaling