Skip to content
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 37 additions & 13 deletions docs/internal/DistributedArchitectureGuide.md
Original file line number Diff line number Diff line change
Expand Up @@ -229,19 +229,43 @@ works in parallel with the storage engine.)

# Allocation

(AllocationService runs on the master node)

(Discuss different deciders that limit allocation. Sketch / list the different deciders that we have.)

### APIs for Balancing Operations

(Significant internal APIs for balancing a cluster)

### Heuristics for Allocation

### Cluster Reroute Command

(How does this command behave with the desired auto balancer.)
### Core Components

The `DesiredBalanceShardsAllocator` is what runs shard allocation decisions. It leverages the `DesiredBalanceComputer` to produce
`DesiredBalance` instances for the cluster based on the latest cluster changes (add/remove nodes, create/remove indices, load, etc). Then
the `DesiredBalanceReconciler` is invoked to choose the next steps to take to move the cluster from the current shard allocation to the
latest computed `DesiredBalance` shard allocation. The `Reconciler` will apply changes to a copy of the `RoutingNodes`, which is then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: DesiredBalanceReconciler not just Reconciler (here and below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

published in a cluster state update that will reach the data nodes to start the individual shard creation/recovery/move work.

The `Reconciler` is throttled by cluster settings, like the max number of concurrent shard moves and recoveries per cluster and node: this
is why the `Reconciler` will make, and publish via cluster state updates, incremental changes to the cluster shard allocation. The
`DesiredBalanceShardsAllocator` is the endpoint for reroute requests, which may trigger immediate requests to the `Reconciler`, but
asynchronous requests to the `DesiredBalanceComputer` via the `ContinuousComputation` component. Cluster state changes that affect shard
balancing (for example index deletion) all call some reroute method interface that reaches the `DesiredBalanceShardsAllocator` to run
reconciliation and queue a request for the `DesiredBalancerComputer`, leading to desired balance computation and reconciliation actions.
Asynchronous completion of a new `DesiredBalance` will also invoke a reconciliation action, as will cluster state updates completing shard
moves/recoveries (unthrottling the next shard move/recovery).

The `ContinuousComputation` maintains a queue of desired balance computation requests, each of which holds the latest cluster information at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of a queue but really it's only tracking the latest request.

Copy link
Contributor Author

@DiannaHohensee DiannaHohensee Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks for catching that.

I guess the enqueued terminology got stuck in my head :)

the time of the request, and a thread that runs the `DesiredBalanceComputer`. The ContinuousComputation thread grabs the latest request,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
the time of the request, and a thread that runs the `DesiredBalanceComputer`. The ContinuousComputation thread grabs the latest request,
the time of the request, and a thread that runs the `DesiredBalanceComputer`. The `ContinuousComputation` thread grabs the latest request,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

with the freshest cluster information, feeds it into the `DesiredBalanceComputer` and publishes a `DesiredBalance` back to the
`DesiredBalanceShardsAllocator` to use for reconciliation actions. Sometimes the `ContinuousComputation` thread's desired balance
computation will be signalled to end early and get a `DesiredBalance` published more quickly, if newer cluster state changes are significant
or to get unassigned shards started as soon as possible.

### Rebalancing Process

There are different priorities in shard allocation, reflected in which moves the `DesiredBalancerReconciler` selects to do first given that
it can only move, recover, or remove a limited number of shards at once. The first priority is assigning unassigned shards, primaries being
more important than replicas. The second is to move shards that violate node resource limits or hard limits. The `AllocationDeciders` holds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

violate node resource limits or hard limits

Really, violating any rule as defined by an AllocationDecider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased 👍

a group of `AllocationDecider` types that place hard constraints on shard allocation. There is a decider that manages disk memory usage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a decider

Maybe name these deciders here so folks can go and look them up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

thresholds that does not allow further shard assignment or requires shards to be moved off; or another that excludes a configurable list of
indices from certain nodes; or one that will not attempt to recover a shard on a certain node after so many failed retries. The third
priority is to rebalance shards to even out the relative weight of shards on each node: the intention is to avoid, or ease, future
hot-spotting on data nodes due to too many shards being placed on the same data node. Node shard weight is based on a sum of factors: disk
memory usage, projected shard write load, total number of shards, and an incentive to spread out shards within the same index. See the
`WeightFunction` and `NodeAllocationStatsAndWeightsCalculator` classes for more details on the weight calculations that support the
`DesiredBalanceComputer` decisions.

# Autoscaling

Expand Down