// Copyright 2024 The Cockroach Authors.
//
// Use of this software is governed by the CockroachDB Software License
// included in the /LICENSE file.

// This package defines and implements a distributed heartbeat-based failure
// detection mechanism. It is used by Raft and powers leader leases.
//
// 1. API
//
// Store liveness provides a simple API that allows callers to ask if the local
// store considers another store to be alive. The signal of whether a store is
// alive is expressed in terms of pair-wise directed support between stores.
//
// Store liveness support has two components:
//
// - epoch: a monotonically increasing integer tracking the current
//   uninterrupted period of support.
//
// - expiration: a timestamp denoting the end of the support period for this
//   epoch, unless it's extended before support for the epoch is withdrawn.
//
// The store liveness API defined in `fabric.go` provides the following methods:
//
// - `SupportFor(id slpb.StoreIdent) (slpb.Epoch, bool)`: returns the current
//   epoch and whether support is provided for the specified store.
//
// - `SupportFrom(id slpb.StoreIdent) (slpb.Epoch, hlc.Timestamp)`: returns the
//   current epoch and expiration timestamp of support received from the
//   specified store.
//
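// As a rough illustration, the two methods can be thought of as forming the
// following interface (a hypothetical distillation; the real definitions live
// in `fabric.go` and may differ in their details):
//
//    // Fabric is a hypothetical distillation of the store liveness API.
//    type Fabric interface {
//        // SupportFor returns the epoch of, and whether, support is
//        // provided for the store with the given ident.
//        SupportFor(id slpb.StoreIdent) (slpb.Epoch, bool)
//
//        // SupportFrom returns the epoch and expiration of support
//        // received from the store with the given ident.
//        SupportFrom(id slpb.StoreIdent) (slpb.Epoch, hlc.Timestamp)
//    }
//
//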
// 2. Algorithm Overview
//
// A local store tracks support received from and provided to remote stores.
// Stores request and provide support using an RPC-based heartbeat mechanism.
//
// 2.1. Requesting Support
//
// A local store requests support from a remote store by sending a heartbeat
// message containing an epoch and expiration. Epochs are monotonically
// increasing, and expirations are timestamps drawn from the requester's clock
// and extended a few seconds into the future. See `requester_state.go` for
// implementation details.
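//
// As a sketch (with hypothetical names; the real logic lives in
// `requester_state.go`), forming a support request might look like:
//
//    // getHeartbeatToSend is a hypothetical sketch of forming a support
//    // request: the expiration is drawn from the requester's clock and
//    // extended by the support duration.
//    func getHeartbeatToSend(
//        epoch slpb.Epoch, clock *hlc.Clock, supportDuration time.Duration,
//    ) slpb.Message {
//        return slpb.Message{
//            Type:       slpb.MsgHeartbeat,
//            Epoch:      epoch,
//            Expiration: clock.Now().Add(supportDuration.Nanoseconds(), 0),
//        }
//    }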
//
// 2.2. Providing Support
//
// A local store also receives support requests from remote stores. The local
// store decides whether to provide support based on the epoch in the support
// request. If the epoch is greater than or equal to the one the local store
// has recorded for the remote store, it records the epoch and extends the
// expiration to the requested one. Otherwise, it rejects the support request
// and doesn't change the support state for the remote store. Regardless of the
// outcome, the local store responds with the current support state for the
// remote store. See `supporter_state.go` for implementation details.
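//
// A minimal sketch of this decision (hypothetical shape; see
// `supporter_state.go` for the real implementation):
//
//    // handleHeartbeat is a hypothetical sketch of the supporter-side
//    // decision; ss is the support state recorded for the remote store.
//    func handleHeartbeat(ss *slpb.SupportState, msg slpb.Message) {
//        // Support is provided only if the requester's epoch is current.
//        if msg.Epoch >= ss.Epoch {
//            ss.Epoch = msg.Epoch
//            ss.Expiration.Forward(msg.Expiration)
//        }
//        // Otherwise the request is rejected and the state is unchanged.
//        // In both cases, the response carries the current support state.
//    }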
//
// 2.3. Withdrawing Support
//
// A local store periodically checks if the support it provides for remote
// stores has expired. When it determines that an expiration timestamp for a
// given remote store is in the past, it increments the remote store's epoch and
// sets the expiration to zero. This effectively withdraws support for the
// remote store. By incrementing the epoch, it ensures that providing support
// for a given epoch is a one-way transition. See `supporter_state.go` for
// implementation details.
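//
// A hypothetical sketch of the withdrawal check for a single remote store:
//
//    // maybeWithdrawSupport withdraws support if the expiration is in
//    // the past. Incrementing the epoch makes support for the old epoch
//    // a one-way transition; a zero expiration denotes no support.
//    func maybeWithdrawSupport(ss *slpb.SupportState, now hlc.Timestamp) {
//        if !ss.Expiration.IsEmpty() && ss.Expiration.Less(now) {
//            ss.Epoch++
//            ss.Expiration = hlc.Timestamp{}
//        }
//    }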
//
// 2.4. Other State Management
//
// A local store also increments its own epoch upon restart to ensure that any
// support received prior to the restart is invalidated. This is not strictly
// necessary for store liveness correctness, but it's useful for the Raft and
// leader lease properties. Support provided for other stores is not withdrawn
// upon restart, as the local store has made a promise to provide support until
// the expiration time.
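//
// A hypothetical sketch of this restart handling (field names are assumed;
// see `storelivenesspb/service.proto` for the persisted state):
//
//    // onRestart bumps the local store's requester epoch so that any
//    // support received before the restart is invalidated. Supporter-side
//    // state is deliberately left intact: promises to provide support
//    // survive restarts.
//    func onRestart(rm *slpb.RequesterMeta) {
//        rm.MaxEpoch++
//    }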
//
// In order to ensure many of the properties in the next section, the algorithm
// also persists much of the support state to disk. See
// `storelivenesspb/service.proto` for details on why each individual field
// does or doesn't need to be persisted. Notably, support received from other
// stores is not persisted since all of that support will be invalidated upon
// restart (by incrementing the epoch).
//
// Store liveness support is established on a per-need basis, as opposed to
// between all pairs of stores in the cluster. The need for support is driven
// by Raft (more on this in the Usage/Raft section). When support from a remote
// store is not requested for some time, the local store stops heartbeating the
// remote store.
//
//
// 3. Properties and Guarantees
//
// The main store liveness guarantees are:
//
// - Support Durability: If a store had support for another store in a given
//   epoch, that support is either still upheld or has expired according to
//   the support provider's clock.
//
// - Support Disjointness: For any requester and supporter, no two support
//   intervals overlap in time.
//
// These and other safety guarantees are formally specified and verified using
// TLA+ in `docs/tla-plus/StoreLiveness/StoreLiveness.tla`.
//
// The store liveness algorithm remains correct in the presence of the
// following faults:
//
// - Node crashes and restarts: State is persisted to disk to handle restarts.
//   See `storelivenesspb/service.proto` for details on how each persisted
//   field is used.
//
// - Disk stalls: Calls to `SupportFor` and `SupportFrom` do not perform disk
//   reads or writes, so no store liveness callers stall. Moreover, heartbeating
//   requires a disk write, so a store with a stalled disk cannot request
//   support.
//
// - Partitions and partial partitions: Pairwise support between stores helps
//   react to any partitions by withdrawing support from unreachable stores.
//
// - Message loss and re-ordering: Lost heartbeats are simply resent on the next
//   iteration of the heartbeat loop, and lost heartbeat responses are replaced
//   by the responses to subsequent heartbeats. Neither hurts correctness, but
//   both could hurt liveness because support could expire in the meantime.
//   Re-ordered messages are handled by ignoring stale messages (with a lower
//   epoch or expiration).
//
// - Clock skew and drift: Clock skew and drift do not affect the correctness of
//   store liveness but could hurt liveness if, for example, a support
//   requester's clock is much slower than the support provider's clock. For
//   the support disjointness property, we need to propagate clock readings
//   across messages and forward the receiver's clock in order to preserve
//   causality (see `MessageBatch.Now` in `storelivenesspb/service.proto` and
//   the sketch after this list).
//
// - Clock regressions upon startup: While clocks are usually assumed to be
//   monotonically increasing even across restarts, store liveness doesn't rely
//   on this property. See the comment for `SupporterMeta.MaxWithdrawn` in
//   `storelivenesspb/service.proto`.
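//
// A hypothetical sketch of the clock forwarding mentioned in the clock skew
// bullet above:
//
//    // onReceive forwards the receiver's HLC to the clock reading carried
//    // in the message batch, so that any expiration subsequently drawn
//    // from the receiver's clock is causally after the sender's reading.
//    func onReceive(clock *hlc.Clock, batch *slpb.MessageBatch) {
//        clock.Update(batch.Now)
//    }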
//
// 4. Usage
//
// As a general-purpose failure detection layer, store liveness can be queried
// by any component at or below the store level. However, the primary consumer
// of the store liveness signal is Raft, in service of leader leases.
//
// 4.1. Raft and Leader Leases
//
// Raft uses fortification to strengthen leader election with a promise of store
// liveness support. A Raft leader is considered fortified when it has been
// elected as the leader via Raft, and its store has received support from the
// stores of a quorum of peers. The earliest support expiration among these
// peers (the quorum supported expiration, or QSE) serves as a promise that
// these peers will not campaign or vote for a new leader until that time. This
// promise guarantees that the fortified leader hasn't been replaced by another
// peer, so it can serve as the leaseholder until the QSE.
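//
// A hypothetical sketch of how the QSE could be computed from per-peer
// support expirations (the real logic lives in
// `raft/tracker/fortificationtracker.go`):
//
//    // quorumSupportedExpiration returns the latest timestamp through
//    // which a quorum of peers supports the leader's store: with the
//    // expirations sorted in descending order, it is the quorum-th
//    // largest expiration.
//    func quorumSupportedExpiration(
//        exps []hlc.Timestamp, quorum int,
//    ) hlc.Timestamp {
//        if len(exps) < quorum {
//            return hlc.Timestamp{} // no quorum of support
//        }
//        sort.Slice(exps, func(i, j int) bool { return exps[j].Less(exps[i]) })
//        return exps[quorum-1]
//    }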
//
// A leader uses the ability to evaluate support from a quorum in two ways:
//
// - To determine whether to step down as leader.
// - To determine when a leader lease associated with the leader's term expires.
//
// Conceptually, leader fortification can be thought of as a one-time Raft-level
// heartbeat broadcast which hands the responsibility of heartbeating the
// leader's liveness over to the store liveness layer. See
// `raft/tracker/fortificationtracker.go` for more details on the interaction
// between store liveness and Raft.
//
// Note that the expiration timestamps that store liveness exposes are used as
// timestamps from the MVCC domain, not from the real time domain. So, it's
// important that these expirations always come from the support requester's
// clock, which in the leader lease layer belongs to the potential leaseholder
// that uses the expiration as a signal for lease expiration. This also implies
// that when the local store has support up to time t from another store, it
// doesn't matter exactly when the other store withdraws support after its
// clock exceeds time t; both stores have agreed that the promise for support
// applies up to MVCC time t (e.g. proposals with timestamps less than t).
//
// 5. Key Components
//
// 5.1. SupportManager
//
// The `SupportManager` orchestrates store liveness operations. It handles:
//
// - Sending periodic heartbeats to other stores.
// - Processing incoming heartbeat messages.
// - Periodically checking for support withdrawal.
// - Adding and removing stores from the support data structures.
//
// The `SupportManager` delegates these operations to the
// `supporterStateHandler` and `requesterStateHandler` components, which
// implement the actual logic for maintaining correct store liveness state.
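//
// A hypothetical sketch of the shape of the `SupportManager`'s main loop
// (names and fields are assumed; see `support_manager.go` for the real
// implementation):
//
//    // run drives heartbeats, withdrawal checks, and message handling
//    // from a single goroutine.
//    func (sm *SupportManager) run(ctx context.Context) {
//        heartbeat := time.NewTicker(sm.heartbeatInterval)
//        withdrawal := time.NewTicker(sm.withdrawalInterval)
//        defer heartbeat.Stop()
//        defer withdrawal.Stop()
//        for {
//            select {
//            case <-heartbeat.C:
//                sm.sendHeartbeats(ctx)
//            case <-withdrawal.C:
//                sm.withdrawSupport(ctx)
//            case msgs := <-sm.receiveQueue:
//                sm.handleMessages(ctx, msgs)
//            case <-ctx.Done():
//                return
//            }
//        }
//    }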
//
// 5.2. Transport
//
// The `Transport` handles network communication for store liveness messages.
// This component is instantiated at the node level and maintains message
// handlers for the individual stores on the node. It provides:
//
// - Asynchronous message sending and receiving. `Transport` stores outgoing
//   messages in per-destination-node queues, while incoming messages are
//   streamed and routed to per-store receive queues in each store's
//   `SupportManager`.
//
// - Message batching for efficiency. Outgoing messages are batched by
//   accumulating them in the send queues for a short duration (10ms). This
//   process is also aided by synchronizing the sending of heartbeats across all
//   stores on a given node. A sketch of the batching loop follows below.
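//
// A hypothetical sketch of the per-destination batching loop (names are
// assumed; see `transport.go` for the real implementation):
//
//    // processQueue accumulates outgoing messages for a short window and
//    // flushes them as a single batch.
//    func (t *Transport) processQueue(ctx context.Context, q chan slpb.Message) {
//        const batchDuration = 10 * time.Millisecond
//        var batch slpb.MessageBatch
//        for {
//            select {
//            case msg := <-q:
//                batch.Messages = append(batch.Messages, msg)
//                // Drain messages that arrive within the batching window.
//                timeout := time.After(batchDuration)
//                for done := false; !done; {
//                    select {
//                    case msg := <-q:
//                        batch.Messages = append(batch.Messages, msg)
//                    case <-timeout:
//                        done = true
//                    }
//                }
//                t.sendBatch(ctx, &batch)
//                batch.Messages = batch.Messages[:0]
//            case <-ctx.Done():
//                return
//            }
//        }
//    }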
//
// 5.3. Configuration
//
// Store liveness can be enabled/disabled using the `kv.store_liveness.enabled`
// cluster setting. When the setting is disabled, store liveness does not send
// heartbeats, but it still responds to support requests from other stores, as
// well as to calls to `SupportFor` and `SupportFrom`. This is required for
// correctness.
//
// Additionally, `config.go` defines tunable configuration parameters for the
// various timeouts and intervals that store liveness uses. Other intervals
// (like the support duration and heartbeat interval) are defined in
// `RaftConfig.StoreLivenessDurations()` in `base/config.go` in order to stay
// more closely tuned to Raft parameters.
//
// 5.4. Observability
//
// `TransportMetrics` and `SupportManagerMetrics` in `metrics.go` expose various
// metrics for monitoring store liveness status, including heartbeat success and
// failure rates, support relationship statistics (support provided and
// received), transport-level metrics, and more.
//
// Store liveness support state is also exposed via `inspectz` endpoints
// (defined in `inspectz/inspectz.go`):
//
// - `inspectz/storeliveness/supportFor`: A list of all stores for which the
//   local store provides support, including epochs and expiration times.
//
// - `inspectz/storeliveness/supportFrom`: A list of all stores from which the
//   local store receives support, including epochs and expiration times.
//
// At the default logging level, store liveness logs important state transitions
// like starting to heartbeat a store, providing/receiving support for the first
// time, and withdrawing support. At vmodule level 2, it logs additional info
// like the number of heartbeats sent and messages handled. At vmodule level 3,
// it logs all support extensions as well.
//
// 5.5. Testing
//
// The package includes comprehensive tests in:
//
// - `support_manager_test.go`: Unit tests for the `SupportManager`.
//
// - `transport_test.go`: Unit tests for the `Transport`.
//
// - `store_liveness_test.go`: Data-driven unit tests for the store liveness
//   logic implementation.
//
// - `multi_store_test.go`: End-to-end tests that validate the store liveness
//   behavior in the happy case and under the three major fault patterns: node
//   restarts, disk stalls, and partial network partitions.

package storeliveness