Skip to content

Commit a7c304f

Browse files
committed
storeliveness: add documentation
This commit adds a doc.go file in the storeliveness package to document the main functionality and interactions with other components. Fixes: #126198 Release note: None
1 parent 62a0436 commit a7c304f

File tree

2 files changed

+260
-0
lines changed

2 files changed

+260
-0
lines changed

pkg/kv/kvserver/storeliveness/BUILD.bazel

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ go_library(
44
name = "storeliveness",
55
srcs = [
66
"config.go",
7+
"doc.go",
78
"fabric.go",
89
"metrics.go",
910
"persist.go",
Lines changed: 259 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,259 @@
1+
// Copyright 2024 The Cockroach Authors.
2+
//
3+
// Use of this software is governed by the CockroachDB Software License
4+
// included in the /LICENSE file.
5+
6+
// This package defines and implements a distributed heartbeat-based failure
7+
// detection mechanism. It is used by Raft and powers leader leases.
8+
//
9+
// 1. API
10+
//
11+
// Store liveness provides a simple API that allows callers to ask if the local
12+
// store considers another store to be alive. The signal whether a store is
13+
// alive or not is expressed in terms of pair-wise directed support between
14+
// stores.
15+
//
16+
// Store liveness support has two components:
17+
//
18+
// - epoch: a monotonically increasing integer tracking the current
19+
// uninterrupted period of support.
20+
//
21+
// - expiration: a timestamp denoting the end of the support period for this
22+
// epoch, unless it's extended before support for the epoch is withdrawn.
23+
//
24+
// The store liveness API defined in `fabric.go` provides the following methods:
25+
//
26+
// - `SupportFor(id slpb.StoreIdent) (slpb.Epoch, bool)`: returns the current
27+
// epoch and whether support is provided for the specified store.
28+
//
29+
// - `SupportFrom(id slpb.StoreIdent) (slpb.Epoch, hlc.Timestamp)`: returns the
30+
// current epoch and expiration timestamp of support received from the
31+
// specified store.
32+
//
33+
//
34+
// 2. Algorithm Overview
35+
//
36+
// A local store tracks support received from and provided to remote stores.
37+
// Stores request and provide support using an RPC-based heartbeat mechanism.
38+
//
39+
// 2.1. Requesting Support
40+
//
41+
// A local store requests support from a remote store by sending a heartbeat
42+
// message containing an epoch and expiration. Epochs are monotonically
43+
// increasing, and expirations are timestamps drawn from the requester's clock
44+
// and extended to a few seconds into the future. See `requester_state.go` for
45+
// implementation details.
46+
//
47+
// 2.2. Providing Support
48+
//
49+
// A local store also receives support requests from remote stores. The local
50+
// store can decide to either provide support or not based on the epoch in the
51+
// support request. If the epoch is greater than of equal to the one the local
52+
// store has recorded for the remote store, it records/extends the expiration
53+
// with the requested one. Otherwise, it rejects the support request and doesn't
54+
// change the support state for the remote store. Regardless of the outcome, the
55+
// local store responds with the current support state for the remote store. See
56+
// `supporter_state.go` for implementation details.
57+
//
58+
// 2.3. Withdrawing Support
59+
//
60+
// A local store periodically checks if the support it provides for remote
61+
// stores has expired. When it determines that an expiration timestamp for a
62+
// given remote store is in the past, it increments the remote store's epoch and
63+
// sets the expiration to zero. This effectively withdraws support for the
64+
// remote store. By incrementing the epoch, it ensures that providing support
65+
// for a given epoch is a one-way transition. See `supporter_state.go` for
66+
// implementation details.
67+
//
68+
// 2.4. Other State Management
69+
//
70+
// A local store also increments its own epoch upon restart to ensure that any
71+
// support received prior to the restart is invalidated. This is not strictly
72+
// necessary for the store liveness correctness, but it's useful for the Raft
73+
// and leader lease properties. Support provided for other stores is not
74+
// withdrawn upon restart, as the local store has made a promise to provide
75+
// support until the expiration time.
76+
//
77+
// In order to ensure many of the properties in the next section, the algorithm
78+
// also persists a lot of the support state to disk. See
79+
// `storelivenesspb/service.proto` for details on why each individual field
80+
// needs to be persisted or not. Notably, support received from other stores is
81+
// not persisted since all of that support will be invalidated upon restart (by
82+
// incrementing the epoch).
83+
//
84+
// Store liveness support is established on a per-need basis, as opposed to
85+
// between all pairs of stores in the cluster. The need for support is driven
86+
// by Raft (more on this in the Usage/Raft section). When support from a remote
87+
// store is not requested for some time, the local store stops heartbeating the
88+
// remote store.
89+
//
90+
//
91+
// 3. Properties and Guarantees
92+
//
93+
// The main store liveness guarantees are:
94+
//
95+
// - Support Durability: If a store had support for another store in a given
96+
// epoch, either support is still upheld or has expired according to the
97+
// support provider's clock.
98+
//
99+
// - Support Disjointness: For any requester and supporter, no two support
100+
// intervals overlap in time.
101+
//
102+
// These and other safety guarantees are formally specified
103+
// and verified using TLA+ in `docs/tla-plus/StoreLiveness/StoreLiveness.tla`.
104+
//
105+
// The store liveness algorithm remains correct in the presence of the following
106+
// faults:
107+
//
108+
// - Node crashes and restarts: State is persisted to disk to handle restarts.
109+
// See `storelivenesspb/service.proto` for details on how each persisted field
110+
// is used.
111+
//
112+
// - Disk stalls: Calls to `SupportFor` and `SupportFrom` do not perform disk
113+
// reads or writes, so no store liveness callers stall. Moreover, heartbeating
114+
// requires a disk write, so a store with a stalled disk cannot request
115+
// support.
116+
//
117+
// - Partitions and partial partitions: Pairwise support between stores helps
118+
// react to any partitions by withdrawing support from unreachable stores.
119+
//
120+
// - Message loss and re-ordering: Lost heartbeats are simply resent on the next
121+
// iteration of the heartbeat loop, and lost heartbeat responses are replaced
122+
// by the responses to subsequent heartbeats. Neither hurts correctness, but
123+
// could hurt liveness as support could expire in the meantime. Re-ordered
124+
// messages are handled by ignoring stale messages (with a lower epoch or
125+
// expiration).
126+
//
127+
// - Clock skew and drift: Clock skew and drift do not affect the correctness of
128+
// store liveness but could hurt liveness if, for example, a support
129+
// requester's clock is much slower than the support provider's clocks. For
130+
// the support disjointness property, we need to propagate clock readings
131+
// across messages and forward the receiver's clock in order to preserve
132+
// causality (see `MessageBatch.Now` in `storelivenesspb/service.proto`).
133+
//
134+
// - Clock regressions upon startup: While clocks are usually assumed to be
135+
// monotonically increasing even across restarts, store liveness doesn't rely
136+
// on this property. See the comment for `SupporterMeta.MaxWithdrawn` in
137+
// `storelivenesspb/service.proto`.
138+
//
139+
// 4. Usage
140+
//
141+
// As a general-purpose failure detection layer, store liveness can be
142+
// queried by any component at or below the store level. However, the primary
143+
// consumer of store liveness signal is Raft in service of leader leases.
144+
//
145+
// 4.1. Raft and Leader Leases
146+
//
147+
// Raft uses fortification to strengthen leader election with a promise of store
148+
// liveness support. A Raft leader is considered fortified when it has been
149+
// elected as the leader via Raft, and its store has received support from the
150+
// stores of a quorum of peers. The earliest support expiration among these
151+
// peers (the quorum supported expiration or QSE) serves as a promise that these
152+
// peers will not campaign or vote for a new leader until that time. This
153+
// promise guarantees the fortified leader that it hasn't been replaced by
154+
// another peer, and can thus serve as the leaseholder until the QSE.
155+
//
156+
// A leader uses the ability to evaluate support from a quorum in two ways:
157+
//
158+
// - To determine whether to step down as leader.
159+
// - To determine when a leader lease associated with the leader's term expires.
160+
//
161+
// Conceptually, leader fortification can be thought of as a one-time Raft-level
162+
// heartbeat broadcast which hands the responsibility of heartbeating the
163+
// leader's liveness over to the store liveness layer. See
164+
// `raft/tracker/fortificationtracker.go` for more details on the interaction
165+
// between store liveness and Raft.
166+
//
167+
// Note that the expiration timestamps that store liveness exposes are used as
168+
// timestamps from the MVCC domain, and not from the real time domain. So, it's
169+
// important that these expirations always come from the support requester's
170+
// clock, which in the leader lease layer is potentially the leaseholder that
171+
// uses the expiration as a signal for lease expiration. This also implies that
172+
// when the local store has support up to time t from another store, it doesn't
173+
// matter exactly when the other store withdraws support after its clock exceeds
174+
// time t; both stores have agreed that the promise for support applies up to
175+
// MVCC time t (e.g. proposals with timestamps less than t).
176+
//
177+
// 5. Key Components
178+
//
179+
// 5.1. SupportManager
180+
//
181+
// The `SupportManager` orchestrates store liveness operations. It handles:
182+
//
183+
// - Sending periodic heartbeats to other stores.
184+
// - Processing incoming heartbeat messages.
185+
// - Periodically checking for support withdrawal.
186+
// - Adding and removing stores from the support data structures.
187+
//
188+
// The `SupportManager` delegates these operations to the
189+
// `supporterStateHandler` and `requesterStateHandler` components, which
190+
// implement the actual logic for maintaining correct store liveness state.
191+
//
192+
// 5.2. Transport
193+
//
194+
// The `Transport` handles network communication for store liveness messages.
195+
// This component is instantiated at the node level and maintains message
196+
// handlers for the individual stores on the node. It provides:
197+
//
198+
// - Asynchronous message sending and receiving. `Transport` stores outgoing
199+
// messages in per-destination-node queues, while incoming messages are
200+
// streamed and routed to per-store receive queues in each store's
201+
// `SupportManager`.
202+
//
203+
// - Message batching for efficiency. Outgoing messages are batched by
204+
// accumulating them in the send queues for a short duration (10ms). This
205+
// process is also aided by synchronizing the sending of heartbeats across all
206+
// stores on a given node.
207+
//
208+
// 5.3. Configuration
209+
//
210+
// Store liveness can be enabled/disabled using the `kv.store_liveness.enabled`
211+
// cluster setting. When the setting is disabled, store liveness will not send
212+
// heartbeats, but will still respond to support requests by other stores, as
213+
// well as calls to `SupportFor` and `SupportFrom`. This is required for
214+
// correctness.
215+
//
216+
// Additionally, `config.go` defines tunable configuration parameters for the
217+
// various timeouts and intervals that store liveness uses. Other intervals
218+
// (like support duration and heartbeat interval) are defined in
219+
// `RaftConfig.StoreLivenessDurations()` in `base/config.go` in order to stay
220+
// more closely tuned to Raft parameters.
221+
//
222+
// 5.4. Observability
223+
//
224+
// `TransportMetrics` and `SupportManagerMetrics` in `metrics.go` expose various
225+
// metrics for monitoring store liveness status, including heartbeat success and
226+
// failure rates, support relationship statistics (support provided and
227+
// received), transport-level metrics, and more.
228+
//
229+
// Store liveness support state is also exposed via `inspectz` endpoints
230+
// (defined in `inspectz/inspectz.go`):
231+
//
232+
// - `inspectz/storeliveness/supportFor`: A list of all stores for which the
233+
// local store provides support, including epochs and expiration times.
234+
//
235+
// - `inspectz/storeliveness/supportFrom`: A list of all stores from which the
236+
// local store receives support, including epochs and expiration times.
237+
//
238+
// At the default logging level, store liveness logs important state transitions
239+
// like starting to heartbeat a store, providing/receiving support for the first
240+
// time, and withdrawing support. At vmodule level 2, it logs additional info
241+
// like the number of heartbeats sent and messages handled. At vmodule level 3,
242+
// it logs all support extensions as well.
243+
//
244+
// 5.5. Testing
245+
//
246+
// The package includes comprehensive tests in:
247+
//
248+
// - `support_manager_test.go`: Unit tests for the `SupportManager`.
249+
//
250+
// - `transport_test.go`: Unit tests for the `Transport`.
251+
//
252+
// - `store_liveness_test.go`: Data-driven unit tests for the store liveness
253+
// logic implementation.
254+
//
255+
// - `multi_store_test.go`: End-to-end tests that validate the store liveness
256+
// behavior in the happy case and the three major fault patterns: node
257+
// restarts, disk stalls, and partial network partitions.
258+
259+
package storeliveness

0 commit comments

Comments
 (0)