Commit a5660a6
Merge #156830
156830: storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick r=miraradeva,iskettaneh a=dodeca12
This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent thundering herd of goroutine spikes.
### Changes
Core changes are within these files:
```sh
pkg/kv/kvserver/storeliveness
├── support_manager.go # Rename SendAsync→EnqueueMessage, add smearing settings
└── transport.go # Add smearing sender goroutine to transport.go which takes care of smearing when enabled
```
### Background
Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process.
### Commits
**Commit: Introduce heartbeat smearing**
- Adds a smearing sender goroutine to `transport.go` that batches enqueued messages
- Smears send signals across queues using `taskpacer` to spread traffic over time
- Configurable via cluster settings (default: enabled)
**How it works:**
1. Messages are enqueued via `EnqueueMessage()` into per-node queues
2. When `SendAllEnqueuedMessages()` is called, transport's smearing sender goroutine waits briefly to batch messages
3. Transport's smearing sender goroutine uses `taskpacer` to pace signaling to each queue over a smear duration
4. Each `processQueue` goroutine drains its queue and sends when signalled
### New Cluster Settings
- `kv.store_liveness.heartbeat_smearing.enabled` (default: true) - Enable/disable smearing
- `kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) - Batching window duration
- `kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) - Time to spread sends across queues
### Backward Compatibility
- Feature is disabled by setting `kv.store_liveness.heartbeat_smearing.enabled=false`
- When disabled, messages are sent immediately via the existing path (non-smearing mode)
### Testing
- Added comprehensive unit tests verifying:
- Messages batch correctly across multiple destinations
- Smearing spreads signals over the configured time window
- Smearing mode vs immediate mode behaviour
- Message ordering and reliability
All existing tests updated to call `SendAllEnqueuedMessages()` after enqueuing when smearing is enabled.
#### Roachprod testing
##### Prototype #1
- Ran a prototype with a [similar design](#154942) (TL;DR of prototype, the heartbeats were smeared via `SupportManager` goroutines being put to sleep; this current design ensures that `SupportManager` goroutines do not get blocked) on a roachprod with 150 node cluster to verify smearing works.
| Before changes (current behaviour on master) | After changes (prototype) |
|--------|--------|
| <img width="2680" height="570" alt="image" src="https://github.com/user-attachments/assets/32fe6ee0-437f-48eb-b3f1-087a3eafe6ac" /> | <img width="2692" height="634" alt="image" src="https://github.com/user-attachments/assets/66b5b82b-bbc4-4f47-a13e-5f6d42a1c6d4" /> |
##### Current changes
- Ran a roachprod test with current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour.
- Ran a mini-roachprod test on [this prototype commit](https://github.com/cockroachdb/cockroach/pull/155317/files#diff-9282b4b1d9a2fe32fae81e5776eb081e58069b4bc7db76718873b75f026e16c1) (where the only real difference between my changes is the inclusion of the length check on the queues that have messages on that commit) showed expected smearing behaviour.
<img width="1797" height="469" alt="image" src="https://github.com/user-attachments/assets/bd7778ef-9f8d-4dbf-8ed2-dac40e7fb03c" />
Fixes: #148210
Release note: None
Co-authored-by: Swapneeth Gorantla <[email protected]>File tree
6 files changed
+628
-88
lines changed- pkg
- kv/kvserver
- storeliveness
- server
- testutils/localtestcluster
6 files changed
+628
-88
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
248 | 248 | | |
249 | 249 | | |
250 | 250 | | |
251 | | - | |
| 251 | + | |
252 | 252 | | |
253 | 253 | | |
254 | 254 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
30 | 30 | | |
31 | 31 | | |
32 | 32 | | |
| 33 | + | |
33 | 34 | | |
34 | 35 | | |
35 | 36 | | |
36 | 37 | | |
| 38 | + | |
37 | 39 | | |
38 | 40 | | |
39 | 41 | | |
| |||
0 commit comments