Commit 5297629

docs: document divergence (hashicorp#668)
Adds a document about where HashiCorp Raft diverges from the original paper.
This is not meant to be exhaustive or comprehensive. Additions and edits
welcome!

* clarify where the "second fix" came from thanks to @otoolep
* add prevote and leadership transfer thanks to @tgross
1 parent 10dcdbf commit 5297629

2 files changed

+112
-0
lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ or contributing to the code.
 2. [Operations](#operations)
    1. [Apply](./apply.md)
 3. [Threads](#threads)
+4. [Divergence](./divergence.md)


 ## Terminology

docs/divergence.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# HashiCorp Raft Divergences

In 2013 HashiCorp created its own Raft implementation based on the newly
released [Raft paper by Diego Ongaro and John Ousterhout][paper]. This was
before [Diego's subsequent Raft dissertation][diss] in 2014, and long before
third-party analyses such as Heidi Howard and Ittai Abraham's [Raft does not
Guarantee Liveness in the face of Network Faults][live] in 2020.[^1]

Use of HashiCorp's Raft library grew rapidly through [Consul][consul] and
[Nomad][nomad], and [later Vault][vault], in parallel with the rapidly
expanding use of Raft in [etcd][etcd] and other implementations.

The explosion in activity across live systems and research led to wide
divergence not only between implementations, but also between implementations
and the original paper and dissertation.

This document attempts to explain where HashiCorp Raft either meaningfully
diverges from the original Raft paper or makes an implementation choice not
explicitly outlined in the paper.

This is **not** expected to be a comprehensive list. Additions and edits are
welcome!

## Asynchronous Heartbeats

The Raft paper defines heartbeats as empty AppendEntries RPCs, which the
leader sends to each server after elections and during idle periods to prevent
election timeouts.

HashiCorp Raft performs [heartbeating concurrently][async-heart] with other
AppendEntries RPCs, so the election timeout does not have to be set high
enough to account for the slowest acceptable disk operation. This allows the
heartbeat timeout to detect network partitions much more quickly without
risking an election during periodic but ephemeral spikes in disk I/O latency.
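The idea of decoupling heartbeats from log replication can be illustrated with a minimal sketch. It is not HashiCorp Raft's code: `replicate`, `heartbeatsDuringStalledAppend`, and the channel-based "RPCs" are hypothetical names chosen for illustration, and a blocked function call stands in for a slow disk write.

```go
package main

import "time"

// replicate is a hypothetical per-follower loop. Heartbeats run in their own
// goroutine, so an append that stalls on disk never delays them.
func replicate(appends <-chan func(), heartbeats chan<- struct{}, interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				select {
				case heartbeats <- struct{}{}: // stands in for an empty AppendEntries RPC
				case <-stop:
					return
				}
			case <-stop:
				return
			}
		}
	}()

	for {
		select {
		case appendEntries := <-appends:
			appendEntries() // may block for a long time on slow disk I/O
		case <-stop:
			return
		}
	}
}

// heartbeatsDuringStalledAppend stalls one append and counts the heartbeats
// that still arrive before the append is released.
func heartbeatsDuringStalledAppend(want int) int {
	appends := make(chan func())
	heartbeats := make(chan struct{})
	stop := make(chan struct{})
	release := make(chan struct{})
	defer close(stop)

	go replicate(appends, heartbeats, time.Millisecond, stop)
	appends <- func() { <-release } // simulate a disk write that hangs

	seen := 0
	for seen < want {
		<-heartbeats // received while the append is still stalled
		seen++
	}
	close(release) // let the stalled append finish
	return seen
}
```

Calling `heartbeatsDuringStalledAppend(3)` returns only because heartbeats keep flowing while the append is stuck. If both shared a single loop, the call would hang until the disk write completed.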

## Rejecting votes when there's already a leader

The [Raft does not Guarantee Liveness][live] paper describes how certain
partitions can prevent Raft clusters from making progress by causing continual
elections.

HashiCorp Raft implements the second of the fixes suggested in Howard and
Abraham's paper: rejecting RequestVote RPCs when there is already an
established leader. The paper defines this more precisely as:

> ...ignore RequestVote RPCs if they have received an AppendEntries RPC from
> the leader within the election timeout.

This approach is actually mentioned in the "Cluster membership changes"
section of the original Raft paper, which explicitly excludes its use during
"normal" elections:

> To prevent this problem, servers disregard RequestVote RPCs when they believe
> a current leader exists. Specifically, if a server receives a RequestVote RPC
> within the minimum election timeout of hearing from a current leader, it does
> not update its term or grant its vote. This does not affect normal
> elections...

So HashiCorp Raft follows the later paper's suggestion and ignores the original
paper's exclusion of this logic during normal operation.
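As a rough sketch, the rule quoted above can be written as a guard at the top of a vote handler. This follows the wording of the quotes rather than HashiCorp Raft's actual code: the `server` and `voteRequest` types, the `electionTimeout` value, and the `sinceLeaderContact` parameter are all assumptions made for illustration, and the log up-to-date check is omitted for brevity.

```go
package main

import "time"

// electionTimeout is an illustrative value, not a library default.
const electionTimeout = 1000 * time.Millisecond

// server and voteRequest are hypothetical, trimmed-down types.
type server struct {
	currentTerm uint64
	votedFor    string
}

type voteRequest struct {
	Term      uint64
	Candidate string
}

// handleRequestVote reports whether the vote was granted. sinceLeaderContact
// is how long ago this server last heard from the current leader.
func (s *server) handleRequestVote(req voteRequest, sinceLeaderContact time.Duration) bool {
	// The divergence: while the leader looks alive, ignore the request
	// entirely, leaving both term and vote untouched.
	if sinceLeaderContact < electionTimeout {
		return false
	}
	if req.Term < s.currentTerm {
		return false
	}
	if req.Term > s.currentTerm {
		s.currentTerm = req.Term
		s.votedFor = ""
	}
	if s.votedFor != "" && s.votedFor != req.Candidate {
		return false
	}
	s.votedFor = req.Candidate
	return true
}
```

With this guard, a partitioned server that keeps timing out and bumping its term cannot drag healthy followers into a new election while they are still receiving AppendEntries from the leader.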

## Pre-Vote

[HashiCorp Raft implements the Pre-Vote extension][prevote-pr] defined in the
[Raft dissertation][diss] (§9.6). Pre-Vote is an optimization in which a
candidate discovers whether its index is up to date, and therefore whether it
is able to win an election, before incrementing its term and causing an
election.

The Pre-Vote extension is enabled by default but may be disabled using the
[Config.PreVoteDisabled][prevote-config] flag.
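The following simplified model sketches the Pre-Vote idea: the candidate's term only advances if a majority would consider its log up to date. The names `logPosition`, `upToDate`, `wouldWinElection`, and `nextTerm` are hypothetical, and a real pre-vote round involves RPCs and further checks that are left out here.

```go
package main

// logPosition is a hypothetical summary of the end of a server's log.
type logPosition struct {
	LastTerm  uint64
	LastIndex uint64
}

// upToDate reports whether the candidate's log is at least as up to date as
// the voter's: a later last term wins, and equal terms compare by index.
func upToDate(candidate, voter logPosition) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

// wouldWinElection counts the pre-votes a candidate would collect from the
// other servers, plus its own, and checks for a strict majority.
func wouldWinElection(candidate logPosition, others []logPosition) bool {
	granted := 1 // the candidate's own vote
	for _, voter := range others {
		if upToDate(candidate, voter) {
			granted++
		}
	}
	clusterSize := len(others) + 1
	return granted > clusterSize/2
}

// nextTerm increments the term only when the pre-vote round succeeds, so a
// candidate that cannot win never disturbs the cluster with a higher term.
func nextTerm(currentTerm uint64, candidate logPosition, others []logPosition) uint64 {
	if !wouldWinElection(candidate, others) {
		return currentTerm
	}
	return currentTerm + 1
}
```

In this model a stale candidate keeps its current term, while a candidate whose log is at least as current as a majority's proceeds to a real election with an incremented term.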

## Leadership Transfer

[HashiCorp Raft implements the Leadership Transfer extension][transleader-pr]
as defined in the [Raft dissertation][diss] (§3.10). Leadership transfer is an
optimization that allows the current leader to hand off leadership to a
follower to avoid waiting for the election timeout during regular operations
such as restarts and upgrades.

While leadership transfer is defined in the Raft dissertation, HashiCorp Raft
extends the specification slightly because of _another_ divergence in HashiCorp
Raft: [rejecting votes when there's already a
leader](#rejecting-votes-when-theres-already-a-leader). Since other followers
would reject the intended new leader's request for a vote, HashiCorp Raft adds
an extra [`LeadershipTransfer` flag][transleader-flag] to override that
behavior in the case of leadership transfers.

All Raft members should support leadership transfers before a transfer is
attempted. The feature is **not** enabled by default and must be triggered
explicitly at the application level. Consul was the first to implement this via
mechanisms in its [API/CLI][transleader-cli] and [graceful agent
shutdown][transleader-shutdown].
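The interaction between the `LeadershipTransfer` flag and vote rejection described in this section can be sketched as a small predicate. This is a hypothetical model, not the library's implementation: the local `voteRequest` type only mirrors the name of the linked flag, and `rejectForKnownLeader` and the string server names are assumptions for illustration.

```go
package main

// voteRequest is a hypothetical, trimmed-down vote request. Only the
// LeadershipTransfer field name is taken from the flag linked in this section.
type voteRequest struct {
	Candidate          string
	LeadershipTransfer bool
}

// rejectForKnownLeader reports whether a follower should reject a vote
// request because it already knows of a leader. knownLeader is empty when the
// follower does not currently know of one.
func rejectForKnownLeader(knownLeader string, req voteRequest) bool {
	// A transfer initiated by the leader overrides the rejection rule.
	if req.LeadershipTransfer {
		return false
	}
	return knownLeader != "" && knownLeader != req.Candidate
}
```

Without the override, the transfer target's election would be rejected by every follower that still recognizes the old leader, and the hand-off would stall until the election timeout expired.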

[^1]: See https://raft.github.io/ for a comprehensive list of papers and
    resources.

[paper]: https://raft.github.io/raft.pdf
[diss]: https://github.com/ongardie/dissertation#readme
[live]: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
[consul]: https://github.com/hashicorp/consul
[nomad]: https://github.com/hashicorp/nomad
[vault]: https://github.com/hashicorp/vault
[etcd]: https://etcd.io/
[async-heart]: https://github.com/hashicorp/raft/blob/v1.7.3/replication.go#L385-L387
[prevote-pr]: https://github.com/hashicorp/raft/pull/530
[prevote-config]: https://pkg.go.dev/github.com/hashicorp/raft#Config.PreVoteDisabled
[transleader-pr]: https://github.com/hashicorp/raft/pull/306
[transleader-flag]: https://pkg.go.dev/github.com/hashicorp/raft#RequestVoteRequest.LeadershipTransfer
[transleader-cli]: https://github.com/hashicorp/consul/issues/5405
[transleader-shutdown]: https://github.com/hashicorp/consul/issues/5406
