|
| 1 | +# HashiCorp Raft Divergences |
| 2 | + |
| 3 | +In 2013 HashiCorp created its own Raft implementation based on the |
| 4 | +just-released [Raft paper by Diego Ongaro and John Ousterhout][paper]. This was |
| 5 | +before [Diego's subsequent Raft dissertation][diss] in 2014, and long before |
| 6 | +third-party analyses such as Heidi Howard and Ittai Abraham's [Raft does not |
| 7 | +Guarantee Liveness in the face of Network Faults][live] |
| 8 | +in 2020.[^1] |
| 9 | + |
| 10 | +Usage of HashiCorp's Raft library grew rapidly through [Consul][consul] |
| 11 | +and [Nomad][nomad], and [later Vault][vault], in parallel with rapidly |
| 12 | +expanding use of [etcd][etcd] and other implementations. |
| 13 | + |
| 14 | +The explosion in activity across live systems and research led to a wide |
| 15 | +divergence not only among implementations, but also between implementations |
| 16 | +and the original paper and dissertation. |
| 17 | + |
| 18 | +This document attempts to explain where HashiCorp Raft either meaningfully diverges |
| 19 | +from the original Raft paper, or makes an implementation choice not explicitly |
| 20 | +outlined in the paper. |
| 21 | + |
| 22 | +This is **not** expected to be a comprehensive list. Additions and edits are |
| 23 | +welcome! |
| 24 | + |
| 25 | +## Asynchronous Heartbeats |
| 26 | + |
| 27 | +The Raft paper defines heartbeats as empty AppendEntries RPCs which are sent by |
| 28 | +the leader to each server after elections and during idle periods to prevent |
| 29 | +election timeouts. |
| 30 | + |
| 31 | +HashiCorp Raft performs [heartbeating concurrently][async-heart] with other |
| 32 | +AppendEntries RPCs to avoid having to set the election timeout high enough to |
| 33 | +account for the maximum acceptable disk operation latency. This allows the |
| 34 | +heartbeat timeout to detect network partitions much more quickly without |
| 35 | +risking an election during periodic but ephemeral spikes in disk I/O latency. |
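
The effect of running heartbeats concurrently can be illustrated with a small, self-contained Go sketch. It is not HashiCorp Raft's code: `heartbeatsWhileAppendBlocked` is a hypothetical helper, the heartbeat is reduced to a counter, and the slow AppendEntries call is simulated with a sleep.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// heartbeatsWhileAppendBlocked counts how many heartbeats a dedicated
// goroutine manages to send while the replication path is stuck on one slow
// AppendEntries call (simulated by sleeping for appendLatency).
func heartbeatsWhileAppendBlocked(heartbeatEvery, appendLatency time.Duration) int64 {
	var sent int64
	stop := make(chan struct{})
	var wg sync.WaitGroup

	// Heartbeat goroutine: independent of the replication path below.
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(heartbeatEvery)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				atomic.AddInt64(&sent, 1) // stands in for an empty AppendEntries RPC
			case <-stop:
				return
			}
		}
	}()

	// Replication path: blocked on a slow call, for example a disk latency spike.
	time.Sleep(appendLatency)

	close(stop)
	wg.Wait()
	return atomic.LoadInt64(&sent)
}

func main() {
	n := heartbeatsWhileAppendBlocked(10*time.Millisecond, 200*time.Millisecond)
	fmt.Println("heartbeats sent while replication was blocked:", n)
}
```

Because the heartbeats keep flowing during the stall, the follower's timeout only needs to cover network delay, not the slowest tolerated disk write.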
| 36 | + |
| 37 | +## Rejecting votes when there's already a leader |
| 38 | + |
| 39 | +The [Raft does not Guarantee liveness][live] paper describes how certain |
| 40 | +partitions can prevent Raft clusters from making progress by causing continual |
| 41 | +elections. |
| 42 | + |
| 43 | +HashiCorp Raft implements the second of the suggested fixes from Howard and |
| 44 | +Abraham's paper: rejecting vote request RPCs when there is already an |
| 45 | +established leader. The paper defines this more precisely as: |
| 46 | + |
| 47 | +> ...ignore RequestVote RPCs if they have received an AppendEntries RPC from |
| 48 | +> the leader within the election timeout. |
| 49 | + |
| 50 | +This approach is actually mentioned in the Cluster membership changes section |
| 51 | +of the original Raft paper, but that paper explicitly excludes its use during |
| 52 | +"normal" elections: |
| 53 | + |
| 54 | +> To prevent this problem, servers disregard RequestVote RPCs when they believe |
| 55 | +> a current leader exists. Specifically, if a server receives a RequestVote RPC |
| 56 | +> within the minimum election timeout of hearing from a current leader, it does |
| 57 | +> not update its term or grant its vote. This does not affect normal |
| 58 | +> elections... |
| 59 | + |
| 60 | +So HashiCorp Raft follows the later paper's suggestion and ignores the original |
| 61 | +paper's exclusion of this logic during normal operation. |
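
As a rough illustration of the rule quoted above, the following Go sketch ignores a vote request when the follower has heard from a leader within the election timeout. The function name and parameters are hypothetical and follow the wording of the quote rather than HashiCorp Raft's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// shouldIgnoreVoteRequest applies the quoted rule: a RequestVote RPC is
// ignored if the follower knows a leader and received an AppendEntries RPC
// from it less than one election timeout ago.
func shouldIgnoreVoteRequest(leaderKnown bool, sinceLeaderContact, electionTimeout time.Duration) bool {
	return leaderKnown && sinceLeaderContact < electionTimeout
}

func main() {
	timeout := time.Second

	// Leader heard from 100ms ago: the vote request is ignored.
	fmt.Println(shouldIgnoreVoteRequest(true, 100*time.Millisecond, timeout))

	// Leader silent for 2s: the vote request is processed normally.
	fmt.Println(shouldIgnoreVoteRequest(true, 2*time.Second, timeout))
}
```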
| 62 | + |
| 63 | +## Pre-Vote |
| 64 | + |
| 65 | +[HashiCorp Raft implements the Pre-Vote extension][prevote-pr] defined in the |
| 66 | +[Raft dissertation][diss] (§9.6). Pre-Vote is an optimization where a candidate |
| 67 | +discovers whether its log is sufficiently up to date to win an election |
| 68 | +before incrementing its term and causing an election. |
| 69 | + |
| 70 | +The Pre-Vote extension is enabled by default but may be disabled using the |
| 71 | +[Config.PreVoteDisabled][prevote-config] flag. |
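
The idea behind Pre-Vote can be sketched in a few lines of Go. This is a simplified model, not the library's implementation: the `server` and `logPosition` types, `wouldGrantVote`, and `campaign` are all hypothetical, and the grant condition is reduced to a term check plus Raft's usual log comparison.

```go
package main

import "fmt"

// logPosition identifies the last entry in a server's log.
type logPosition struct {
	term  uint64
	index uint64
}

// atLeastAsUpToDate applies Raft's log comparison: the higher last term
// wins, and equal terms fall back to the longer log.
func atLeastAsUpToDate(candidate, voter logPosition) bool {
	if candidate.term != voter.term {
		return candidate.term > voter.term
	}
	return candidate.index >= voter.index
}

// server is a hypothetical, minimal model of a Raft server.
type server struct {
	currentTerm uint64
	last        logPosition
}

// wouldGrantVote answers a pre-vote request without changing voter state.
func (s *server) wouldGrantVote(proposedTerm uint64, candidateLast logPosition) bool {
	return proposedTerm >= s.currentTerm && atLeastAsUpToDate(candidateLast, s.last)
}

// campaign runs a pre-vote round and increments the candidate's term only
// if a majority (counting the candidate itself) would grant a vote.
func (s *server) campaign(peers []*server) bool {
	granted := 1 // the candidate's own vote
	for _, p := range peers {
		if p.wouldGrantVote(s.currentTerm+1, s.last) {
			granted++
		}
	}
	if granted <= (len(peers)+1)/2 {
		return false // term untouched, so no election is triggered
	}
	s.currentTerm++ // a real election would follow from here
	return true
}

func main() {
	peers := []*server{
		{currentTerm: 2, last: logPosition{term: 2, index: 10}},
		{currentTerm: 2, last: logPosition{term: 2, index: 10}},
	}
	stale := &server{currentTerm: 2, last: logPosition{term: 2, index: 5}}
	fmt.Println(stale.campaign(peers), stale.currentTerm)
}
```

A candidate with a stale log loses the pre-vote round and leaves its term unchanged, so it cannot disrupt the cluster by forcing other servers into a higher term.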
| 72 | + |
| 73 | +## Leadership Transfer |
| 74 | + |
| 75 | +[HashiCorp Raft implements the Leadership Transfer extension][transleader-pr] |
| 76 | +as defined in the [Raft dissertation][diss] (§3.10). Leadership transfer is an |
| 77 | +optimization that allows the current leader to hand off leadership to a |
| 78 | +follower to avoid waiting for the election timeout during regular operations |
| 79 | +such as restarts and upgrades. |
| 80 | + |
| 81 | +While leadership transfer is defined in the Raft dissertation, HashiCorp Raft |
| 82 | +extends the specification slightly because of _another_ divergence in HashiCorp |
| 83 | +Raft: [rejecting votes when there's already a |
| 84 | +leader](#rejecting-votes-when-theres-already-a-leader). Since other followers |
| 85 | +would reject the intended new leader's request for a vote, HashiCorp Raft adds |
| 86 | +an extra [`LeadershipTransfer` flag][transleader-flag] to override that |
| 87 | +behavior in the case of leadership transfers. |
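
The interaction between the two behaviors can be sketched as follows. The `voteRequest` type and `rejectsForKnownLeader` function are hypothetical simplifications for illustration. Only the existence of a `LeadershipTransfer` flag on the vote request is taken from the text above.

```go
package main

import "fmt"

// voteRequest is a hypothetical, trimmed-down vote request.
type voteRequest struct {
	candidate          string
	leadershipTransfer bool // set when the current leader asked the candidate to take over
}

// rejectsForKnownLeader reports whether a follower that currently knows
// knownLeader (empty if none) refuses the request outright. The transfer
// flag overrides the "already have a leader" rejection.
func rejectsForKnownLeader(knownLeader string, req voteRequest) bool {
	return knownLeader != "" && !req.leadershipTransfer
}

func main() {
	ordinary := voteRequest{candidate: "node-b"}
	transfer := voteRequest{candidate: "node-b", leadershipTransfer: true}

	fmt.Println(rejectsForKnownLeader("node-a", ordinary)) // rejected: a leader is known
	fmt.Println(rejectsForKnownLeader("node-a", transfer)) // considered: transfer in progress
}
```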
| 88 | + |
| 89 | +All Raft members should support leadership transfers before a transfer is |
| 90 | +attempted. The feature is **not** enabled by default and must be triggered |
| 91 | +explicitly at the application level. Consul was the first to implement this via |
| 92 | +mechanisms in its [API/CLI][transleader-cli] and [graceful agent |
| 93 | +shutdown][transleader-shutdown]. |
| 94 | + |
| 95 | +[^1]: See https://raft.github.io/ for a comprehensive list of papers and |
| 96 | + resources. |
| 97 | + |
| 98 | +[paper]: https://raft.github.io/raft.pdf |
| 99 | +[diss]: https://github.com/ongardie/dissertation#readme |
| 100 | +[live]: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/ |
| 101 | +[consul]: https://github.com/hashicorp/consul |
| 102 | +[nomad]: https://github.com/hashicorp/nomad |
| 103 | +[vault]: https://github.com/hashicorp/vault |
| 104 | +[etcd]: https://etcd.io/ |
| 105 | +[async-heart]: https://github.com/hashicorp/raft/blob/v1.7.3/replication.go#L385-L387 |
| 106 | +[prevote-pr]: https://github.com/hashicorp/raft/pull/530 |
| 107 | +[prevote-config]: https://pkg.go.dev/github.com/hashicorp/raft#Config.PreVoteDisabled |
| 108 | +[transleader-pr]: https://github.com/hashicorp/raft/pull/306 |
| 109 | +[transleader-flag]: https://pkg.go.dev/github.com/hashicorp/raft#RequestVoteRequest.LeadershipTransfer |
| 110 | +[transleader-cli]: https://github.com/hashicorp/consul/issues/5405 |
| 111 | +[transleader-shutdown]: https://github.com/hashicorp/consul/issues/5406 |