Raft is a distributed consensus protocol which replicates data across a
cluster of nodes in a consistent and durable manner. It is described in the very readable
Raft paper, and in the more comprehensive
Raft thesis.
The toyDB Raft implementation is in the raft
module, and is described in the module documentation:
//! 1 | 1 | CREATE TABLE table (id INT PRIMARY KEY, value STRING)
//! 2 | 1 | INSERT INTO table VALUES (1, 'foo')
//! 3 | 2 | UPDATE table SET value = 'bar' WHERE id = 1
//! 4 | 2 | DELETE FROM table WHERE id = 1
//!
//! The state machine must be deterministic, such that all nodes reach the
//! same state. Raft will apply the same commands in the same order
//! independently on all nodes, but if the commands have non-deterministic
//! behavior such as random number generation or communication with external
//! systems, node states can diverge and produce different results.
//!
//! In toyDB, the Raft log is managed by `Log` and stored locally in a
//! `storage::Engine`. The state machine interface is the `State` trait. See
//! their documentation for more details.
//!
//! LEADER ELECTION
//! ===============
//!
//! Raft nodes can be in one of three states (or roles): follower, candidate,
//! and leader. toyDB models these as `Node::Follower`, `Node::Candidate`, and
//! `Node::Leader`.
//!
//! * Follower: replicates log entries from a leader. May not know a leader yet.
//! * Candidate: campaigns for leadership in an election.
//! * Leader: processes client requests and replicates writes to followers.
//!
//! Raft fundamentally relies on a single guarantee: there can be at most one
//! _valid_ leader at any point in time (old, since-replaced leaders may think
//! they're still a leader, e.g. during a network partition, but they won't be
//! able to do anything). It enforces this through the leader election protocol.
//!
//! Raft divides time into terms, which are monotonically increasing numbers.
//! Higher terms always take priority over lower terms. There can be at most one
//! leader in a term, and it can't change. Nodes keep track of their last known
//! term and store it on disk (see `Log.set_term()`). Messages between nodes are
//! tagged with the current term (as `Envelope.term`) -- old terms are ignored,
//! and future terms cause the node to become a follower in that term.
//!
//! Nodes start out as leaderless followers. If they receive a message from a
//! leader (in a current or future term), they follow it. Otherwise, they wait
//! out the election timeout (a few seconds), become candidates, and hold a
//! leader election.
//!
//! Candidates increase their term by 1 and send `Message::Campaign` to all
//! nodes, requesting their vote. Nodes respond with `Message::CampaignResponse`
//! saying whether a vote was granted. A node can only grant a single vote in a
//! term (stored to disk via `Log.set_term()`), on a first-come first-serve
//! basis, and candidates implicitly vote for themselves.
//!
//! When a candidate receives a majority of votes (>50%), it becomes leader. It
//! sends a `Message::Heartbeat` to all nodes asserting its leadership, and all
//! nodes become followers when they receive it (regardless of who they voted
//! for). Leaders continue to send periodic heartbeats every second or so. The
//! new leader also appends an empty entry to its log in order to safely commit
//! all entries from previous terms (Raft paper section 5.4.2).
//!
//! The new leader must have all committed entries in its log (or the cluster
//! would lose data). To ensure this, there is one additional condition for
//! granting a vote: the candidate's log must be at least as up-to-date as the
//! voter. Because an entry must be replicated to a majority before being
//! committed, this ensures a candidate can only win a majority of votes if its
//! log is up-to-date with all committed entries (Raft paper section 5.4.1).
//!
//! It's possible that no candidate wins an election, for example due to a tie
//! or a majority of nodes being offline. After an election timeout passes,
//! candidates will again bump their term and start a new election, until a
//! leader can be established. To avoid frequent ties, nodes use different,
//! randomized election timeouts (Raft paper section 5.2).
//!
//! Similarly, if a follower doesn't hear from a leader in an election timeout
//! interval, it will become candidate and hold another election. The periodic
//! leader heartbeats prevent this as long as the leader is running and
//! connected. A node that becomes disconnected from the leader will continually
//! hold new elections by itself until the network heals, at which point a new
//! election will be held in its term (disrupting the current leader).
//!
//! REPLICATION AND CONSENSUS
//! =========================
//!
//! When the leader receives a client write request, it appends the command to
//! its local log via `Log.append()`, and sends the log entry to all peers in
//! a `Message::Append`. Followers will attempt to durably append the entry to
//! their local logs and respond with `Message::AppendResponse`.
//!
//! Once a majority have acknowledged the append, the leader commits the entry
//! via `Log.commit()` and applies it to its local state machine, returning the
//! result to the client. It will inform followers about the commit in the next
//! heartbeat as `Message::Heartbeat.commit_index` so they can apply it too, but
//! this is not necessary for correctness (they will commit and apply it if they
//! become leader, otherwise they have no need for applying it).
//!
//! Followers may not be able to append the entry to their log -- they may be
//! unreachable, lag behind the leader, or have divergent logs (see Raft paper
//! section 5.3). The `Append` contains the index and term of the log entry
//! immediately before the replicated entry as `base_index` and `base_term`. An
//! index/term pair uniquely identifies a command, and if two logs have the same
//! index/term pair then the logs are identical up to and including that entry
//! (Raft paper section 5.3). If the base index/term matches the follower's log,
//! it appends the entry (potentially replacing any conflicting entries),
//! otherwise it rejects it.
//!
//! When a follower rejects an append, the leader must try to find a common log
//! entry that exists in both its and the follower's log where it can resume
//! replication. It does this by sending `Message::Append` probes only
//! containing a base index/term but no entries -- it will continue to probe
//! decreasing indexes one by one until the follower responds with a match, then
//! send an `Append` with the missing entries (Raft paper section 5.3). It keeps
//! track of each follower's `match_index` and `next_index` in a `Progress`
//! struct to manage this.
//!
//! In case `Append` messages or responses are lost, leaders also send their
//! `last_index` and term in each `Heartbeat`. If followers don't have that
//! index/term pair in their log, they'll say so in the `HeartbeatResponse` and
//! the leader can begin probing their logs as with append rejections.
//!
//! CLIENT REQUESTS
//! ===============
//!
//! Client requests are submitted as `Message::ClientRequest` to the local Raft
//! node. They are only processed on the leader, but followers will proxy them
//! to the leader (Raft thesis section 6.2). To avoid complications with message
//! replays (Raft thesis section 6.3), requests are not retried internally, and
//! are explicitly aborted with `Error::Abort` on leader/term changes as well as
//! elections.
//!
//! Write requests, `Request::Write`, are appended to the Raft log and
//! replicated. The leader keeps track of the request and its log index in a
//! `Write` struct. Once the command is committed and applied to the local state
//! machine, the leader looks up the write request by its log index and sends
//! the result to the client. Deterministic errors (e.g. foreign key violations)
//! are also returned to the client, but non-deterministic errors (e.g. IO
//! errors) must panic the node to avoid state divergence.
//!
//! Read requests, `Request::Read`, are only executed on the leader and don't
//! need to be replicated via the Raft log. However, to ensure linearizability,
//! the leader has to confirm with a quorum that it's actually still the leader.
//! Otherwise, it's possible that a new leader has been elected elsewhere and
//! executed writes without us knowing about it. It does this by assigning an
//! incrementing sequence number to each read, keeping track of the request in a
//! `Read` struct, and immediately sending a `Read` message with the latest
//! sequence number. Followers respond with the sequence number, and once a
//! quorum have confirmed a sequence number the read is executed and the result
//! returned to the client.
//!
//! IMPLEMENTATION CAVEATS
//! ======================
//!
//! For simplicity, toyDB implements the bare minimum for a functional and
//! correct Raft protocol, and omits several advanced mechanisms that would be
//! needed for a real production system. In particular:
//!
//! * No leases: for linearizability, every read request requires the leader to
//! confirm with followers that it's still the leader. This could be avoided
//! with a leader lease for a predefined time interval (Raft paper section 8,
//! Raft thesis section 6.3).
//!
//! * No cluster membership changes: to add or remove nodes, the entire cluster
//! must be stopped and restarted with the new configuration, otherwise it
//! risks multiple leaders (Raft paper section 6).
//!
//! * No snapshots: new or lagging nodes must be caught up by replicating and
//! replaying the entire log, instead of sending a state machine snapshot
//! (Raft paper section 7).
//!
//! * No log truncation: because snapshots aren't supported, the entire Raft
//! log must be retained forever in order to catch up new/lagging nodes,
//! leading to excessive storage use (Raft paper section 7).
//!
//! * No pre-vote or check-quorum: a node that's partially partitioned (can
//! reach some but not all nodes) can cause persistent unavailability with
//! spurious elections or heartbeats. A node rejoining after a partition can
//! also temporarily disrupt a leader. This requires additional pre-vote and
//! check-quorum protocol extensions (Raft thesis section 4.2.3 and 9.6).
//!
//! * No request retries: client requests will not be retried on leader changes
//! or message loss, and will be aggressively aborted, to ignore problems
//! related to message replay (Raft thesis section 6.3).
//!
//! * No reject hints: if a follower has a divergent log, the leader will probe
//! entries one by one until a match is found. The replication protocol could
//! instead be extended with rejection hints (Raft paper section 5.3).
Raft is fundamentally the same protocol as Paxos
and Viewstamped Replication, but an
opinionated variant designed to be simple, understandable, and practical. It is widely used in the
industry: CockroachDB, TiDB,
etcd, Consul, and many others.
Briefly, Raft elects a leader node which coordinates writes and replicates them to followers. Once a
majority (>50%) of nodes have acknowledged a write, it is considered durably committed. It is common
for the leader to also serve reads, since it always has the most recent data and is thus strongly
consistent.
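The majority threshold is simple integer arithmetic. As a sketch (a hypothetical `quorum_size` helper, not toyDB's actual code):

```rust
/// The number of nodes that constitutes a strict majority (quorum) of an
/// `n`-node cluster. Illustrative helper, not toyDB's actual implementation.
fn quorum_size(n: usize) -> usize {
    n / 2 + 1
}
```

For a 5-node cluster this yields 3, so the cluster can tolerate two node failures and remain available.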
A cluster must have a majority of nodes (known as a quorum)
live and connected to remain available, otherwise it will not commit writes in order to guarantee
data consistency and durability. Since there can only be one majority in the cluster, this prevents
a split brain scenario where two active
leaders can exist concurrently (e.g. during a network partition)
and store conflicting values.
The Raft leader appends writes to an ordered command log, which is then replicated to followers.
Once a majority has replicated the log up to a given entry, that log prefix is committed and then
applied to a state machine. This ensures that all nodes will apply the same commands in the same
order and eventually reach the same state (assuming the commands are deterministic). Raft itself
doesn't care what the state machine and commands are, but in toyDB's case it's SQL tables and rows
stored in an MVCC key/value store.
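To illustrate the apply loop, here is a hypothetical toy state machine (all names in this sketch are made up, not toyDB's) that applies `key=value` commands from a committed log, strictly in order:

```rust
use std::collections::HashMap;

/// A toy deterministic state machine: applies "key=value" commands in log
/// order. Illustrative only -- toyDB's real state machine is an MVCC store.
struct ToyState {
    applied_index: u64,
    data: HashMap<String, String>,
}

impl ToyState {
    fn new() -> Self {
        ToyState { applied_index: 0, data: HashMap::new() }
    }

    /// Applies the committed command at the given log index. Commands must
    /// be applied sequentially, so the index must follow the applied index.
    fn apply(&mut self, index: u64, command: &str) {
        assert_eq!(index, self.applied_index + 1, "entries must apply in order");
        if let Some((key, value)) = command.split_once('=') {
            self.data.insert(key.to_string(), value.to_string());
        }
        self.applied_index = index;
    }
}
```

Because every node applies the same commands in the same order, every node's `data` map converges to the same contents.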
This diagram from the Raft paper illustrates how a Raft node receives a command from a client (1),
adds it to its log and reaches consensus with other nodes (2), then applies it to its state machine
(3) before returning a result to the client (4):
You may notice that Raft is not very scalable, since all reads and writes go via the leader node,
and every node must store the entire dataset. Raft solves replication and availability, but not
scalability. Real-world systems typically provide horizontal scalability by splitting a large
dataset across many separate Raft clusters (i.e. sharding), but this is out of scope for toyDB.
For simplicity, toyDB implements the bare minimum of Raft, and omits optimizations described in
the paper such as state snapshots, log truncation, leader leases, and more. The implementation is
in the raft
module, and we'll walk through the main components next.
There is a comprehensive set of Raft test scripts in src/raft/testscripts/node,
which illustrate the protocol in a wide variety of scenarios.
Log Storage
Raft replicates an ordered command log consisting of raft::Entry:
pub struct Entry {
    /// We could omit the index in the encoded value, since it's also stored in
    /// the key, but we keep it simple.
    pub index: Index,
    /// The term in which the entry was added.
    pub term: Term,
    /// The state machine command. None (noop) commands are used during leader
    /// election to commit old entries, see section 5.4.2 in the Raft paper.
    pub command: Option<Vec<u8>>,
}

impl encoding::Value for Entry {}
index specifies the position in the log, and command contains the binary command to apply to the
state machine. The term identifies the leadership term in which the command was proposed: a new
term begins when a new leader election is held (we'll get back to this later).
Entries are appended to the log by the leader and replicated to followers. Once acknowledged by a
quorum, the log up to that index is committed and will never change. Entries that are not yet
committed may be replaced or removed if the leader changes.
Entries can also be appended in bulk via Log::splice, typically when entries are replicated to
followers. This also allows replacing existing uncommitted entries, e.g. after a leader change:
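As an illustrative sketch (not toyDB's actual `Log::splice`), splicing can be modeled on an in-memory log of `(index, term)` pairs, assuming contiguous 1-based indexes:

```rust
/// Splices entries into an in-memory log of (index, term) pairs. Entries at
/// or after the first spliced index are replaced, as happens to uncommitted
/// entries after a leader change. Illustrative sketch: assumes contiguous
/// 1-based indexes and does not guard committed entries.
fn splice(log: &mut Vec<(u64, u64)>, entries: Vec<(u64, u64)>) {
    if let Some(&(first_index, _)) = entries.first() {
        // Drop any existing entries at or after the splice point, then append.
        log.truncate(first_index as usize - 1);
        log.extend(entries);
    }
}
```

For example, splicing entries 2-3 at term 2 into a log of entries 1-3 at term 1 replaces the uncommitted term-1 tail.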
Raft doesn't know or care what the log commands are, nor what the state machine does with them. It
simply takes raft::Entry values from the log and hands them to the state machine.
The Raft state machine is represented by the raft::State trait. Raft will ask about the last
applied entry via State::get_applied_index, and feed it newly committed entries via
State::apply. It also allows reads via State::read, but we'll get back to that later.
/// A Raft-managed state machine. Raft itself does not care what the state
/// machine is, nor what the commands and results do -- it will simply apply
/// arbitrary binary commands sequentially from the Raft log, returning an
/// arbitrary binary result to the client.
///
/// Since commands are applied identically across all nodes, they must be
/// deterministic and yield the same state and result across all nodes too.
/// Otherwise, the nodes will diverge, such that different nodes will produce
/// different results.
///
/// Write commands (`Request::Write`) are replicated and applied on all nodes
/// via `State::apply`. The state machine must keep track of the last applied
/// index and return it via `State::get_applied_index`. Read commands
/// (`Request::Read`) are only executed on a single node via `State::read` and
/// must not make any state changes.
pub trait State: Send {
    /// Returns the last applied log index from the state machine.
    ///
    /// This must correspond to the current state of the state machine, since it
    /// determines which command to apply next. In particular, a node crash may
    /// result in partial command application or data loss, which must be
    /// handled appropriately.
    fn get_applied_index(&self) -> Index;

    /// Applies a log entry to the state machine, returning a client result.
    /// Errors are considered applied and propagated back to the client.
    ///
    /// This is executed on all nodes, so the result must be deterministic: it
    /// must yield the same state and result on all nodes, even if the command
    /// is reapplied following a node crash.
    ///
    /// Any non-deterministic apply error (e.g. an IO error) must panic and
    /// crash the node -- if it instead returns an error to the client, the
    /// command is considered applied and node states will diverge. The state
    /// machine is responsible for panicking when appropriate.
    ///
    /// The entry may contain a noop command, which is committed by Raft during
    /// leader changes. This still needs to be applied to the state machine to
    /// properly update the applied index, and should return an empty result.
    fn apply(&mut self, entry: Entry) -> Result<Vec<u8>>;

    /// Executes a read command in the state machine, returning a client result.
    /// Errors are also propagated back to the client.
    ///
    /// This is only executed on a single node, so it must not result in any
    /// state changes (i.e. it must not write).
    fn read(&self, command: Vec<u8>) -> Result<Vec<u8>>;
}
The state machine does not have to flush its state to durable storage after each transition; on node
crashes, the state machine is allowed to regress, and will be caught up by replaying the unapplied
log entries. It is also possible to implement a purely in-memory state machine (and in fact, toyDB
allows running the state machine with a Memory storage engine).
The state machine must take care to be deterministic: the same commands applied in the same order
must result in the same state across all nodes. This means that a command can't e.g. read the
current time or generate a random number -- these values must be included in the command. It also
means that non-deterministic errors, such as an IO error, must halt command application (in toyDB's
case, we just panic and crash the node).
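For example, a command can carry its own timestamp, sampled once at proposal time, so that `apply` never has to consult the local clock (hypothetical types, for illustration only):

```rust
/// A hypothetical write command that embeds its timestamp. The proposer
/// samples the clock once; every node then applies the same value, keeping
/// application deterministic.
struct SetWithTimestamp {
    key: String,
    value: String,
    timestamp_ms: u64, // sampled at proposal time, replicated verbatim
}

/// Applies the command deterministically: it uses the replicated timestamp,
/// never the local clock.
fn apply_set(cmd: &SetWithTimestamp) -> String {
    format!("{}={} @ {}", cmd.key, cmd.value, cmd.timestamp_ms)
}
```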
In toyDB, the state machine is an MVCC key/value store that stores SQL tables and rows, as we'll
see in the SQL Raft replication section.
Node Roles
In Raft, a node has one of three roles:
Leader: replicates writes to followers and serves client requests.
Follower: replicates writes from a leader.
Candidate: campaigns for leadership.
The Raft paper summarizes these roles and transitions in the following diagram (we'll discuss
leader election in detail below):
In toyDB, a node is represented by the raft::Node enum, with variants for each state:
/// A Raft node with a dynamic role. This implements the Raft distributed
/// consensus protocol, see the `raft` module documentation for more info.
///
/// The node is driven synchronously by processing inbound messages via `step()`
/// and by advancing time via `tick()`. These methods consume the node and
/// return a new one with a possibly different role. Outbound messages are sent
/// via the given `tx` channel, and must be delivered to peers or clients.
///
/// This enum is the public interface to the node, with a closed set of roles.
/// It wraps the `RawNode<Role>` types, which implement the actual node logic.
/// The enum allows ergonomic use across role transitions since it can represent
/// all roles, e.g.: `node = node.step()?`.
pub enum Node {
    /// A candidate campaigns for leadership.
    Candidate(RawNode<Candidate>),
    /// A follower replicates entries from a leader.
    Follower(RawNode<Follower>),
    /// A leader processes client requests and replicates entries to followers.
    Leader(RawNode<Leader>),
}
This wraps the raft::RawNode<Role> type which contains the inner node state. It is generic over
the role, and uses the typestate pattern to provide
methods and transitions depending on the node's current role. This enforces state transitions and
invariants at compile time via Rust's type system -- for example, only RawNode<Candidate> has an
into_leader() method, since only candidates can transition to leaders (when they win an election).
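A minimal sketch of the typestate pattern (heavily simplified, not toyDB's actual RawNode) shows how the type system restricts role transitions:

```rust
// Zero-sized role marker types (illustrative sketch).
struct Follower;
struct Candidate;
struct Leader;

/// A node generic over its role. Only the methods defined for the current
/// role are available, so invalid transitions don't compile.
struct RawNode<Role> {
    id: u8,
    term: u64,
    role: Role,
}

impl RawNode<Follower> {
    /// A follower becomes a candidate by bumping its term and campaigning.
    fn into_candidate(self) -> RawNode<Candidate> {
        RawNode { id: self.id, term: self.term + 1, role: Candidate }
    }
}

impl RawNode<Candidate> {
    /// Only candidates can become leaders (on winning an election).
    fn into_leader(self) -> RawNode<Leader> {
        RawNode { id: self.id, term: self.term, role: Leader }
    }
}
```

Calling `into_leader()` on a `RawNode<Follower>` is a compile error, since the method only exists on `RawNode<Candidate>`.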
/// A candidate is campaigning to become a leader.
/// A candidate is campaigning to become a leader.
pub struct Candidate {
    /// Votes received (including our own).
    votes: HashSet<NodeID>,
    /// Ticks elapsed since election start.
    election_duration: Ticks,
    /// Election timeout, in ticks.
    election_timeout: Ticks,
}
We'll see what the various fields are used for in the following sections.
Node Interface and Communication
The raft::Node enum has two main methods that drive the node: tick() and step(). These consume
the current node and return a new node, possibly with a different role.
tick() advances time by a logical tick. This is used to measure the passage of time, e.g. to
trigger election timeouts or periodic leader heartbeats. toyDB uses a tick interval of 100
milliseconds (see raft::TICK_INTERVAL), and will call tick() on the node at this rate.
The envelope contains a raft::Message, an enum which encodes the Raft message protocol. We won't
dwell on the specific message types here, but discuss them individually in the following sections.
Raft does not require reliable message delivery, so messages may be dropped or reordered at any
time, although toyDB's use of TCP provides stronger delivery guarantees.
pub enum Message {
    /// Candidates campaign for leadership by soliciting votes from peers.
    /// Votes will only be granted if the candidate's log is at least as
    /// up-to-date as the voter.
    Campaign {
        /// The index of the candidate's last log entry.
        last_index: Index,
        /// The term of the candidate's last log entry.
        last_term: Term,
    },
    /// Followers may vote for a single candidate per term, but only if the
    /// candidate's log is at least as up-to-date as the follower. Candidates
    /// implicitly vote for themselves.
    CampaignResponse {
        /// If true, the follower granted the candidate a vote. A false response
        /// isn't necessary, but is emitted for clarity.
        vote: bool,
    },
    /// Leaders send periodic heartbeats. This serves several purposes:
    ///
    /// * Inform nodes about the leader, and prevent elections.
    /// * Detect lost appends and reads, as a retry mechanism.
    /// * Advance followers' commit indexes, so they can apply entries.
    ///
    /// The Raft paper does not have a distinct heartbeat message, and instead
    /// uses an empty AppendEntries RPC, but we choose to add one for better
    /// separation of concerns.
    Heartbeat {
        /// The index of the leader's last log entry. The term is the leader's
        /// current term, since it appends a noop entry on election win. The
        /// follower compares this to its own log to determine if it's
        /// up-to-date.
        last_index: Index,
        /// The index of the leader's last committed log entry. Followers use
        /// this to advance their commit index and apply entries. It's only safe
        /// to commit this if the local log matches last_index, such that the
        /// follower's log is identical to the leader at the commit index.
        commit_index: Index,
        /// The leader's latest read sequence number in this term.
        read_seq: ReadSequence,
    },
    /// Followers respond to leader heartbeats if they still consider it leader.
    HeartbeatResponse {
        /// If non-zero, the heartbeat's last_index which was matched in the
        /// follower's log. Otherwise, the follower is either divergent or
        /// lagging behind the leader.
        match_index: Index,
        /// The heartbeat's read sequence number.
        read_seq: ReadSequence,
    },
    /// Leaders replicate log entries to followers by appending to their logs
    /// after the given base entry.
    ///
    /// If the base entry matches the follower's log then their logs are
    /// identical up to it (see section 5.3 in the Raft paper), and the entries
    /// can be appended -- possibly replacing conflicting entries. Otherwise,
    /// the append is rejected and the leader must retry an earlier base index
    /// until a common base is found.
    ///
    /// Empty append messages (no entries) are used to probe follower logs for
    /// a common match index in the case of divergent logs, restarted nodes, or
    /// dropped messages. This is typically done by sending probes with a
    /// decrementing base index until a match is found, at which point the
    /// subsequent entries can be sent.
    Append {
        /// The index of the log entry to append after.
        base_index: Index,
        /// The term of the base entry.
        base_term: Term,
        /// Log entries to append. Must start at base_index + 1.
        entries: Vec<Entry>,
    },
    /// Followers accept or reject appends from the leader depending on whether
    /// the base entry matches their log.
    AppendResponse {
        /// If non-zero, the follower appended entries up to this index. The
        /// entire log up to this index is consistent with the leader. If no
        /// entries were sent (a probe), this will be the matching base index.
        match_index: Index,
        /// If non-zero, the follower rejected an append at this base index
        /// because the base index/term did not match its log. If the follower's
        /// log is shorter than the base index, the reject index will be lowered
        /// to the index after its last local index, to avoid probing each
        /// missing index.
        reject_index: Index,
    },
    /// Leaders need to confirm they are still the leader before serving reads,
    /// to guarantee linearizability in case a different leader has been
    /// established elsewhere. Read requests are served once the sequence number
    /// has been confirmed by a quorum.
    Read { seq: ReadSequence },
    /// Followers confirm leadership at the read sequence numbers.
    ReadResponse { seq: ReadSequence },
    /// A client request. This can be submitted to the leader, or to a follower
    /// which will forward it to its leader. If there is no leader, or the
    /// leader or term changes, the request is aborted with an Error::Abort
    /// ClientResponse and the client must retry.
    ClientRequest {
        /// The request ID. Must be globally unique for the request duration.
        id: RequestID,
        /// The request itself.
        request: Request,
    },
    /// A client response.
    ClientResponse {
        /// The ID of the original ClientRequest.
        id: RequestID,
        /// The response, or an error.
        response: Result<Response>,
    },
}
This is an entirely synchronous and deterministic model -- the same sequence of calls on a given
node in a given initial state will always produce the same result. This is very convenient for
testing and understandability. We will see in the server section how toyDB drives the node on a
separate thread, provides a network transport for messages, and ticks it at regular intervals.
Leader Election and Terms
In the steady state, Raft simply has a leader which replicates writes to followers. But to reach
this steady state, we must elect a leader, which is where much of the subtle complexity lies. See
the Raft paper for comprehensive details and safety arguments; we'll summarize it briefly below.
Raft divides time into terms. The term is a monotonically increasing number starting at 1. There
can only be one leader in a term (or none if an election fails), and the term can never regress.
Replicated commands belong to the specific term under which they were proposed.
/// A leader term number. Increases monotonically on elections.
pub type Term = u64;
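The term rules can be sketched as a standalone comparison (illustrative, not toyDB's actual message handling): messages from lower terms are ignored, messages from higher terms make the node a follower in that term, and same-term messages are processed normally.

```rust
use std::cmp::Ordering;

/// What to do with a message given its term (illustrative sketch).
enum Action {
    Ignore,         // stale term: drop the message
    Process,        // current term: handle per role
    BecomeFollower, // future term: yield and follow
}

fn check_term(local_term: u64, msg_term: u64) -> Action {
    match msg_term.cmp(&local_term) {
        Ordering::Less => Action::Ignore,
        Ordering::Equal => Action::Process,
        Ordering::Greater => Action::BecomeFollower,
    }
}
```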
Let's walk through an election, where we bootstrap a brand new, empty toyDB cluster with 3 nodes.
Nodes are initialized by calling Node::new(). Since this is a new cluster, they are given an empty
raft::Log and raft::State, at term 0. Nodes start with role Follower, but without a leader.
// Apply any pending entries following restart. State machine writes are
// not flushed to durable storage, so a tail of writes may be lost if
// the host crashes or restarts. The Raft log is durable, so we can
// always recover the state from it. We reapply any missing entries here
// if that should happen.
node.maybe_apply()?;
Ok(node)
}
Now, nothing really happens for a while, as the nodes are waiting to maybe hear from an existing
leader (there is none). Every 100 ms we call tick(), until we reach election_timeout:
Notice how new() sets election_timeout to a random value (in the range ELECTION_TIMEOUT_RANGE
of 10-20 ticks, i.e. 1-2 seconds). If all nodes had the same timeout, they would likely campaign for
leadership simultaneously, resulting in an election tie -- Raft uses randomized election timeouts to
avoid such ties.
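A sketch of picking such a timeout, assuming the 10-20 tick range above (a real implementation would use a proper RNG such as the rand crate; the `seed` parameter here is a dependency-free stand-in):

```rust
/// Election timeout range in ticks (10-20 ticks at 100 ms/tick is 1-2 s).
const ELECTION_TIMEOUT_RANGE: std::ops::Range<u64> = 10..20;

/// Picks a pseudo-random timeout in the range. Illustrative sketch: `seed`
/// stands in for a real random source.
fn election_timeout(seed: u64) -> u64 {
    let span = ELECTION_TIMEOUT_RANGE.end - ELECTION_TIMEOUT_RANGE.start;
    ELECTION_TIMEOUT_RANGE.start + seed % span
}
```

Because each node draws a different timeout, one node usually times out first, campaigns alone, and wins before the others start competing elections.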
Once a node reaches election_timeout it transitions to role Candidate:
assert!(node.role.votes.contains(&node.id), "candidate did not vote for self");
assert_ne!(term, 0, "candidate can't have term 0");
assert_eq!(vote, Some(node.id), "log vote does not match self");

Ok(node)
}
When it becomes a candidate it campaigns for leadership by increasing its term to 1, voting for
itself, and sending Message::Campaign to all peers asking for their vote:
In Raft, the term can't regress, and a node can only cast a single vote in each term (even across
restarts), so both of these are persisted to disk via Log::set_term_vote().
When the two other nodes (still in state Follower) receive the Message::Campaign asking for a
vote, they will first increase their term to 1 (since this is a newer term than their local term 0):
They then grant the vote since they haven't yet voted for anyone else in term 1. They persist the
vote to disk via Log::set_term_vote() and return a Message::CampaignResponse { vote: true } to
the candidate:
They also check that the candidate's log is at least as up-to-date as their own, which is trivially
true in this case since all logs are empty. This is necessary to ensure that a leader has all
committed entries (see section 5.4.1 in the Raft paper).
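The up-to-date rule from section 5.4.1 compares the last entry's term first, then its index, which is exactly lexicographic tuple comparison (illustrative sketch, not toyDB's actual code):

```rust
/// Returns true if the candidate's log is at least as up-to-date as the
/// voter's. Both arguments are (last_term, last_index) pairs; Rust's
/// lexicographic tuple ordering matches the Raft paper's rule: a higher
/// last term wins, and on equal terms the longer log wins.
fn log_up_to_date(candidate: (u64, u64), voter: (u64, u64)) -> bool {
    candidate >= voter
}
```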
When the candidate receives the Message::CampaignResponse it records the vote from each node. Once
it has a quorum (in this case 2 out of 3 votes including its own vote) it becomes leader in term 1:
// If we received a vote, record it. If the vote gives us quorum,
// assume leadership.
Message::CampaignResponse { vote: true } => {
    self.role.votes.insert(msg.from);
    if self.role.votes.len() >= self.quorum_size() {
        return Ok(self.into_leader()?.into());
    }
}
When it becomes leader, it sends a Message::Heartbeat to all peers to tell them it is now the
leader in term 1. It also appends an empty entry to its log and replicates it, but we will ignore
this for now (see section 5.4.2 in the Raft paper for why).
The followers record when they last received any message from the leader (including heartbeats), and
will hold a new election if they haven't heard from the leader in an election timeout (e.g. due to a
leader crash or network partition):
This entire process is illustrated in the test script election,
along with several other test scripts that show e.g. election ties,
contested elections,
and other scenarios:
Once a leader has been elected, we can submit read and write requests to it. This is done by
stepping a Message::ClientRequest into the node using the local node ID, with a unique request ID
(toyDB uses UUIDv4), and waiting for an outbound response message with the same ID:
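A sketch of this bookkeeping (hypothetical types; toyDB uses UUIDv4 request IDs, while u64 is used here to stay dependency-free): in-flight requests are tracked by ID and matched against incoming responses.

```rust
use std::collections::HashMap;

/// Tracks in-flight client requests by ID so responses can be matched back
/// to them. Illustrative sketch, not toyDB's actual implementation.
struct Pending {
    requests: HashMap<u64, String>, // request ID -> request payload
}

impl Pending {
    fn new() -> Self {
        Pending { requests: HashMap::new() }
    }

    /// Registers an in-flight request under its unique ID.
    fn submit(&mut self, id: u64, request: &str) {
        self.requests.insert(id, request.to_string());
    }

    /// Called when a response with this ID arrives: returns the original
    /// request, or None if it's unknown (e.g. already completed or aborted).
    fn complete(&mut self, id: u64) -> Option<String> {
        self.requests.remove(&id)
    }
}
```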
The requests and responses themselves are arbitrary binary data which is interpreted by the state
machine. For our purposes here, let's pretend the requests are:
The fundamental difference between read and write requests is that write requests are replicated
through Raft and executed on all nodes, while read requests are only executed on the leader without
being appended to the log. It would be possible to execute reads on followers too, for load
balancing, but these reads would be eventually consistent and thus violate linearizability, so toyDB
only executes reads on the leader.
If a request is submitted to a follower, it will be forwarded to the leader and the response
forwarded back to the client (distinguished by the sender/recipient node ID -- a local client always
uses the local node ID):
For simplicity, we cancel the request with Error::Abort if a request is submitted to a candidate,
and similarly if a follower changes its role to candidate or discovers a new leader. We could have
held on to these and redirected them to a new leader, but we keep it simple and ask the client to
retry.
We'll look at the actual read and write request processing next.
Write Replication and Application
When the leader receives a write request, it proposes the command for replication to followers. It
keeps track of the in-flight write and its log entry index in writes, such that it can respond to
the client with the command result once the entry has been committed and applied.
```rust
/// Leaders replicate log entries to followers by appending to their logs
/// after the given base entry.
///
/// If the base entry matches the follower's log then their logs are
/// identical up to it (see section 5.3 in the Raft paper), and the entries
/// can be appended -- possibly replacing conflicting entries. Otherwise,
/// the append is rejected and the leader must retry an earlier base index
/// until a common base is found.
///
/// Empty append messages (no entries) are used to probe follower logs for
/// a common match index in the case of divergent logs, restarted nodes, or
/// dropped messages. This is typically done by sending probes with a
/// decrementing base index until a match is found, at which point the
/// subsequent entries can be sent.
Append {
    /// The index of the log entry to append after.
    base_index: Index,
    /// The term of the base entry.
    base_term: Term,
    /// Log entries to append. Must start at base_index + 1.
    entries: Vec<Entry>,
},
```
However, sometimes followers may be lagging behind the leader (e.g. after a crash), or their log may
have diverged from the leader (e.g. unsuccessful proposals from a stale leader after a network
partition). To handle these cases, the leader tracks the replication progress of each follower as
raft::Progress:
```rust
/// Per-follower replication progress (in this term).
struct Progress {
    /// The highest index where the follower's log is known to match the leader.
    /// Initialized to 0, increases monotonically.
    match_index: Index,
    /// The next index to replicate to the follower. Initialized to
    /// last_index + 1, decreased when probing log mismatches. Always in
    /// the range [match_index+1, last_index+1].
    ///
    /// Entries not yet sent are in the range [next_index, last_index].
    /// Entries not acknowledged are in the range [match_index+1, next_index).
    next_index: Index,
    /// The last read sequence number confirmed by this follower. To avoid stale
    /// reads on leader changes, a read is only served once its sequence number
    /// is confirmed by a quorum.
    read_seq: ReadSequence,
}
```
We'll gloss over these cases here (see the Raft paper and the code in raft::Progress and
maybe_send_append() for details). In the steady state, where each entry is successfully appended
and replicated one at a time, maybe_send_append() will fall through to the bottom and send a
single entry:
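The steady-state shape of that append can be sketched as follows. This is a simplified, self-contained illustration rather than toyDB's `maybe_send_append()`: it shows how, given a follower's `next_index`, the leader picks the preceding entry as the base and sends the entries from `next_index` onward.

```rust
// Hedged sketch of building an Append for a follower at `next_index`.
#[derive(Clone, Debug, PartialEq)]
struct Entry { index: u64, term: u64 }

/// Returns (base_index, base_term, entries to send) for a follower whose
/// next expected index is `next_index`. The log is 1-indexed; base_index 0
/// means the append starts at the beginning of the log.
fn build_append(log: &[Entry], next_index: u64) -> (u64, u64, Vec<Entry>) {
    let base_index = next_index - 1;
    let base_term = if base_index == 0 { 0 } else { log[base_index as usize - 1].term };
    let entries = log[next_index as usize - 1..].to_vec();
    (base_index, base_term, entries)
}
```

In the steady state `next_index` is `last_index + 1` until a new entry is appended, so the entries slice contains exactly the one new entry.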
The Message::Append contains the index/term of the entry immediately before the new entry as
base_index and base_term. If the follower's log also contains an entry with this index and term
then its log is guaranteed to match (be equal to) the leader's log up to this entry (see section 5.3
in the Raft paper). The follower can then append the new log entry and return a
Message::AppendResponse confirming that the entry was appended and that its log matches the
leader's log up to match_index:
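The follower's consistency check can be sketched as a single predicate. This is an illustrative simplification of the rule from section 5.3, not toyDB's actual code: the append is accepted only if the follower's log contains the base entry with the matching term.

```rust
// Hedged sketch of the follower's base-entry consistency check.
#[derive(Clone, Debug, PartialEq)]
struct Entry { index: u64, term: u64 }

/// Returns true if the follower's 1-indexed log has an entry at
/// `base_index` with `base_term`, i.e. the logs are guaranteed to match
/// up to and including the base entry.
fn base_matches(log: &[Entry], base_index: u64, base_term: u64) -> bool {
    if base_index == 0 {
        return true; // An empty base (start of log) always matches.
    }
    log.get(base_index as usize - 1)
        .map(|e| e.term == base_term)
        .unwrap_or(false)
}
```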
Once a quorum of nodes (in our case 2 out of 3 including the leader) have the entry in their log,
the leader can commit the entry and apply it to the state machine. It also looks up the in-flight
write request from writes and sends the command result back to the client as
Message::ClientResponse:
```rust
    // We can only safely commit an entry from our own term (see section
    // 5.4.2 in the Raft paper).
    match self.log.get(commit_index)? {
        Some(entry) if entry.term == self.term() => {}
        Some(_) => return Ok(old_index),
        None => panic!("commit index {commit_index} missing"),
    }

    // Commit entries.
    self.log.commit(commit_index)?;

    // Apply entries and respond to clients.
    let term = self.term();
    let mut iter = self.log.scan_apply(self.state.get_applied_index());
    while let Some(entry) = iter.next().transpose()? {
        debug!("Applying {entry:?}");
        let write = self.role.writes.remove(&entry.index);
        let result = self.state.apply(entry);
        if let Some(Write { id, from: to }) = write {
            let message = Message::ClientResponse { id, response: result.map(Response::Write) };
            Self::send_via(&self.tx, Envelope { from: self.id, term, to, message })?;
        }
    }
    drop(iter);

    // If the commit term changed, there may be pending reads waiting for us
    // to commit and apply an entry from our own term. Execute them.
    if old_term != self.term() {
        self.maybe_read()?;
    }
    Ok(commit_index)
}
```
The leader will also propagate the new commit index to followers via the next heartbeat, so that
they can also apply any pending log entries to their state machine. This isn't strictly necessary,
since reads are executed on the leader and nodes have to apply pending entries before becoming
leaders, but we do it anyway so that they don't fall too far behind on application.
For linearizable (aka strongly consistent) reads, we must execute read requests on the leader, as
mentioned above. However, this is not sufficient: under e.g. a network partition, a node may think
it's still the leader while in fact a different leader has been elected elsewhere (in a later term)
and executed writes there.
To handle this case, the leader must confirm that it is still the leader for each read, by sending a
Message::Read to its followers containing a read sequence number. Only if a quorum confirms that
it is still the leader can the read be executed. This incurs an additional network roundtrip, which
is clearly inefficient, so real-world systems often use leader leases instead (see section 6.4.1 of
the Raft thesis, not the paper) -- but it's fine for toyDB.
```rust
/// Leaders need to confirm they are still the leader before serving reads,
/// to guarantee linearizability in case a different leader has been
/// established elsewhere. Read requests are served once the sequence number
/// has been confirmed by a quorum.
Read { seq: ReadSequence },
/// Followers confirm leadership at the read sequence numbers.
ReadResponse { seq: ReadSequence },
```
When the leader receives the read request, it increments the read sequence number, stores the
pending read request in reads, and sends a Message::Read to all followers:
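The leader-side bookkeeping can be sketched like this. The struct and field names are illustrative stand-ins, not toyDB's exact ones: each read bumps a monotonic sequence number and is parked until a quorum confirms that number.

```rust
// Hedged sketch of the leader's pending-read bookkeeping.
use std::collections::VecDeque;

struct ReadState {
    /// The current read sequence number, incremented per read.
    seq: u64,
    /// Pending reads as (sequence number, opaque read command).
    reads: VecDeque<(u64, Vec<u8>)>,
}

impl ReadState {
    /// Registers a pending read and returns the sequence number to
    /// broadcast to all followers in a Message::Read.
    fn register(&mut self, command: Vec<u8>) -> u64 {
        self.seq += 1;
        self.reads.push_back((self.seq, command));
        self.seq
    }
}
```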
When the followers receive the Message::Read, they simply respond with a Message::ReadResponse
if it's from their current leader (messages from stale terms are ignored):
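The follower side is essentially a term check, sketched here with hypothetical parameter names: echo the sequence number back only if the message is from the current term's leader, and drop messages from stale terms.

```rust
// Hedged sketch of a follower handling Message::Read.
/// Returns the sequence number to confirm in a ReadResponse, or None if
/// the message is from a stale term and should be ignored.
fn handle_read(own_term: u64, msg_term: u64, seq: u64) -> Option<u64> {
    if msg_term == own_term {
        Some(seq) // Confirm leadership at this read sequence number.
    } else {
        None // Stale leader; ignore.
    }
}
```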
When the leader receives the Message::ReadResponse it records it in the peer's Progress, and
executes the read once a quorum have confirmed the sequence number:
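The quorum check itself can be sketched as follows, under the assumption (as in the `Progress` struct above) that the leader tracks each follower's highest confirmed read sequence; the leader counts itself, hence the `+ 1`.

```rust
// Hedged sketch of quorum confirmation for a pending read.
/// `confirmed` holds each follower's highest confirmed read sequence.
/// A read at `seq` can be served once leader + followers reach `quorum`.
fn read_confirmed(confirmed: &[u64], seq: u64, quorum: usize) -> bool {
    let votes = 1 + confirmed.iter().filter(|&&s| s >= seq).count();
    votes >= quorum
}
```

For a 3-node cluster with quorum 2, a single follower confirming the sequence number is enough, since the leader implicitly confirms its own reads.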