Skip to content

Add Extended System-Level TLA+ Specification for Raft Core Module#381

Open
Qian-Cheng-nju wants to merge 2 commits intoetcd-io:mainfrom
specula-org:pr1-spec
Open

Add Extended System-Level TLA+ Specification for Raft Core Module#381
Qian-Cheng-nju wants to merge 2 commits intoetcd-io:mainfrom
specula-org:pr1-spec

Conversation

@Qian-Cheng-nju
Copy link
Contributor

@Qian-Cheng-nju Qian-Cheng-nju commented Feb 4, 2026

Summary

(We're researchers from NJU/MSR/UIUC/UBC working on writing system specifications in TLA+. We've been using our tool on etcd/raft and wanted to contribute our artifacts back, if the community finds this useful! We're willing to help maintain this spec if accepted.)

This PR adds a system-level TLA+ spec extending the existing spec in tla/. The two specs are intended to co-exist: the original spec provides a high-level description of the Raft protocol for understanding and verifying core safety properties, while our spec provides a low-level description of implementation details and enables verification of finer-grained system states. The spec models the raft library's core logic as implemented in code, covering leader election, log replication, snapshotting, membership changes, and progress tracking with flow control. We abstract away I/O, e.g., network transport and persistent storage are not modeled. The spec focuses on what the raft module decides, not how it communicates or persists those decisions.

This spec can serve as a more detailed formal document of etcd’s raft core. It is also useful to expose system-level bugs and issues (through model checking), as we show below. We are willing to maintain this spec upon code changes in the raft library's core logic.

New Features

This spec models important system-level details, such as flow control, snapshot state machines, and joint consensus, that are not included in the original spec, but are important to the correctness of the system. Specifically:
For configuration changes, the original spec only models single-node changes. Our spec models the full joint consensus protocol (ConfChangeV2) including pendingConfIndex tracking and auto-leave mechanism, which are necessary to reproduce bugs like #12136 and bd3c759.

In extended spec, the leader maintains a progress state machine for each follower (StateProbe, StateReplicate, StateSnapshot) and uses flow control (inflights, MsgAppFlowPaused) to avoid overwhelming slow followers.
For snapshot, the original spec models the snapshot state machine including pendingSnapshot tracking and the state transitions when snapshots succeed or fail. In the original spec, snapshot is modeled as a stateless message send. There is no StateSnapshot progress state, no pendingSnapshot tracking, and no handling of success or failure.

Trace Validation Details We validated the spec against 37 traces generated from etcd/raft's [test suite](https://github.com/etcd-io/raft/tree/main/testdata), with additional test cases written for more comprehensive trace coverage.

The trace validation achieved 94.43% statement coverage (407 out of 431 statements). This high coverage indicates that the spec closely matches the actual code behavior. A TLA+ statement is considered covered if the execution path passes through it at least once across all traces. The uncovered 5.57% represents edge cases requiring specific timing sequences that are difficult to trigger through standard testing.

The instrumentation code and harness are in #382.

The spec defines 88 invariants, derived from the paper, code, and past issues/bugs as regression tests.

Issues Found

We find the following three issues.

1. Probe-mode flow-ontrol issue

Code: raft.go:maybeSendAppend()

Invariant violated: ProbeLimitInv

ProbeLimitInv ==
   \A i \in Server : \A j \in Server :
       (state[i] = Leader /\ progressState[i][j] = StateProbe)
           => InflightsCount(i, j) <= 1

This invariant encodes the intent of StateProbe: when probing a potentially slow/unreachable follower, the leader should have at most one message in-flight*.

Our model checking shows that the maybeSendAppend function has a sendIfEmpty parameter. When sendIfEmpty=true and the entry list is empty, the function bypasses the IsPaused() check, allowing multiple empty MsgApp messages to accumulate in-flight.

We realize that this is a known issue. The code has a TODO comment noting that MsgAppFlowPaused should be set unconditionally in StateProbe, but currently it only triggers when entries > 0, allowing empty messages to bypass the flow control.

2. Documentation ambiguity: EnterJoint() quorum overlap

Doc: doc.go:262-270

The doc states that "old and new configurations are guaranteed to overlap", but this only holds for Simple() (single-node changes via symdiff > 1 check). EnterJoint() lacks this check, allowing multi-node changes where quorums may not overlap (e.g., {1,2,3}{3}). This violates QuorumLogInv.

3. DisableConfChangeValidation safety violation

Code: raft.go:258-269

When DisableConfChangeValidation is enabled, QuorumLogInv can also be violated. This option disables propose-time validation checks, allowing invalid configuration changes that can lead to committed entries not existing on a quorum (e.g., proposing LeaveJoint before new members have synced the log). Our spec formally documents what safety properties are lost when this option is enabled.

Fidelity

To ensure that the spec models the system in the right level and to show that maintaining such a system-level spec could find bugs, we reverted the fixes of 5 historical bugs and updated the spec based on the code changes.

Taking Bug bd3c759 as an example, this bug occurs when the leader is in a joint configuration with autoLeave=true. The original code failed to check whether a previous leave-joint proposal was still pending, allowing multiple leave-joint entries to be proposed.

We model this bug by removing two guards in the spec:

ProposeLeaveJoint(i) ==
   /\ state[i] = Leader
   /\ IsJointConfig(i)
   /\ config[i].autoLeave = TRUE
   \* BUG: Removed check - original: /\ pendingConfChangeIndex[i] = 0
   /\ Replicate(...)
   \* BUG: Don't update pendingConfChangeIndex
   /\ UNCHANGED <<pendingConfChangeIndex>>

We define SinglePendingLeaveJointInv to encode that there should be at most one pending leave-joint entry in the uncommitted log. Running TLC in simulation mode detects the violation within minutes.

We show that the spec could help detect all these bugs through model checking. The original spec couldn't detect any of them, as it lacks the system-level concepts required: log compaction, auto-leave mechanism, asynchronous apply, and progress state machine.

Bug Issue/Commit Invariant
76f1249 commit AppendEntriesPrevLogTermValidInv
bd3c759 commit SinglePendingLeaveJointInv
#12136 issue JointStateAutoLeavePossibleInv
#7280 issue ElectionConfigAppliedInv
#124 issue SnapshotTransitionCorrectInv

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Qian-Cheng-nju
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

Hi @Qian-Cheng-nju. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Copy link
Member

Thanks for sending PR. However, the etcd community doesn't have much expertise with TLA+.

Only person that could review it would be @joshuazh-x, who contributed TLA+ spec for etcd, however he is no longer active and I don't know they will have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments