Add Extended System-Level TLA+ Specification for Raft Core Module by Qian-Cheng-nju · Pull Request #381 · etcd-io/raft

Qian-Cheng-nju · 2026-02-04T08:13:54Z

Summary

(We're researchers from NJU/MSR/UIUC/UBC working on writing system specifications in TLA+. We've been using our tool on etcd/raft and wanted to contribute our artifacts back, if the community finds this useful! We're willing to help maintain this spec if accepted.)

This PR adds a system-level TLA+ spec extending the existing spec in tla/. The two specs are intended to co-exist: the original spec provides a high-level description of the Raft protocol for understanding and verifying core safety properties, while our spec provides a low-level description of implementation details and enables verification of finer-grained system states. The spec models the raft library's core logic as implemented in code, covering leader election, log replication, snapshotting, membership changes, and progress tracking with flow control. We abstract away I/O, e.g., network transport and persistent storage are not modeled. The spec focuses on what the raft module decides, not how it communicates or persists those decisions.

This spec can serve as a more detailed formal document of etcd’s raft core. It is also useful to expose system-level bugs and issues (through model checking), as we show below. We are willing to maintain this spec upon code changes in the raft library's core logic.

New Features

This spec models important system-level details, such as flow control, snapshot state machines, and joint consensus, that are not included in the original spec, but are important to the correctness of the system. Specifically:
For configuration changes, the original spec only models single-node changes. Our spec models the full joint consensus protocol (ConfChangeV2) including pendingConfIndex tracking and auto-leave mechanism, which are necessary to reproduce bugs like #12136 and bd3c759.

In extended spec, the leader maintains a progress state machine for each follower (StateProbe, StateReplicate, StateSnapshot) and uses flow control (inflights, MsgAppFlowPaused) to avoid overwhelming slow followers.
For snapshot, the original spec models the snapshot state machine including pendingSnapshot tracking and the state transitions when snapshots succeed or fail. In the original spec, snapshot is modeled as a stateless message send. There is no StateSnapshot progress state, no pendingSnapshot tracking, and no handling of success or failure.

Trace Validation Details

We validated the spec against 37 traces generated from etcd/raft's [test suite](https://github.com/etcd-io/raft/tree/main/testdata), with additional test cases written for more comprehensive trace coverage.

The trace validation achieved 94.43% statement coverage (407 out of 431 statements). This high coverage indicates that the spec closely matches the actual code behavior. A TLA+ statement is considered covered if the execution path passes through it at least once across all traces. The uncovered 5.57% represents edge cases requiring specific timing sequences that are difficult to trigger through standard testing.

The instrumentation code and harness are in #382.

The spec defines 88 invariants, derived from the paper, code, and past issues/bugs as regression tests.

Issues Found

We find the following three issues.

1. Probe-mode flow-ontrol issue

Code: raft.go:maybeSendAppend()

Invariant violated: ProbeLimitInv

ProbeLimitInv ==
   \A i \in Server : \A j \in Server :
       (state[i] = Leader /\ progressState[i][j] = StateProbe)
           => InflightsCount(i, j) <= 1

This invariant encodes the intent of StateProbe: when probing a potentially slow/unreachable follower, the leader should have at most one message in-flight*.

Our model checking shows that the maybeSendAppend function has a sendIfEmpty parameter. When sendIfEmpty=true and the entry list is empty, the function bypasses the IsPaused() check, allowing multiple empty MsgApp messages to accumulate in-flight.

We realize that this is a known issue. The code has a TODO comment noting that MsgAppFlowPaused should be set unconditionally in StateProbe, but currently it only triggers when entries > 0, allowing empty messages to bypass the flow control.

2. Documentation ambiguity: EnterJoint() quorum overlap

Doc: doc.go:262-270

The doc states that "old and new configurations are guaranteed to overlap", but this only holds for Simple() (single-node changes via symdiff > 1 check). EnterJoint() lacks this check, allowing multi-node changes where quorums may not overlap (e.g., {1,2,3} → {3}). This violates QuorumLogInv.

3. DisableConfChangeValidation safety violation

Code: raft.go:258-269

When DisableConfChangeValidation is enabled, QuorumLogInv can also be violated. This option disables propose-time validation checks, allowing invalid configuration changes that can lead to committed entries not existing on a quorum (e.g., proposing LeaveJoint before new members have synced the log). Our spec formally documents what safety properties are lost when this option is enabled.

Fidelity

To ensure that the spec models the system in the right level and to show that maintaining such a system-level spec could find bugs, we reverted the fixes of 5 historical bugs and updated the spec based on the code changes.

Taking Bug bd3c759 as an example, this bug occurs when the leader is in a joint configuration with autoLeave=true. The original code failed to check whether a previous leave-joint proposal was still pending, allowing multiple leave-joint entries to be proposed.

We model this bug by removing two guards in the spec:

ProposeLeaveJoint(i) ==
   /\ state[i] = Leader
   /\ IsJointConfig(i)
   /\ config[i].autoLeave = TRUE
   \* BUG: Removed check - original: /\ pendingConfChangeIndex[i] = 0
   /\ Replicate(...)
   \* BUG: Don't update pendingConfChangeIndex
   /\ UNCHANGED <<pendingConfChangeIndex>>

We define SinglePendingLeaveJointInv to encode that there should be at most one pending leave-joint entry in the uncommitted log. Running TLC in simulation mode detects the violation within minutes.

We show that the spec could help detect all these bugs through model checking. The original spec couldn't detect any of them, as it lacks the system-level concepts required: log compaction, auto-leave mechanism, asynchronous apply, and progress state machine.

Bug	Issue/Commit	Invariant
`76f1249`	commit	`AppendEntriesPrevLogTermValidInv`
`bd3c759`	commit	`SinglePendingLeaveJointInv`
#12136	issue	`JointStateAutoLeavePossibleInv`
#7280	issue	`ElectionConfigAppliedInv`
#124	issue	`SnapshotTransitionCorrectInv`

k8s-ci-robot · 2026-02-04T08:14:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Qian-Cheng-nju
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-02-04T08:14:06Z

Hi @Qian-Cheng-nju. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

serathius · 2026-02-04T09:14:59Z

Thanks for sending PR. However, the etcd community doesn't have much expertise with TLA+.

Only person that could review it would be @joshuazh-x, who contributed TLA+ spec for etcd, however he is no longer active and I don't know they will have time.

Qian-Cheng-nju and others added 2 commits February 3, 2026 13:20

tla: add system-level TLA+ specification for raft core

cebab3f

Merge branch 'etcd-io:main' into pr1-spec

51da9c7

k8s-ci-robot added the needs-ok-to-test label Feb 4, 2026

k8s-ci-robot added the size/XXL label Feb 4, 2026

Qian-Cheng-nju mentioned this pull request Feb 4, 2026

Add Trace Validation Harness for Extended TLA+ Spec #382

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Extended System-Level TLA+ Specification for Raft Core Module#381

Add Extended System-Level TLA+ Specification for Raft Core Module#381
Qian-Cheng-nju wants to merge 2 commits intoetcd-io:mainfrom
specula-org:pr1-spec

Qian-Cheng-nju commented Feb 4, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

serathius commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

Qian-Cheng-nju commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New Features

Issues Found

1. Probe-mode flow-ontrol issue

2. Documentation ambiguity: EnterJoint() quorum overlap

3. DisableConfChangeValidation safety violation

Fidelity

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

serathius commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Qian-Cheng-nju commented Feb 4, 2026 •

edited

Loading