Kvrocks gracefully failover design proposal #3218

yuzegao · 2025-10-07T10:21:25Z

yuzegao
Oct 7, 2025

Hi all,
In #2848, we plan to support a graceful (no-data-loss) failover command.

Summary

Kvrocks today lacks a production-ready failover workflow that guarantees no data loss when a new master is promoted manually. This proposal compares two feasible approaches:

In-process failover - Extend the Kvrocks core to perform a coordinated failover (the master suspends writes, the master and slave swap roles, and writes are resumed).
Controller-based failover - Implement the failover in an external HA controller (kvrocks-controller) that uses native Kvrocks commands (INFO, CLIENT PAUSE, CLUSTERX SETNODES, etc.) to coordinate a safe failover. This requires CLIENT PAUSE WRITE and CLIENT UNPAUSE to be implemented in Kvrocks.

Both approaches are viable. This document outlines the design and tradeoffs of each, in the hope of sparking community discussion on which approach to adopt. Non-cluster mode is out of scope for this proposal.

Goals

Primary goal: Provide a failover mechanism to prevent data loss during manual maintenance (node migration, process upgrade).

Option A — In-process (Node-local) Failover

Concept

Enhance the Kvrocks server binary so that a given slave within a shard cooperates with the master to complete the role reversal: check the master-slave replication offsets, pause writes on the master, let the slave catch up to the master's offset, swap the master and slave roles, and resume writes on the old master.

Core Components/Steps

1. The replica node receives the FAILOVER command and notifies the master node to initiate a master-slave role reversal.
2. The master node checks the slave node's offset, suspends write operations, and, once the slave's offset matches its own, changes its own role to slave.
3. The slave node is notified to switch to the master role and take over the slots. If this step fails, the role change made in step 2 is rolled back.
4. Write operations are resumed, and all subsequent read and write requests received by the old master are forwarded to the new master node to ensure data consistency.
5. After the controller confirms that the failover succeeded, it executes CLUSTERX SETNODES to update the cluster topology.
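To make the ordering of the pause, catch-up, role swap, and rollback easier to discuss, here is a minimal in-memory sketch of the second through fourth steps. Everything in it (`Node`, `FailoverError`, `in_process_failover`, and the `catch_up` and `promote` callbacks) is a hypothetical stand-in for illustration, not Kvrocks code or the design of any actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical in-memory model of one node in a shard."""
    name: str
    role: str                        # "master" or "slave"
    offset: int = 0                  # replication offset applied so far
    writes_paused: bool = False
    forward_to: Optional["Node"] = None


class FailoverError(Exception):
    pass


def in_process_failover(master: Node, slave: Node,
                        catch_up: Callable[[Node, Node], None],
                        promote: Callable[[Node], None],
                        max_rounds: int = 3) -> None:
    master.writes_paused = True               # suspend writes on the master
    try:
        rounds = 0
        while slave.offset != master.offset:  # wait for equal offsets
            if rounds == max_rounds:
                raise FailoverError("slave did not catch up")
            catch_up(master, slave)
            rounds += 1
        master.role = "slave"                 # demote the old master
        try:
            promote(slave)                    # slave takes over the slots
        except Exception:
            master.role = "master"            # roll back the demotion
            raise
        master.forward_to = slave             # later requests go to the new master
    finally:
        master.writes_paused = False          # resume writes in every outcome


def stream(master: Node, slave: Node) -> None:
    """Assumed replication step: the slave applies up to 50 more offsets."""
    slave.offset = min(master.offset, slave.offset + 50)


def promote(node: Node) -> None:
    node.role = "master"


m = Node("node-a", "master", offset=100)
s = Node("node-b", "slave", offset=40)
in_process_failover(m, s, stream, promote)
print(m.role, s.role, s.offset, m.writes_paused)  # slave master 100 False
```

The `finally` block is a design choice of this sketch: writes are resumed on the old master whether the swap succeeded or was rolled back, so a failed attempt never leaves the shard read-only.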

Advantages

Autonomous strategy: Fewer interactive steps mean less risk of anomalies.
Lowest latency: Fewer round trips to the external controller.

Disadvantages/Risks

Highly invasive changes to the core; increased code complexity and a larger testing surface.
Operational debugging and observability become more difficult (the failover logic lives inside the server process).

Option B — Controller-Based Failover

Concept

An external, highly available kvrocks controller is responsible for master-slave role reversals and cluster topology updates. The controller checks the replication offsets of the master and slave nodes, pauses writes on the master, updates the master and slave roles and the cluster topology, and resumes writes on the old master. The controller also handles exceptions through retries and rollbacks.
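As a rough illustration of the offset check, the sketch below parses INFO replication text and computes how far a replica is behind. It assumes the Redis-style layout (`master_repl_offset:<n>` and `slaveN:ip=...,port=...,offset=...,lag=...`); the sample text, addresses, and helper names are made up for the example, and the exact fields Kvrocks reports should be checked against a real node.

```python
def parse_info_replication(text: str) -> dict:
    """Parse 'key:value' lines of INFO replication output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and section headers
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_offsets(fields: dict) -> dict:
    """Map 'ip:port' of each attached replica to its reported offset."""
    offsets = {}
    for key, value in fields.items():
        # Per-replica lines look like slave0, slave1, ... with k=v pairs.
        if key.startswith("slave") and "=" in value:
            attrs = dict(item.split("=", 1) for item in value.split(","))
            offsets[f"{attrs['ip']}:{attrs['port']}"] = int(attrs["offset"])
    return offsets


def replication_gap(fields: dict, replica: str) -> int:
    """How many offsets the given replica is behind the master."""
    return int(fields["master_repl_offset"]) - replica_offsets(fields)[replica]


# Hypothetical sample output, not captured from a real node.
SAMPLE = """# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.2,port=6666,offset=1180,lag=0
master_repl_offset:1200
"""

fields = parse_info_replication(SAMPLE)
print(replication_gap(fields, "10.0.0.2:6666"))  # 20
```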

Core Components/Steps

1. The controller checks the replication lag between the target slave and the master (INFO REPLICATION).
2. If the lag is small enough, the controller pauses writes on the master (CLIENT PAUSE WRITE needs to be implemented).
3. Once the slave's replication offset matches the master's, the controller switches the slave's role and updates its topology (CLUSTERX SETNODES).
4. The controller switches the old master's role and updates its topology.
5. The controller resumes the paused writes on the old master (CLIENT UNPAUSE needs to be implemented).
6. The controller then updates the topology information on the other nodes in the cluster.
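The controller-side sequence above could be sketched roughly as follows. `FakeNode` and all of its methods are hypothetical in-memory stand-ins (for CLIENT PAUSE WRITE, CLIENT UNPAUSE, and CLUSTERX SETNODES); nothing here talks to a real node, and `max_gap`, `max_rounds`, and the version number are illustrative parameters rather than part of the proposal.

```python
class FakeNode:
    """In-memory stand-in for a Kvrocks node; every method is hypothetical."""

    def __init__(self, name, offset=0):
        self.name = name
        self.offset = offset
        self.log = []                     # records the commands it received

    def pause_writes(self):               # stands in for CLIENT PAUSE WRITE
        self.log.append("pause")

    def unpause_writes(self):             # stands in for CLIENT UNPAUSE
        self.log.append("unpause")

    def set_nodes(self, version):         # stands in for CLUSTERX SETNODES
        self.log.append(f"setnodes v{version}")

    def replicate_from(self, master, batch=50):
        # Assumed replication step: apply up to `batch` more offsets.
        self.offset = min(master.offset, self.offset + batch)


def controller_failover(master, slave, others, new_version,
                        max_gap=100, max_rounds=5):
    """Return True if the role swap and topology update were carried out."""
    if master.offset - slave.offset > max_gap:
        return False                      # lag too large, do not start
    master.pause_writes()                 # pause writes on the master
    try:
        for _ in range(max_rounds):       # wait until the offsets match
            if slave.offset == master.offset:
                break
            slave.replicate_from(master)
        if slave.offset != master.offset:
            return False                  # give up; writes resume in finally
        slave.set_nodes(new_version)      # new topology on the new master
        master.set_nodes(new_version)     # new topology on the old master
    finally:
        master.unpause_writes()           # resume writes in every outcome
    for node in others:
        node.set_nodes(new_version)       # propagate to the rest of the cluster
    return True


master = FakeNode("node-a", offset=300)
slave = FakeNode("node-b", offset=220)
other = FakeNode("node-c")
ok = controller_failover(master, slave, [other], new_version=2)
print(ok, master.log)  # True ['pause', 'setnodes v2', 'unpause']
```

The sketch deliberately leaves out the hard part called out under the tradeoffs: if the topology update succeeds on one node and fails on the other, a real controller needs a retry or rollback path.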

Advantages

Operationally friendly: The controller is auditable and versioned, making it easier to iterate.
Low risk: Changes to kvrocks are small and testable.

Disadvantages/Tradeoffs

1. The controller and kvrocks require multiple rounds of interaction, requiring handling of multiple exceptions (retries and rollbacks).
2. Slightly higher failover latency (controller coordination); this is acceptable in most environments.
3. Controllers must be highly reliable, becoming critical infrastructure (but easier to monitor and maintain than core logic).

git-hulk · 2025-10-08T08:10:58Z

git-hulk
Oct 8, 2025
Collaborator

Hi @yuzegao, thanks for your detailed solution. I personally prefer solution A because this failover behavior can also be used in master-replica mode, and its consistency with Redis would make it easier for users to understand.

0 replies

ethervoid · 2025-10-09T10:40:07Z

ethervoid
Oct 9, 2025

I have one question about solution A. I can see in the diagram that it performs a CLUSTERX FAILOVER, but I thought, from the issues this proposal comes from, that this was not related to clustering but to master-replica failover performed by Redis Sentinel with the failover command. Does this solution work with the Redis Sentinel failover command?

0 replies

yuzegao · 2025-10-26T07:47:30Z

yuzegao
Oct 26, 2025
Author

@git-hulk @ethervoid
The above proposal does not consider lossless failover in Redis Sentinel mode. When Redis uses the Redis Sentinel high-availability architecture, executing the SENTINEL failover command does not guarantee zero data loss. Redis Sentinel simply uses INFO REPLICATION to select the slave instance with the smallest offset gap, and then executes the SLAVEOF NO ONE command. See the following timing diagram for details.
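A small worked example of why picking the closest replica is not the same as losing nothing. It is a deliberately simplified sketch: the offsets and names are invented, and real Sentinel also weighs replica priority and run ID when choosing a candidate.

```python
def pick_replica(master_offset, replica_offsets):
    """Choose the replica with the smallest offset gap; return (name, gap)."""
    name = max(replica_offsets, key=replica_offsets.get)
    return name, master_offset - replica_offsets[name]


name, gap = pick_replica(5000, {"replica-1": 4990, "replica-2": 4970})
print(name, gap)  # replica-1 10
```

Even the best candidate here is still 10 offsets behind. If it is promoted with SLAVEOF NO ONE at that moment, those writes exist only on the old master.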

If the manual failover commands for Kvrocks must support lossless failover in both Redis Sentinel and cluster modes, modifications to the Redis Sentinel source code are unavoidable (directly modifying the behavior of the SLAVEOF NO ONE command may not be appropriate):

Implement the SLAVEOF NO ONE GRACEFUL command in kvrocks to ensure data consistency between the master and slave instances during a master-slave failover.
When executing the SENTINEL failover command on a Redis Sentinel, if the master node is reachable, the SLAVEOF NO ONE GRACEFUL command is used to initiate a master-slave failover without data loss.
kvrocks implements the "CLUSTERX FAILOVER" command, which works with the kvrocks controller to reverse the master-slave roles.
The SLAVEOF NO ONE GRACEFUL and CLUSTERX FAILOVER commands are expected to share the same internal mechanism within Kvrocks. I am currently studying the Kvrocks source code to evaluate potential modifications.

1 reply

git-hulk Oct 27, 2025
Collaborator

@yuzegao Thanks for your information.

Implement the SLAVEOF NO ONE GRACEFUL command in kvrocks to ensure data consistency between the master and slave instances during a master-slave failover

In my opinion, the main issue is that Redis Sentinel doesn't use SENTINEL FAILOVER to switch the master. Failing over via Sentinel also has the data loss issue today, and it's quite strange to require that the slave MUST be consistent with the old master.

kvrocks implements the "CLUSTERX FAILOVER" command, which works with the kvrocks controller to reverse the master-slave roles.

I prefer implementing the FAILOVER command in Kvrocks only, and then updating the cluster topology information via the controller once it succeeds.

yuzegao · 2025-10-28T12:00:35Z

yuzegao
Oct 28, 2025
Author

This seems more reasonable. I will develop it in cluster mode to implement lossless failover. If you have better suggestions, please let me know.

0 replies

ethervoid · 2025-10-28T12:14:18Z

ethervoid
Oct 28, 2025

Understood, sounds good to me: with a Kvrocks FAILOVER command we can perform controlled manual failovers without losing data. Thank you for the clarification.

0 replies

greatsharp · 2025-10-31T04:10:14Z

greatsharp
Oct 31, 2025

After taking a look at the Redis 'cluster failover' and 'client pause/unpause' commands, I prefer option A.
If you need help with the kvrocks controller implementation, please let me know.

0 replies

greatsharp · 2025-10-31T04:40:31Z

greatsharp
Oct 31, 2025

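@zhixinwen I noticed that you contributed the WAIT command and the ACK mechanism for master-slave replication, so I'd like to hear your thoughts on this.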

1 reply

zhixinwen Oct 31, 2025

I did not read the full post, but if I am implementing this I would do the following:

Add a new command that does two things: first, it blocks writes on the master; second, it waits for the replica to fully catch up. The existing WAIT + ACK mechanism can be reused for this.
An optional timeout can be added to the command above. If the replica catches up within the timeout, the command is treated as successful. If not, the controller treats the graceful failover as failed. The WAIT timeout can be used for this.
Once the command succeeds, which means the replica has caught up, the controller can start the failover command. The old master clears its write-forbidden mark once it becomes a replica.

There shouldn't be a lot of work needed to make it work.
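For discussion, a minimal sketch of that block-then-wait idea with a timeout. The `Master` class, `pause_and_sync`, and the `read_acked_offset` callback are hypothetical names for illustration; in Kvrocks the acknowledged offset would come from the existing WAIT + ACK mechanism. Clearing the write-forbidden mark on timeout is an assumption of this sketch, not something stated above.

```python
import time


class Master:
    """Hypothetical model of the master's write-forbidden mark."""

    def __init__(self, offset):
        self.offset = offset
        self.role = "master"
        self.write_forbidden = False

    def write(self, n=1):
        if self.write_forbidden:
            raise RuntimeError("writes are blocked")
        self.offset += n

    def become_replica(self):
        self.role = "replica"
        self.write_forbidden = False      # mark is cleared on demotion


def pause_and_sync(master, read_acked_offset, timeout_s, poll_s=0.01):
    """Block writes, then wait until the replica acknowledges master.offset."""
    master.write_forbidden = True
    deadline = time.monotonic() + timeout_s
    while True:
        if read_acked_offset() >= master.offset:
            return True                   # replica has fully caught up
        if time.monotonic() >= deadline:
            # Assumption: unblock writes so a failed attempt is harmless.
            master.write_forbidden = False
            return False
        time.sleep(poll_s)


# The replica acknowledges 90, then 95, then 100 on successive polls.
acks = iter([90, 95, 100])
m = Master(offset=100)
caught_up = pause_and_sync(m, lambda: next(acks), timeout_s=1.0)
if caught_up:
    m.become_replica()                    # controller runs the failover next
print(caught_up, m.role, m.write_forbidden)  # True replica False
```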

yuzegao · 2025-12-14T05:13:59Z

yuzegao
Dec 14, 2025
Author

I have completed the failover feature development and submitted a pull request: #3295.
The implementation mainly follows the clusterx migrate command, with minimal modifications to the original Kvrocks logic.
However, to allow the slave to accept write requests after the failover, some tricks are involved. The sequence diagram is as follows:

@greatsharp It would be great if you could help with the development of the controller failover functionality.

0 replies
