
Conversation

kostasrim (Contributor):

Do not review.

};
fb2::LockGuard lk(replicaof_mu_);
// Deep copy because tl_replica might be overwritten in between
auto replica = tl_replica;
kostasrim (Author):
Bread and butter no. 1 of this PR: bye-bye blocking INFO command because of the mutex.
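
For illustration, a minimal self-contained sketch of the copy-under-lock pattern the diff uses, with std::mutex standing in for fb2::Mutex and assumed names for the replica type and accessor:

#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-ins; the real types live in Dragonfly's server code.
struct Replica {
  std::string GetInfo() const { return "role:replica"; }
};

std::mutex replicaof_mu_;             // stand-in for fb2::Mutex
std::shared_ptr<Replica> tl_replica;  // swapped by REPLICAOF

std::string InfoReplication() {
  std::shared_ptr<Replica> replica;
  {
    std::lock_guard<std::mutex> lk(replicaof_mu_);  // held only for the copy
    replica = tl_replica;
  }
  // We now own a reference: INFO proceeds without the mutex, so a long
  // replication operation can no longer block it (and vice versa).
  return replica ? replica->GetInfo() : "role:master";
}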

kostasrim (Author):
TODO: worth adding a test.

// If we are called by "Replicate", tx will be null but we do not need
// to flush anything.

util::fb2::LockGuard lk(replicaof_mu_);
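
As a rough sketch of the guard that comment describes (FlushAll is a hypothetical stand-in for whatever flush would otherwise run here):

if (tx != nullptr) {
  // Invoked from a real client command: flush before switching masters.
  FlushAll(tx);
}
// tx == nullptr means we were called internally via "Replicate":
// nothing to flush, fall through.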
kostasrim (Author):
Bread and butter no. 2: no more entering the loading state prematurely. We only enter loading_state if we are doing a full sync; otherwise, no state change at all.
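
A self-contained sketch of that rule (the enum and function names are assumed, not Dragonfly's actual identifiers):

enum class GlobalState { ACTIVE, LOADING };
enum class SyncType { kFull, kPartial };

// The state machine leaves ACTIVE only when a full sync is about to
// replace the dataset; a partial sync keeps serving as-is.
GlobalState NextState(GlobalState current, SyncType sync) {
  if (sync == SyncType::kFull)
    return GlobalState::LOADING;  // flush + snapshot load ahead
  return current;                 // partial sync: no state change at all
}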

kostasrim commented on Sep 1, 2025:

TODO: move this to ReplicaOfInternalV2 so that the new algorithm is easy to review. IMO, with these changes this GH diff is unreviewable 😱

kostasrim force-pushed the kpr3 branch 3 times, most recently from 38640b3 to 76598f2 on September 2, 2025 at 13:28

logging.info(f"succeses: {num_successes}")
assert COMMANDS_TO_ISSUE > num_successes, "At least one REPLICAOF must be cancelled"
assert COMMANDS_TO_ISSUE == num_successes
kostasrim (Author) on Sep 2, 2025:

The new algorithm does not use two-phase locking, so the following interleaving is no longer possible:

  1. client 1 -> REPLICAOF -> locks the mutex -> updates replica_ to new_replica -> releases the mutex -> calls replica_->Start()
  2. client 2 -> REPLICAOF -> same as (1), but first calls replica_->Stop() -> releases the mutex
  3. client 1 -> REPLICAOF grabs the mutex a second time, observes that the context got cancelled by step (2), and boom, returns "replication cancelled"

This cannot happen anymore because we lock only once and update everything atomically, including stopping the previous replica. So by the time client 2 grabs the lock in the example above, the previous REPLICAOF command is not in some intermediate state. To observe that we did indeed cancel, we should read the logs and see "Stopping replication" COMMANDS_TO_ISSUE - 1 times, plus 1 because of the Shutdown() at the end.
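
A minimal sketch of that single-critical-section flow, reusing the fb2::LockGuard and Replica::Start()/Stop() names from this thread (the surrounding wiring is assumed):

#include <memory>
#include <utility>  // std::exchange

void ReplicaOfV2Sketch(std::shared_ptr<Replica> new_replica) {
  util::fb2::LockGuard lk(replicaof_mu_);
  // Swap and stop inside one critical section: no other REPLICAOF can
  // observe the intermediate state between these steps.
  auto old_replica = std::exchange(tl_replica, std::move(new_replica));
  if (old_replica)
    old_replica->Stop();  // logs "Stopping replication"
  tl_replica->Start();    // no unlock/relock between update and Start()
}

With this shape, each REPLICAOF stops exactly one predecessor, which matches counting COMMANDS_TO_ISSUE - 1 "Stopping replication" log lines plus one for the final Shutdown().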

Bonus points: I suspect we might also be able to solve #4685, but I will need to follow up on that.
