
Conversation

kostasrim (Contributor):

Do not review.

};
fb2::LockGuard lk(replicaof_mu_);
// Deep copy because tl_replica might be overwritten in between
auto replica = tl_replica;
kostasrim (Author):
Bread and butter no. 1 of this PR: bye-bye blocking INFO command because of the mutex.
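
For illustration, a minimal self-contained sketch of the copy-under-lock pattern the diff uses, with std::mutex standing in for fb2::Mutex and assumed names for the replica type and accessor:

#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-ins; the real types live in Dragonfly's server code.
struct Replica {
  std::string GetInfo() const { return "role:replica"; }
};

std::mutex replicaof_mu_;             // stand-in for fb2::Mutex
std::shared_ptr<Replica> tl_replica;  // swapped by REPLICAOF

std::string InfoReplication() {
  std::shared_ptr<Replica> replica;
  {
    std::lock_guard<std::mutex> lk(replicaof_mu_);  // held only for the copy
    replica = tl_replica;
  }
  // We now own a reference: INFO proceeds without the mutex, so a long
  // replication operation can no longer block it (and vice versa).
  return replica ? replica->GetInfo() : "role:master";
}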

kostasrim (Author):
TODO: worth adding a test.

// If we are called by "Replicate", tx will be null but we do not need
// to flush anything.

util::fb2::LockGuard lk(replicaof_mu_);
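
As a rough sketch of the guard that comment describes (FlushAll is a hypothetical stand-in for whatever flush would otherwise run here):

if (tx != nullptr) {
  // Invoked from a real client command: flush before switching masters.
  FlushAll(tx);
}
// tx == nullptr means we were called internally via "Replicate":
// nothing to flush, fall through.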
kostasrim (Author):
Bread and butter no. 2: no more entering the loading state prematurely. We only enter loading_state if we are doing a full sync; otherwise, no state change at all.
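
A self-contained sketch of that rule (the enum and function names are assumed, not Dragonfly's actual identifiers):

enum class GlobalState { ACTIVE, LOADING };
enum class SyncType { kFull, kPartial };

// The state machine leaves ACTIVE only when a full sync is about to
// replace the dataset; a partial sync keeps serving as-is.
GlobalState NextState(GlobalState current, SyncType sync) {
  if (sync == SyncType::kFull)
    return GlobalState::LOADING;  // flush + snapshot load ahead
  return current;                 // partial sync: no state change at all
}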

kostasrim commented on Sep 1, 2025:

TODO: move this to ReplicaOfInternalV2 so that the new algorithm is easy to review. IMO, with these changes this GH diff is unreviewable 😱

kostasrim force-pushed the kpr3 branch 3 times, most recently from 38640b3 to 76598f2 on September 2, 2025 at 13:28

logging.info(f"succeses: {num_successes}")
assert COMMANDS_TO_ISSUE > num_successes, "At least one REPLICAOF must be cancelled"
assert COMMANDS_TO_ISSUE == num_successes
kostasrim (Author) on Sep 2, 2025:

The new algorithm does not use two-phase locking, so the following interleaving is no longer possible:

  1. client 1 -> REPLICAOF -> locks the mutex -> updates replica_ to new_replica -> releases the mutex -> calls replica_->Start()
  2. client 2 -> REPLICAOF -> same as (1), but first calls replica_->Stop() -> releases the mutex
  3. client 1 -> REPLICAOF grabs the mutex a second time, observes that the context got cancelled by step (2), and boom, returns "replication cancelled"

This cannot happen anymore because we lock only once and update everything atomically, including stopping the previous replica. So by the time client 2 grabs the lock in the example above, the previous REPLICAOF command is not in some intermediate state. To observe that we did indeed cancel, we should read the logs and see "Stopping replication" COMMANDS_TO_ISSUE - 1 times, plus 1 because of the Shutdown() at the end.
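
A minimal sketch of that single-critical-section flow, reusing the fb2::LockGuard and Replica::Start()/Stop() names from this thread (the surrounding wiring is assumed):

#include <memory>
#include <utility>  // std::exchange

void ReplicaOfV2Sketch(std::shared_ptr<Replica> new_replica) {
  util::fb2::LockGuard lk(replicaof_mu_);
  // Swap and stop inside one critical section: no other REPLICAOF can
  // observe the intermediate state between these steps.
  auto old_replica = std::exchange(tl_replica, std::move(new_replica));
  if (old_replica)
    old_replica->Stop();  // logs "Stopping replication"
  tl_replica->Start();    // no unlock/relock between update and Start()
}

With this shape, each REPLICAOF stops exactly one predecessor, which matches counting COMMANDS_TO_ISSUE - 1 "Stopping replication" log lines plus one for the final Shutdown().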

Bonus points: I suspect we might also be able to solve #4685, but I will need to follow up on that.
