PCSM-221. Add test to reproduce pcsm duplicate key error when handling shard key updates#144

Merged

inelpandzic merged 13 commits intomainfrom

Dec 11, 2025

Collaborator

sandraromanchenko commented Nov 11, 2025 •

edited by inelpandzic

Loading

Problem

When replicating sharded collections, a race condition can cause duplicate key errors during the replication phase. This occurs when:

A document is inserted with a shard key value (e.g., key_id: 200)
The shard key is immediately updated to a new value (e.g., key_id: 5000)
PCSM clones the document with the updated shard key value
During change stream replication, PCSM processes the original insert event as a ReplaceOne with upsert, but the filter
contains the original shard key value which doesn't match the existing document on the target

The ReplaceOne fails with a duplicate key error on _id because the document already exists but with a different shard key in the filter.

Solution

Added duplicate key error handling for ReplaceOne operations in bulk writes. When a duplicate key error occurs, the failed operation is retried using a delete+insert fallback (delete by _id, then insert the document). Since bulk writes use ordered mode, operations before the failure are already applied and operations after are never executed. The remaining operations are then retried recursively, allowing multiple duplicate key errors in a single batch to be handled sequentially.

sandraromanchenko requested review from boris-ilijic and inelpandzic as code owners

November 11, 2025 06:33

boris-ilijic approved these changes

View reviewed changes

sandraromanchenko force-pushed the PCSM-221 branch from 8887f85 to fc589b2 Compare

November 26, 2025 13:54


          PCSM-221. Add test to reproduce pcsm duplicate key error when handlin…

1a67b07

…g shard key updates

sandraromanchenko force-pushed the PCSM-221 branch from fc589b2 to 1a67b07 Compare

November 26, 2025 13:55

inelpandzic added 2 commits

December 2, 2025 06:44


          Merge branch 'main' into PCSM-221

5a77554


          Merge branch 'main' into PCSM-221

bbd6c45

Copilot AI review requested due to automatic review settings

December 3, 2025 15:39

inelpandzic requested a review from chupe as a code owner

December 3, 2025 15:39

Copilot started reviewing on behalf of inelpandzic

December 3, 2025 15:41

Copilot finished reviewing on behalf of inelpandzic

December 3, 2025 15:43

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a test to reproduce a duplicate key error (PCSM-221) that occurs when the PCSM (Percona Cluster Sync for MongoDB) tool handles shard key updates in sharded collections. The test simulates concurrent shard key updates during the APPLY phase to trigger the race condition.

Key Changes:

Added test_shard_key_update_duplicate_key_error test function that spawns a background thread performing shard key updates while the synchronization is running
Introduced threading and time module imports to support concurrent operations in the test

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/test_collections_sharded.py

+                      key_id = 100 + i
+                      coll.insert_one({"key_id": key_id, "name": f"pre_sync_doc_{i}", "value": f"value_{key_id}"})
+                  stop_event = threading.Event()
+                  def perform_shard_key_updates():

Copilot AI Dec 3, 2025

The num_updates variable is set to 500, but with a 0.05 second sleep per iteration, this would take ~25 seconds to complete. However, the test only runs for 3 seconds before setting the stop event. This means the loop will only complete approximately 60 iterations before stopping. Consider either: (1) reducing num_updates to match the expected runtime (e.g., ~100), or (2) documenting that the thread is intentionally interrupted early for this test scenario.

Suggested change

      
                def perform_shard_key_updates():
          
                def perform_shard_key_updates():
          
                    # Note: The thread running this function is intentionally interrupted early (after ~3 seconds).
          
                    # The loop will not complete all 500 iterations; num_updates is set high to allow for interruption.

Copilot uses AI. Check for mistakes.

tests/test_collections_sharded.py

+                          time.sleep(0.05)
+                  update_thread = threading.Thread(target=perform_shard_key_updates)
+                  update_thread.start()
+                  time.sleep(3)

Copilot AI Dec 3, 2025

Using time.sleep(3) introduces non-deterministic behavior and makes the test timing-dependent. This could lead to flaky test results depending on system load. Consider using synchronization primitives (like the threading.Event already used) to signal when the thread has performed enough operations, or add a comment explaining why the fixed sleep is necessary for reproducing the specific race condition.

Copilot uses AI. Check for mistakes.

tests/test_collections_sharded.py

Comment on lines 105 to 114

+                  def perform_shard_key_updates():
+                      num_updates = 500
+                      for i in range(1, num_updates + 1):
+                          if stop_event.is_set():
+                              break
+                          key_id = 200 + i
+                          new_key_id = 5000 + i
+                          coll.insert_one({"key_id": key_id, "name": f"test_doc_{i}", "value": f"value_{key_id}"})
+                          coll.update_one({"key_id": key_id}, {"$set": {"key_id": new_key_id, "shard_key_updated": True}})
+                          time.sleep(0.05)

Copilot AI Dec 3, 2025

The perform_shard_key_updates function lacks error handling for database operations. If an insert or update fails (e.g., due to network issues or duplicate key errors), the exception will be silently swallowed by the thread, making the test pass even when operations fail. Consider adding try-except blocks that log or store exceptions for later assertion, similar to the pattern in test_pcsm_118_ignore_incomplete_index in test_indexes.py.

Copilot uses AI. Check for mistakes.

tests/test_collections_sharded.py


		t.compare_all_sharded()

		@pytest.mark.parametrize("phase", [Runner.Phase.APPLY])

Copilot AI Dec 3, 2025

Consider adding a @pytest.mark.timeout decorator to prevent the test from hanging indefinitely if the thread or database operations fail to complete. Similar tests using threading (e.g., test_pcsm_118_ignore_incomplete_index in test_indexes.py) use @pytest.mark.timeout(60) to ensure proper cleanup if operations hang.

Suggested change

      
            @pytest.mark.parametrize("phase", [Runner.Phase.APPLY])
          
            @pytest.mark.parametrize("phase", [Runner.Phase.APPLY])
          
            @pytest.mark.timeout(60)

Copilot uses AI. Check for mistakes.

tests/test_collections_sharded.py

+                  time.sleep(3)
+                  with t.run(phase):
+                      stop_event.set()
+                      update_thread.join(timeout=5)

Copilot AI Dec 3, 2025

The join(timeout=5) call may leave the thread running if it doesn't complete within 5 seconds. After the timeout, the thread may still be executing database operations, which could interfere with the subsequent compare_all_sharded() call or leave resources in an inconsistent state. Consider either: (1) checking if the thread is still alive after join and raising an error, or (2) removing the timeout and relying on the stop_event to ensure the thread completes.

Suggested change

      
                    update_thread.join(timeout=5)
          
                    update_thread.join()

Copilot uses AI. Check for mistakes.

inelpandzic added 10 commits

December 4, 2025 06:08


          Merge branch 'main' into PCSM-221

9f4131c


          Handle duplicate key error for replace one afte bulkWrite fails.

1dfa69c


          Get propper write error operation position index.

222036f


          Refactored duplicate key error handling in bulk.go

4534a70

to use a standalone function.


          disable linter

eb1e800


          Merge branch 'main' into PCSM-221

068562e


          Update pyton dep.

f75fc26


          Fix lint errors and warnings.

c9c3acd


          refactor.

a224f98


          Refactor complex nest-if statements and improve

3624fa0

bulk write retry logic.

inelpandzic force-pushed the PCSM-221 branch from 06c078b to 3624fa0 Compare

December 11, 2025 10:32

inelpandzic merged commit f5d9d2a into main

27 checks passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

Copilot code review Copilot Copilot left review comments

boris-ilijic boris-ilijic approved these changes

inelpandzic Awaiting requested review from inelpandzic inelpandzic is a code owner

chupe Awaiting requested review from chupe chupe is a code owner

Labels

None yet