Fix write checkpoint race condition #201
Merged
The issue
The issue is a race condition that sometimes causes a new write checkpoint to not be acknowledged on an existing connection.
Symptoms of the issue:
`downloaded: true` state.

The issue appeared to be more prevalent on React Native than on other platforms, likely due to the specific timing of requests, although the underlying cause is on the service.
What happened
The expected sequence of events with a write checkpoint is:
However, step 4 can be delayed, resulting in this order:
This is difficult to reproduce in practice, since it's very sensitive to timings. I added a test, but the test never managed to reproduce the issue unless I added an artificial delay in the process.
The issue is more likely to happen when there is a low latency to the source database, and a higher latency or higher load on the bucket storage database.
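To make the timing sensitivity concrete, here is a minimal deterministic sketch of the race. It is an illustration only: the function name, the event names, and the two-event model are assumptions, not the service's actual code. It assumes the checkpoint can only be acknowledged if its position is already stored when the replication message is processed.

```typescript
// Hypothetical model, not the service's code: two events race, and the
// outcome depends only on which one lands first.
type RaceEvent = { atMs: number; kind: "position-stored" | "message-received" };

function isCheckpointAcknowledged(
  storageLatencyMs: number, // time until the position is stored in bucket storage
  sourceLatencyMs: number   // time until the message from the source db is processed
): boolean {
  const events: RaceEvent[] = [
    { atMs: storageLatencyMs, kind: "position-stored" },
    { atMs: sourceLatencyMs, kind: "message-received" },
  ];
  // Process the events in time order.
  events.sort((a, b) => a.atMs - b.atMs);

  let positionStored = false;
  let acknowledged = false;
  for (const event of events) {
    if (event.kind === "position-stored") {
      positionStored = true;
    } else {
      // The message can only acknowledge a position that is already stored.
      acknowledged = positionStored;
    }
  }
  return acknowledged;
}
```

In this model, a fast source database combined with slow bucket storage (for example `isCheckpointAcknowledged(50, 5)`) misses the acknowledgement, while the reverse (`isCheckpointAcknowledged(5, 20)`) does not. This matches the observation that an artificial delay was needed to reproduce the issue in a test.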
The fix
This fixes the issue for MongoDB and Postgres source dbs by splitting `getReplicationHead` into a three-phase process:

This makes sure that by the time the message is received, the position would already be present in write_checkpoints.
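The ordering guarantee described here can be sketched as follows. This is a simplified illustration, not the actual `getReplicationHead` implementation: `WriteCheckpointStore`, `createCheckpoint`, `readHead`, and `triggerMessage` are hypothetical names, and the sketch only shows the invariant that the position is persisted before anything can cause the message to be sent.

```typescript
// Hypothetical sketch: none of these names come from the service's code.
type Position = number;

class WriteCheckpointStore {
  // Stands in for the write_checkpoints storage mentioned above.
  private readonly positions = new Map<string, Position>();

  put(checkpointId: string, position: Position): void {
    this.positions.set(checkpointId, position);
  }

  // Returns the checkpoints whose stored position is at or before the message position.
  acknowledgedBy(messagePosition: Position): string[] {
    const ids: string[] = [];
    this.positions.forEach((position, id) => {
      if (position <= messagePosition) ids.push(id);
    });
    return ids;
  }
}

function createCheckpoint(
  store: WriteCheckpointStore,
  checkpointId: string,
  readHead: () => Position,
  triggerMessage: (position: Position) => void
): void {
  const head = readHead();
  // Persist first, so any message triggered afterwards finds the position.
  store.put(checkpointId, head);
  triggerMessage(head);
}
```

Because `store.put` runs before `triggerMessage`, a handler that calls `store.acknowledgedBy` when the message arrives always sees the new checkpoint, regardless of how the two latencies compare.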
This does not fix the issue for MySQL yet, but does not make it any worse for MySQL.