@rkistner rkistner commented Feb 14, 2025

The issue

The issue is a race condition that sometimes causes a new write checkpoint to not be acknowledged on an existing connection.

Symptoms of the issue:

  1. The client would stay in downloaded: true state.
  2. If the client was offline before, changes made in that offline period would not be applied.
  3. Once an additional change is made on the backend db, the write checkpoint will be acknowledged and the issue will be resolved.

The issue appeared to be more prevalent on React Native than on other platforms, likely due to the specific timing of requests, although the underlying cause is in the service.

What happened

The expected sequence of events with a write checkpoint is:

  1. Client has an active sync connection, listening for new checkpoints.
  2. Client calls write-checkpoint2.json.
  3. The service creates a new LSN, which is also emitted on the WAL / MongoDB change stream.
  4. The service writes the LSN to write_checkpoints and returns a numeric write checkpoint to the client.
  5. The replication process picks up the new LSN and writes it to the sync-rules document.
  6. The sync process picks up the new LSN/checkpoint from the sync-rules.
  7. The sync process compares it to write_checkpoints, and sees that it matches, and sends it to the client.

Now what could happen is that step 4 is delayed, resulting in this order:

  1. Client has an active sync connection, listening for new checkpoints.
  2. Client calls write-checkpoint2.json.
  3. The service creates a new LSN, which is also emitted on the WAL / MongoDB change stream.
  4. The replication process picks up the new LSN and writes it to the sync-rules document.
  5. The sync process picks up the new LSN/checkpoint from the sync-rules.
  6. The sync process compares it to write_checkpoints, but there is no new entry in write_checkpoints yet.
  7. The service writes the LSN to write_checkpoints and returns a numeric write checkpoint to the client.
  8. When the next LSN comes in, the sync process re-checks write_checkpoints and sends the write checkpoint to the client. But if there is no traffic on the db, that can be delayed significantly.

This is difficult to reproduce in practice, since it's very sensitive to timings. I added a test, but the test never managed to reproduce the issue unless I added an artificial delay in the process.

The issue is more likely to happen when there is a low latency to the source database, and a higher latency or higher load on the bucket storage database.
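
The two orderings above can be modeled with a small in-memory sketch. This is only an illustration of the timing problem, not the service code: `CheckpointModel` and all of its methods are hypothetical names, LSNs are plain numbers, and a write checkpoint is dropped from the model once it has been sent.

```typescript
// Hypothetical in-memory model of the sequence described above.
// None of these names come from the service itself.
type Lsn = number;

class CheckpointModel {
  private head: Lsn = 0; // latest position on the source database
  private writeCheckpoints: Record<string, Lsn> = {}; // stand-in for write_checkpoints
  readonly acknowledged: Lsn[] = []; // write checkpoints sent to the client

  // The service creates a new LSN on the source database.
  createLsn(): Lsn {
    this.head += 1;
    return this.head;
  }

  // The service writes the LSN to write_checkpoints.
  persistWriteCheckpoint(client: string, lsn: Lsn): void {
    this.writeCheckpoints[client] = lsn;
  }

  // The sync process sees a replicated LSN and compares it to write_checkpoints.
  onReplicatedLsn(client: string, replicated: Lsn): void {
    const pending = this.writeCheckpoints[client];
    if (pending !== undefined && pending <= replicated) {
      this.acknowledged.push(pending);
      delete this.writeCheckpoints[client]; // simplification: forget it once sent
    }
  }
}

// Expected order: the write checkpoint is persisted before the sync process checks.
function expectedOrder(): Lsn[] {
  const model = new CheckpointModel();
  const lsn = model.createLsn();
  model.persistWriteCheckpoint("client-1", lsn);
  model.onReplicatedLsn("client-1", lsn);
  return model.acknowledged.slice();
}

// Racy order: the sync process checks before the write checkpoint is persisted.
function racyOrder(): { beforeNextLsn: Lsn[]; afterNextLsn: Lsn[] } {
  const model = new CheckpointModel();
  const lsn = model.createLsn();
  model.onReplicatedLsn("client-1", lsn); // nothing in write_checkpoints yet
  model.persistWriteCheckpoint("client-1", lsn); // delayed write
  const beforeNextLsn = model.acknowledged.slice();
  model.onReplicatedLsn("client-1", model.createLsn()); // unrelated later change
  return { beforeNextLsn, afterNextLsn: model.acknowledged.slice() };
}
```

In this model, `racyOrder()` leaves the client unacknowledged until some later change produces another LSN, which matches the third symptom listed earlier.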

The fix

This fixes the issue for MongoDB and Postgres source dbs by splitting getReplicationHead into a three-phase process:

  1. Get the current replication position from the source database.
  2. Persist that position in the storage database write_checkpoints.
  3. Send a new message on the replication stream.

This ensures that by the time the message is received, the position is already present in write_checkpoints.

This does not fix the issue for MySQL yet, but does not make it any worse for MySQL.
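
The three-phase ordering can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `ReplicationSource`, `CheckpointStorage`, `createWriteCheckpoint`, and their methods are hypothetical names, positions are plain numbers, and the calls are synchronous for brevity where the real service would perform I/O.

```typescript
// Hypothetical interfaces; the real service's API is not shown in this PR description.
type Position = number;

interface ReplicationSource {
  getCurrentPosition(): Position; // phase 1: read only, emits nothing
  emitMessage(): void; // phase 3: produces a new message on the replication stream
}

interface CheckpointStorage {
  // phase 2: persists the position and returns a numeric write checkpoint
  persist(userId: string, position: Position): number;
}

function createWriteCheckpoint(
  source: ReplicationSource,
  storage: CheckpointStorage,
  userId: string
): number {
  const position = source.getCurrentPosition(); // 1. get the current position
  const checkpointId = storage.persist(userId, position); // 2. persist it first
  source.emitMessage(); // 3. only now trigger the replication stream
  return checkpointId;
}

// Usage with in-memory fakes that record the order of calls.
const calls: string[] = [];
let head: Position = 41;
let nextId = 0;
const stored: Record<string, Position> = {};

const source: ReplicationSource = {
  getCurrentPosition: () => {
    calls.push("getCurrentPosition");
    return head;
  },
  emitMessage: () => {
    calls.push("emitMessage");
    head += 1; // the new message advances the stream position
  },
};

const storage: CheckpointStorage = {
  persist: (userId, position) => {
    calls.push("persist");
    stored[userId] = position;
    nextId += 1;
    return nextId;
  },
};

const checkpointId = createWriteCheckpoint(source, storage, "user-1");
```

Because the stored position (41 in this example) is persisted before the message that advances the stream is emitted, any consumer that reacts to that message will find the entry already in place.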


changeset-bot bot commented Feb 14, 2025

🦋 Changeset detected

Latest commit: 66bc6f1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages

| Name | Type |
| --- | --- |
| @powersync/service-module-postgres | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-core | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-image | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-postgres-storage | Patch |
| test-client | Patch |


@rkistner rkistner merged commit ffc8d98 into main Feb 17, 2025
16 checks passed
@rkistner rkistner deleted the fix-write-checkpoint-race-condition branch February 17, 2025 07:05
@rkistner rkistner mentioned this pull request Feb 19, 2025