
Conversation

@rkistner (Contributor) commented May 8, 2025

This is a rare edge case, but can be consistently triggered in some development setups.

Requirements to reproduce:

  1. Postgres storage.
  2. Sync rules that ignore specific rows (the data is replicated to current_data, but not written to bucket_data or synced to clients). This can be done by adding a data query with where false to the sync rules.
  3. Start with one or more records for the above table, and no other records.
  4. Sync one client.
  5. Now add a row that does get synced.
  6. The client will now go into a repeated "Checksum failed for ..." loop.

The issue is in how op_id_sequence is initialized in Postgres storage. We use select last_value from op_id_sequence to get the op_id for a checkpoint, and select nextval('op_id_sequence') for each op_id in bucket_data. last_value generally returns the largest op_id value in bucket_data, and nextval always generates a larger op_id. However, if no op_id has been generated yet when the first checkpoint is created, last_value returns 1, and that value is used for the first checkpoint. The first nextval call then also returns 1, producing a second checkpoint with the same op_id. Combined with checksum caching, this op_id conflict results in the "Checksum failed" loop.
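The sequence behavior can be illustrated with a small simulation (a sketch of Postgres sequence semantics, not the actual service code): on a fresh sequence, last_value is already 1 even though nextval has never been called, so the first nextval call returns that same 1.

```typescript
// Minimal simulation of Postgres sequence semantics (illustration only):
// a fresh sequence starts with last_value = 1 and is_called = false.
class PgSequence {
  lastValue = 1;
  isCalled = false;

  // Mirrors: SELECT last_value FROM op_id_sequence
  currentLastValue(): number {
    return this.lastValue;
  }

  // Mirrors: SELECT nextval('op_id_sequence'). The very first call returns
  // last_value itself; only subsequent calls increment it.
  nextval(): number {
    if (!this.isCalled) {
      this.isCalled = true;
      return this.lastValue; // returns 1 on a fresh sequence
    }
    this.lastValue += 1;
    return this.lastValue;
  }
}

const seq = new PgSequence();
const checkpointOpId = seq.currentLastValue(); // 1, used for the first checkpoint
const firstDataOpId = seq.nextval(); // also 1: the conflict described above
console.log(checkpointOpId === firstDataOpId); // true
```

Once nextval has been called at least once, last_value and nextval diverge as expected, which is why the bug only appears on a freshly initialized storage database.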

To see what's going on, run curl http://localhost:8080/sync/stream -X POST -H "Authorization: Bearer <token>". You should see no data initially:

{"checkpoint":{"last_op_id":"1","buckets":[{"bucket":"global_data[]","checksum":0,"count":0,"priority":3}]}}
{"checkpoint_complete":{"last_op_id":"1"}}

Then add a record to sync. The stream above does not receive a new checkpoint, but re-running the request gives this:

{"checkpoint":{"last_op_id":"1","buckets":[{"bucket":"global_data[]","checksum":0,"count":0,"priority":3}]}}
{"data":{"bucket":"global_data[]","after":"0","has_more":false,"data":[{"op_id":"1","op":"PUT","object_type":"test","object_id":"abc","checksum":287648289,"subkey":"...","data":{"id":"abc"}}],"next_after":"1"}}
{"checkpoint_complete":{"last_op_id":"1"}}

Note above:

  1. last_op_id is still 1.
  2. The checkpoint for the bucket still contains "checksum: 0" and "count: 0".
  3. There is however a data line for that bucket.

This causes the checksums to fail on the client.

The issue specifically happens because the count and checksum for op_id 1 are cached and never refreshed. Even when more data is added, the cached value is only updated incrementally, so it remains incorrect. The service recovers when it is restarted, since that clears the cache.
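A sketch of why the caching compounds the problem (a hypothetical cache, not the service's actual implementation): checksums are cached per checkpoint op_id and a cache hit short-circuits recomputation, so when an op_id is reused, the stale entry from the earlier, empty checkpoint keeps being served.

```typescript
// Hypothetical checksum cache keyed by checkpoint op_id (illustration only).
type Op = { opId: number; checksum: number };

class ChecksumCache {
  private cache = new Map<number, number>(); // checkpoint op_id -> checksum

  checksumAt(checkpointOpId: number, ops: Op[]): number {
    const cached = this.cache.get(checkpointOpId);
    if (cached !== undefined) return cached; // stale if the op_id was reused
    const sum = ops
      .filter((op) => op.opId <= checkpointOpId)
      .reduce((acc, op) => (acc + op.checksum) >>> 0, 0);
    this.cache.set(checkpointOpId, sum);
    return sum;
  }
}

const cache = new ChecksumCache();
// First checkpoint at op_id 1: the bucket is empty, so checksum 0 is cached.
cache.checksumAt(1, []);
// A row is later written, also with op_id 1 (the reused id). The next
// checkpoint is again at op_id 1, so the cached 0 is returned even though
// the client now receives a data line with checksum 287648289.
const stale = cache.checksumAt(1, [{ opId: 1, checksum: 287648289 }]);
console.log(stale); // 0: checkpoint checksum no longer matches the data
```

With distinct op_ids per checkpoint, the cache is sound; it is only the reuse of op_id 1 that turns a cache hit into a wrong answer.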

However, the cache is not the core issue here: re-using op_id 1 for two checkpoints is.

The fix

One simple fix is to call select nextval('op_id_sequence') once as part of the migrations, which initializes the sequence and avoids the issue completely. However, that would also change the op_id sequence in all our test fixtures. Rather than updating them all, and accounting for differences between MongoDB and Postgres storage, this fix checks is_called on the sequence and defaults to 0 instead of 1 in that case.
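The described fix can be sketched as follows. The helper name and the SequenceState shape are illustrative, not the exact code from this PR (the real method is getLastOpIdSequence() in PostgresBucketBatch): if the sequence has never been called, report 0 instead of the uninitialized last_value of 1, so the first checkpoint cannot collide with the op_id that the first nextval will return.

```typescript
// Row shape from: SELECT last_value, is_called FROM op_id_sequence
// (illustrative types, not the PR's actual code).
interface SequenceState {
  last_value: bigint;
  is_called: boolean;
}

// On a fresh sequence, last_value is 1 but nextval() has never been called,
// so no op_id has actually been issued yet. Defaulting to 0 keeps the
// checkpoint op_id strictly below every op_id nextval() will hand out.
function lastOpIdFromSequence(row: SequenceState): bigint {
  return row.is_called ? row.last_value : 0n;
}

// Fresh sequence: checkpoint op_id 0; first nextval() then issues 1.
console.log(lastOpIdFromSequence({ last_value: 1n, is_called: false })); // 0n
// After nextval() has been called: report last_value as before.
console.log(lastOpIdFromSequence({ last_value: 1n, is_called: true })); // 1n
```

This keeps the op_id sequence unchanged for existing fixtures while closing the window in which two consecutive checkpoints could both report op_id 1.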


changeset-bot bot commented May 8, 2025

🦋 Changeset detected

Latest commit: 90b55b8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages, all with a Patch bump:

  • @powersync/service-module-postgres-storage
  • @powersync/service-core-tests
  • @powersync/service-image
  • @powersync/service-module-mongodb
  • @powersync/service-module-mysql
  • @powersync/service-module-postgres
  • @powersync/service-module-mongodb-storage
  • @powersync/service-core
  • test-client


@rkistner rkistner requested a review from Copilot May 8, 2025 11:21
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR fixes an issue in Postgres storage where the op_id_sequence may be incorrectly initialized, leading to op_id conflicts during checkpoint creation. Key changes include:

  • Adding a new test to cover the op_id initialization edge case.
  • Refactoring PostgresBucketBatch to use a centralized getLastOpIdSequence() method.
  • Updating sync rules storage to check for a non-null last_flushed_op.
  • A changeset documenting the patch for Postgres storage.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • packages/service-core-tests/src/tests/register-data-storage-tests.ts: Added a test case for the op_id initialization edge case.
  • modules/module-postgres-storage/src/storage/batch/PostgresBucketBatch.ts: Refactored to use a shared function for reading the op_id_sequence.
  • modules/module-postgres-storage/src/storage/PostgresSyncRulesStorage.ts: Updated condition to explicitly check for a non-null last_flushed_op.
  • .changeset/sixty-melons-bow.md: Added changeset for the patch, with a spelling issue in the commit title.

@rkistner rkistner requested a review from stevensJourney May 8, 2025 11:30
@rkistner rkistner marked this pull request as ready for review May 8, 2025 11:30
@stevensJourney (Collaborator) left a comment


Thanks for this fix!

@rkistner rkistner merged commit 05b9593 into main May 8, 2025
21 checks passed
@rkistner rkistner deleted the fix-postgres-op-id-init branch May 8, 2025 13:42