@rkistner commented on Jun 19, 2025

The bug

As background, we calculate an unsigned 32-bit checksum for each synced operation. For checkpoints, we aggregate these checksums by summing them and discarding any higher bits. As a side note, the protocol is a little inconsistent about whether these are returned as signed or unsigned integers, so the clients handle both.
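For reference, here is a minimal TypeScript sketch of that wrap-around aggregation. It's an illustrative helper, not the actual service code:

```ts
// Sum per-operation u32 checksums and keep only the low 32 bits.
function aggregateChecksums(checksums: number[]): number {
  let total = 0n;
  for (const c of checksums) {
    total = (total + BigInt(c >>> 0)) & 0xffff_ffffn;
  }
  return Number(total);
}

// The sum wraps around instead of overflowing:
console.log(aggregateChecksums([0xffff_fffe, 5])); // 3
```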

For MongoDB storage, we stored these u32 numbers directly. Internally, MongoDB could store these as:

  1. int32 (int)
  2. int64 (long)
  3. float8 (double)

What happened is that checksums that don't fit into a signed int32 were converted to double, which was the case for roughly half of the rows.

When aggregating, MongoDB then often has to add an int and a double, so it converts both operands to double and computes the sum in floating point.

In most cases, this is fine. Up to around 4 million operations in a single bucket (see "Why 4 million?" below), the double addition is still exact. However, for buckets with more than 4 million operations, the checksum calculation can be inaccurate, typically off by 1. The result is that the client repeatedly gets checksum failures when syncing the bucket.

If the data is spread over multiple buckets, the same issue does not occur - only when there are more than 4 million operations in a single bucket.
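To make the type coercion concrete, here is a small sketch of how directly-stored u32 checksums end up as mixed BSON types, and why the aggregation then runs in floating point. The collection and field names are illustrative, not the actual PowerSync schema:

```ts
import { MongoClient } from 'mongodb';

async function demo(uri: string) {
  const client = new MongoClient(uri);
  const coll = client.db('demo').collection('bucket_data');

  // With the default BSON serializer, an integer that fits in a signed int32
  // is stored as int32; larger values are stored as a double.
  await coll.insertMany([
    { op_id: 1, checksum: 123 },           // fits in int32 -> stored as int
    { op_id: 2, checksum: 3_000_000_000 }  // > 2^31 - 1    -> stored as double
  ]);

  // $sum over a mix of int and double is computed as a double, which is only
  // exact while the running total stays below 2^53.
  const [result] = await coll
    .aggregate([{ $group: { _id: null, checksum: { $sum: '$checksum' } } }])
    .toArray();
  console.log(result.checksum); // 3000000123, as a double

  await client.close();
}
```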

The fix

This fixes the issue on two levels:

  1. Always store the checksums as a long, instead of int or double.
  2. Convert checksums to long when aggregating, to cater for any existing data (see the sketch at the end of this section).

In theory we could convert the numbers from unsigned to signed int32 when storing, instead of long. However:

  1. The aggregation would still have to convert to long, even if that happens implicitly.
  2. This would change the values being stored, which would be a bigger change and require more testing.

We could still investigate that option in the future.
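As a rough illustration of both levels, reusing the illustrative bucket_data collection from above (a sketch, not the exact implementation in this PR):

```ts
import { Long, Document } from 'mongodb';

// 1. Storing: always wrap the u32 checksum in a BSON Long, so the serializer
//    never chooses int32 or double based on the value.
function checksumDoc(opId: number, checksumU32: number): Document {
  return { op_id: opId, checksum: Long.fromNumber(checksumU32 >>> 0) };
}

// 2. Aggregating: coerce any existing int/double values to long before
//    summing, then reduce the total modulo 2^32 to keep the protocol's
//    wrap-around semantics.
function bucketChecksumPipeline(bucketName: string): Document[] {
  return [
    { $match: { bucket: bucketName } },
    { $group: { _id: null, checksum: { $sum: { $toLong: '$checksum' } } } },
    { $project: { _id: 0, checksum: { $mod: ['$checksum', 0x1_0000_0000] } } }
  ];
}
```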

Why 4 million?

Integers larger than 2^53 cannot be safely stored in a double/float8 without losing precision. Since each checksum can be up to 2^32, that puts an upper bound of 2^53 / 2^32 = 2^21 = 2,097,152 on the number of operations per bucket for which the checksum calculation is always safe.

In practice, checksums have a fairly even spread, averaging around 2^31. This doubles our old practical limit to a little over 4 million operations per bucket.
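A quick way to see these numbers, using plain TypeScript/Node arithmetic:

```ts
// Past 2^53, a double can no longer represent every integer, so adding a
// small value can be silently lost:
const limit = 2 ** 53;              // 9007199254740992
console.log(limit + 1 === limit);   // true

// Worst case (every checksum near 2^32): exact up to 2^21 operations.
console.log(2 ** 53 / 2 ** 32);     // 2097152

// Typical case (checksums averaging ~2^31): roughly twice that, ~4.19 million.
console.log(2 ** 53 / 2 ** 31);     // 4194304
```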


changeset-bot bot commented Jun 19, 2025

🦋 Changeset detected

Latest commit: e8cfbed

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 11 packages:

  Name                                         Type
  @powersync/service-module-mongodb-storage    Patch
  @powersync/service-core                      Patch
  @powersync/service-image                     Patch
  @powersync/service-schema                    Patch
  @powersync/service-module-mongodb            Patch
  @powersync/service-module-mysql              Patch
  @powersync/service-module-postgres           Patch
  @powersync/service-core-tests                Patch
  @powersync/service-module-core               Patch
  @powersync/service-module-postgres-storage   Patch
  test-client                                  Patch


@rkistner changed the title from "Fix checksum calculations in large buckets" to "Fix checksum calculations in large buckets with > 4m rows" on Jun 19, 2025
@rkistner force-pushed the fix-checksum-storage branch from c72d12e to e8cfbed on June 19, 2025 12:57
@rkistner marked this pull request as ready for review on June 19, 2025 12:57
@rkistner requested a review from stevensJourney on June 19, 2025 12:57
@rkistner merged commit 1b326fb into main on Jun 19, 2025
21 checks passed
@rkistner deleted the fix-checksum-storage branch on June 19, 2025 14:05