add support for CTID bucketing with snapshotNumPartitionsOverride (#3624)
PeerDB supports parallel snapshotting to optimize initial load time.
There are two ways to do this today:
1) We compute the total row count of a table and bucket the data evenly
by watermark column. This is the default behavior and provides an even
distribution of data for the parallel initial load. However, calculating
the total row count can be slow on large tables.
2) With `SnapshotNumPartitionsOverride` enabled, rather than calculating
the number of partitions, we fetch the min / max values of the watermark
column and step evenly between them to get the partition ranges. In this
case we can't guarantee an even distribution of data across partitions,
but we can speed up the initial snapshot on large tables by skipping the
total row count calculation.
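As a rough illustration of approach 2), the sketch below splits an
integer watermark range into evenly sized sub-ranges. It is not PeerDB's
implementation: the function name `split_range` and the assumption of an
integer watermark column are hypothetical, chosen only to show the idea.

```python
def split_range(lo: int, hi: int, num_partitions: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lo, hi] into near-equal inclusive sub-ranges.

    Hypothetical helper: lo / hi stand in for the min / max of the
    watermark column, num_partitions for the override value.
    """
    if num_partitions <= 0 or hi < lo:
        return []
    span = hi - lo + 1
    # Never create more partitions than there are distinct values.
    n = min(num_partitions, span)
    step, rem = divmod(span, n)
    ranges = []
    start = lo
    for i in range(n):
        # The first `rem` partitions absorb the remainder, one value each.
        width = step + (1 if i < rem else 0)
        ranges.append((start, start + width - 1))
        start += width
    return ranges
```

The ranges are even in terms of watermark values, not rows: if the
values are sparse or clustered, some partitions will hold far more rows
than others, which is the trade-off described in 2).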
This change handles case 2) when the watermark column is not explicitly
passed in and defaults to `ctid`; currently the override is a no-op in
that case.
For append-only tables, we expect an even distribution of data, so this
change should be a pure performance win. For updatable tables, it may
result in an uneven distribution of data across partitions, but that is
already the case with approach 2).
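To make the `ctid` case concrete: a PostgreSQL `ctid` is a
(block number, tuple index) pair, so one way to bucket by `ctid` is to
split the table's block range evenly. The sketch below is an assumption
about the general technique, not the code in this change;
`ctid_partitions` and its `num_blocks` input are hypothetical names.

```python
def ctid_partitions(num_blocks: int, num_partitions: int) -> list[tuple[str, str]]:
    """Return half-open (start, end) ctid bounds covering blocks [0, num_blocks).

    Hypothetical helper: each pair is meant for a predicate of the form
    ctid >= start AND ctid < end.
    """
    if num_blocks <= 0 or num_partitions <= 0:
        return []
    # Never create more partitions than there are blocks.
    n = min(num_partitions, num_blocks)
    step, rem = divmod(num_blocks, n)
    bounds = []
    start = 0
    for i in range(n):
        # The first `rem` partitions take one extra block each.
        end = start + step + (1 if i < rem else 0)
        bounds.append((f"({start},0)", f"({end},0)"))
        start = end
    return bounds
```

In an append-only table rows fill blocks roughly in insertion order, so
equal block ranges hold roughly equal row counts. Updates and deletes
leave blocks with differing amounts of live data, which is why
updatable tables may end up with uneven partitions.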
Thank you @alon-zeltser-cyera for the contribution.
Separate note: `SnapshotNumPartitionsOverride` was introduced to support
the use case where the number of partitions is explicitly provided.
There's no reason it has to be tied to either of the two initial
snapshot bucketing approaches, so we may want to evaluate decoupling the
two concepts later on if we want to provide this feature more widely.
TODO:
- [x] Add e2e test
- [x] Run test against a large table
---------
Co-authored-by: Alon Zeltser <[email protected]>