REP-5230 Check documents while they’re being fetched. #34

FGasper · 2024-11-05T13:28:53Z

Previously this tool would, for each partition, fetch all documents for src & dst then compare them. The problem here is that, if any partition happens to be oversize, the verifier might consume so much memory that it crashes.

This changeset solves that problem by making the verifier compare documents as they’re fetched (i.e., in a thread concurrent to the document-fetching threads) and sorting the query results.

For example, assume the source has documents [A B C0 D], and the destination has [A C1 D]. (C0 and C1 mismatch.) One thread will fetch from the source, and another will fetch from the destination. Before this changeset, all 7 documents would be cached in memory before they’re compared. With this change, a third thread will receive documents via channels from the other two threads. The checker thread compares each incoming document against its “peer” cache and deletes the cache entry. For example, it may work thus:

1. Receive A from source & cache it.
2. Receive A from destination. We have the source’s A, so we can compare them and discard the cached A.
3. Receive B from source & cache it.
4. Receive C1 from destination & cache it.
5. (The source-reader thread lags a bit.) Receive D from destination & cache it.
6. Receive C0 from the source. We have the destination’s C1, so we can compare them and discard the cached C1. They mismatch, so we save that result.
7. Receive D from the source. Compare with the destination’s, and discard the cached one.

At the end, the only cached document will be the source’s B. Since this document remains in the source’s document cache, we know it’s missing on the destination. Note that we only cached at most 3 documents, and that we reduced memory consumption as the fetch proceeded.
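
The walkthrough above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the verifier's actual code: the `doc` and `result` types and the `feed` and `compare` functions are hypothetical names, documents are reduced to an ID plus a string body, and the two reader goroutines stand in for the source and destination fetchers.

```go
package main

import (
	"fmt"
	"sort"
)

// doc is a hypothetical stand-in for a fetched document.
type doc struct {
	id   string
	body string
}

// result summarizes one comparison run.
type result struct {
	mismatched   []string // IDs found on both sides with different bodies
	missingOnDst []string // IDs left in the source cache at the end
	missingOnSrc []string // IDs left in the destination cache at the end
	maxCached    int      // high-water mark of cached documents
}

// feed plays the role of a reader thread: it sends docs, then closes the channel.
func feed(docs []doc) <-chan doc {
	ch := make(chan doc)
	go func() {
		defer close(ch)
		for _, d := range docs {
			ch <- d
		}
	}()
	return ch
}

// compare plays the role of the checker thread. Each incoming document is
// checked against the peer cache; a hit is compared and evicted, a miss is cached.
func compare(src, dst <-chan doc) result {
	srcCache := map[string]doc{}
	dstCache := map[string]doc{}
	var res result

	handle := func(d doc, own, peer map[string]doc) {
		if other, ok := peer[d.id]; ok {
			delete(peer, d.id)
			if other.body != d.body {
				res.mismatched = append(res.mismatched, d.id)
			}
			return
		}
		own[d.id] = d
		if n := len(srcCache) + len(dstCache); n > res.maxCached {
			res.maxCached = n
		}
	}

	for src != nil || dst != nil {
		select {
		case d, ok := <-src:
			if !ok {
				src = nil // a nil channel is never ready, so this case is disabled
				continue
			}
			handle(d, srcCache, dstCache)
		case d, ok := <-dst:
			if !ok {
				dst = nil
				continue
			}
			handle(d, dstCache, srcCache)
		}
	}

	// Whatever is still cached has no peer on the other side.
	for id := range srcCache {
		res.missingOnDst = append(res.missingOnDst, id)
	}
	for id := range dstCache {
		res.missingOnSrc = append(res.missingOnSrc, id)
	}
	sort.Strings(res.mismatched)
	sort.Strings(res.missingOnDst)
	sort.Strings(res.missingOnSrc)
	return res
}

func main() {
	src := feed([]doc{{"A", "a"}, {"B", "b"}, {"C", "c0"}, {"D", "d"}})
	dst := feed([]doc{{"A", "a"}, {"C", "c1"}, {"D", "d"}})
	res := compare(src, dst)
	fmt.Println("mismatched:", res.mismatched)
	fmt.Println("missing on destination:", res.missingOnDst)
	fmt.Println("missing on source:", res.missingOnSrc)
	fmt.Println("max cached:", res.maxCached)
}
```

The mismatch and missing-document results do not depend on arrival order, but the cache high-water mark does: it is 3 for the interleaving described above and can differ when the readers are scheduled differently.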

This may slightly affect performance because the document-checker thread now constrains the reader threads. At the same time, the CPU-intensive work of comparing the documents now happens alongside the I/O-intensive work of fetching them, which may offset that or even effect a net improvement. (In testing no effect was visible.)

A test against a large dataset showed that this changeset reduced memory thus:

VSZ: 11 GiB to 3 GiB
RSS: 3 GiB to 42 MiB

jsflax

Great work, LGTM.

tdq45gj

LGTM

Previously all change events’ document sizes were recorded “pessimistically” as 16 MiB. This helped to avoid OOMs. It came at a cost, though: when the recheck queue is converted to recheck tasks, those tasks are sized so as to approximate the configured partition size. Thus, if the partition size was 400 MiB (the default), only 25 change events could fit into a recheck task. If there are 250,000 pending rechecks (not unfeasible for a large, busy data set after generation 0), that’s 10,000 tasks to create and perform, which is inefficient.

PR #34 all but eliminates the OOMs, which undercuts that “pessimism”’s benefit. It makes more sense now to allow for the possibility of large recheck tasks in order to minimize the number of tasks.

Moreover, we can get pretty good confidence about document sizes from change events anyway:

- Insert & replace events always include the `fullDocument`.
- Update events can be configured to include the current `fullDocument`.
- Delete events refer to a document that probably no longer exists, so we can safely estimate its size to be “small”.

This changeset, then:

1. configures the change stream to include `fullDocument` in update events, and
2. records document sizes from the change event.
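
The task-count arithmetic above can be checked with a small Go sketch. It assumes a simple greedy packing of rechecks into tasks, which may not match how the verifier actually sizes tasks; `countTasks` and `uniform` are hypothetical helpers, and the 1 KiB figure is an arbitrary illustrative document size rather than a measured one.

```go
package main

import "fmt"

const MiB = 1 << 20

// countTasks greedily packs documents, in order, into tasks whose total
// recorded size stays at or below partitionBytes. It returns the task count.
// A document larger than partitionBytes gets a task to itself.
func countTasks(docSizes []int64, partitionBytes int64) int {
	tasks := 0
	var used int64
	for _, size := range docSizes {
		if tasks == 0 || used+size > partitionBytes {
			tasks++
			used = 0
		}
		used += size
	}
	return tasks
}

// uniform returns n recorded sizes that all equal size.
func uniform(n int, size int64) []int64 {
	sizes := make([]int64, n)
	for i := range sizes {
		sizes[i] = size
	}
	return sizes
}

func main() {
	const rechecks = 250_000

	// Every change event recorded pessimistically as 16 MiB.
	fmt.Println(countTasks(uniform(rechecks, 16*MiB), 400*MiB)) // 10000

	// Hypothetical case: every document is recorded as 1 KiB.
	fmt.Println(countTasks(uniform(rechecks, 1024), 400*MiB)) // 1
}
```

With 16 MiB per event, exactly 25 events fit in a 400 MiB task, which gives the 10,000 tasks cited above. With small recorded sizes, the same queue collapses into far fewer tasks.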

PR #34 errantly omitted some error checks. This changeset restores those.

FGasper added 15 commits November 4, 2024 15:04

save

d4379c9

save

757b177

minimize memory usage

10697d6

save

01e402f

fix comparison logic

bc1963c

no indirection

cfe597a

fix tests; remove documentmap

b99b9dd

fix tests

063f892

comments

2fb4b09

vendor

d786bbd

mod tidy

39f9f41

exp/slices

576f660

exp slices

9842596

partition test

a406959

Receive docs back & forth.

45ac93e

FGasper requested a review from tdq45gj November 5, 2024 20:24

Merge branch 'main' into REP-5230-check-docs-as-fetched

fde4818

jsflax approved these changes Nov 6, 2024

do CI

f5a023a

tdq45gj approved these changes Nov 7, 2024

FGasper merged commit 08cec97 into mongodb-labs:main Nov 7, 2024
5 checks passed

FGasper deleted the REP-5230-check-docs-as-fetched branch November 7, 2024 18:11

This was referenced Nov 14, 2024

REP-5230 Fix missing error checks. #42

Merged

REP-5283 Report change event document sizes accurately. #43

Merged

FGasper added a commit that referenced this pull request Nov 15, 2024

REP-5230 Fix missing error checks. (#42)

4279431

PR #34 errantly omitted some error checks. This changeset restores those.

FGasper mentioned this pull request Nov 26, 2024

Read documents in parallel #58

Merged
