@FGasper FGasper commented Nov 5, 2024

Previously, for each partition, this tool fetched all documents from both the source and the destination and only then compared them. If any partition happened to be oversized, the verifier could consume so much memory that it crashed.

This changeset solves that problem by making the verifier compare documents as they’re fetched (i.e., in a thread that runs concurrently with the document-fetching threads) and by sorting the query results.

For example, assume the source has documents [A B C0 D], and the destination has [A C1 D]. (C0 and C1 mismatch.) One thread will fetch from the source, and another will fetch from the destination. Before this changeset, all 7 documents would be cached in memory before they were compared. With this change, a third thread receives documents via channels from the other two threads. This checker thread looks up each incoming document in the other side’s (“peer”) cache. If the peer is there, the checker compares the two and deletes the cache entry; if not, it caches the incoming document. For example, it may work thus:

  1. Receive A from source & cache it.
  2. Receive A from destination. We have the source’s A, so we can compare them and discard the cached A.
  3. Receive B from source & cache it.
  4. Receive C1 from destination & cache it.
  5. (The source-reader thread lags a bit.) Receive D from destination & cache it.
  6. Receive C0 from the source. We have the destination’s C1, so we can compare them and discard the cached C1. They mismatch, so we save that result.
  7. Receive D from the source. Compare it with the destination’s D, and discard the cached one.

At the end, the only cached document will be the source’s B. Since this document remains in the source’s document cache, we know it’s missing on the destination. Note that we only cached at most 3 documents, and that we reduced memory consumption as the fetch proceeded.
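The walkthrough above can be sketched in Go roughly as follows. This is a minimal illustration of the peer-cache idea, not the code from this changeset: `doc`, `result`, `stream`, and `compareAsFetched` are hypothetical names, documents are reduced to an ID plus raw bytes, and the fetcher goroutines read from in-memory slices rather than from a cluster.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// doc is a hypothetical stand-in for a fetched document.
type doc struct {
	ID   string
	Body []byte
}

// result collects what the checker learned (hypothetical type).
type result struct {
	Mismatched   []string // IDs present on both sides with different bodies
	MissingOnDst []string // IDs left over in the source cache
	MissingOnSrc []string // IDs left over in the destination cache
}

// stream plays the role of a fetcher thread: it sends each document
// on an unbuffered channel and closes the channel when done.
func stream(docs []doc) <-chan doc {
	ch := make(chan doc)
	go func() {
		defer close(ch)
		for _, d := range docs {
			ch <- d
		}
	}()
	return ch
}

// compareAsFetched is the checker: it receives from both fetchers and
// keeps a document cached only until its peer arrives.
func compareAsFetched(src, dst []doc) result {
	srcCh, dstCh := stream(src), stream(dst)
	srcCache := map[string]doc{}
	dstCache := map[string]doc{}
	var res result

	check := func(d doc, own, peer map[string]doc) {
		if other, ok := peer[d.ID]; ok {
			delete(peer, d.ID) // the peer is no longer needed in memory
			if !bytes.Equal(other.Body, d.Body) {
				res.Mismatched = append(res.Mismatched, d.ID)
			}
			return
		}
		own[d.ID] = d // no peer yet: cache until it shows up
	}

	// A nil channel is never selected, so each side drops out once closed.
	for srcCh != nil || dstCh != nil {
		select {
		case d, ok := <-srcCh:
			if !ok {
				srcCh = nil
				continue
			}
			check(d, srcCache, dstCache)
		case d, ok := <-dstCh:
			if !ok {
				dstCh = nil
				continue
			}
			check(d, dstCache, srcCache)
		}
	}

	// Whatever is still cached never found a peer on the other side.
	for id := range srcCache {
		res.MissingOnDst = append(res.MissingOnDst, id)
	}
	for id := range dstCache {
		res.MissingOnSrc = append(res.MissingOnSrc, id)
	}
	sort.Strings(res.Mismatched)
	sort.Strings(res.MissingOnDst)
	sort.Strings(res.MissingOnSrc)
	return res
}

func main() {
	src := []doc{{"A", []byte("a")}, {"B", []byte("b")}, {"C", []byte("c0")}, {"D", []byte("d")}}
	dst := []doc{{"A", []byte("a")}, {"C", []byte("c1")}, {"D", []byte("d")}}
	res := compareAsFetched(src, dst)
	fmt.Printf("mismatched=%v missingOnDst=%v missingOnSrc=%v\n",
		res.Mismatched, res.MissingOnDst, res.MissingOnSrc)
}
```

The order in which documents arrive depends on goroutine scheduling, so the peak cache size varies from run to run, but the final result does not: C is reported as a mismatch and B as missing on the destination.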

This may slightly affect performance because the document-checker thread now constrains the reader threads. At the same time, the CPU-intensive work of comparing the documents now happens alongside the I/O-intensive work of fetching them, which may offset that or even effect a net improvement. (In testing, no effect on performance was visible.)

A test against a large dataset showed that this changeset reduced memory usage as follows:

  • VSZ: 11 GiB to 3 GiB
  • RSS: 3 GiB to 42 MiB

@FGasper FGasper requested a review from tdq45gj November 5, 2024 20:24
@jsflax jsflax left a comment

Great work, LGTM.

@tdq45gj tdq45gj left a comment

LGTM

@FGasper FGasper merged commit 08cec97 into mongodb-labs:main Nov 7, 2024
5 checks passed
@FGasper FGasper deleted the REP-5230-check-docs-as-fetched branch November 7, 2024 18:11
FGasper added a commit that referenced this pull request Nov 15, 2024
Previously all change events’ document sizes were recorded “pessimistically” as 16 MiB. This helped to avoid OOMs. It came at a cost, though: when the recheck queue is converted to recheck tasks, those tasks are sized so as to approximate the configured partition size. Thus, if the partition size was 400 MiB (the default), only 25 change events could fit into a recheck task. If there are 250,000 pending rechecks—not unfeasible for a large, busy data set after generation 0—that’s 10,000 tasks to create and perform, which is inefficient.
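The arithmetic in the paragraph above can be checked with a small Go sketch. It uses a deliberately simplified model in which a recheck task holds as many documents as fit into one partition at the assumed per-document size; `recheckTaskCount` is a hypothetical helper, and the 1 KiB figure in the second call is an invented document size chosen only for contrast.

```go
package main

import "fmt"

const mib = 1 << 20

// recheckTaskCount estimates how many recheck tasks are needed when
// every pending recheck is assumed to be assumedDocBytes in size.
// Simplified model; the real task sizing may differ.
func recheckTaskCount(pendingRechecks, partitionBytes, assumedDocBytes int) int {
	perTask := partitionBytes / assumedDocBytes
	if perTask < 1 {
		perTask = 1 // a task always holds at least one document
	}
	return (pendingRechecks + perTask - 1) / perTask // ceiling division
}

func main() {
	// Pessimistic 16 MiB per document: 25 rechecks per 400 MiB task.
	fmt.Println(recheckTaskCount(250_000, 400*mib, 16*mib)) // 10000
	// Hypothetical 1 KiB documents: everything fits into a single task.
	fmt.Println(recheckTaskCount(250_000, 400*mib, 1024)) // 1
}
```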

PR #34 all but eliminates the OOMs, which undercuts the benefit of that “pessimism”. It makes more sense now to allow for the possibility of large recheck tasks in order to minimize the number of tasks. Moreover, we can get pretty good confidence about document sizes from change events anyway:
- Insert & replace events always include the `fullDocument`.
- Update events can be configured to include the current `fullDocument`.
- Delete events refer to a document that probably no longer exists, so we can safely estimate its size to be “small”.

This changeset, then:
1. configures the change stream to include `fullDocument` in update events, and
2. records document sizes from the change event.
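As a rough illustration of recording sizes from change events, the Go sketch below takes the size of `fullDocument` when the event carries one and otherwise falls back to a “small” estimate. It is not the code from this commit: `changeEvent` is a driver-free stand-in for a real change stream event, and `smallDocBytes` is a placeholder value rather than the tool's actual constant.

```go
package main

import "fmt"

// changeEvent is a hypothetical, simplified change stream event.
type changeEvent struct {
	OperationType string
	FullDocument  []byte // raw document bytes; nil when the event has none
}

// smallDocBytes is an illustrative placeholder for "small".
const smallDocBytes = 1024

// estimatedDocSize returns the size to record for the event's document.
func estimatedDocSize(ev changeEvent) int {
	if len(ev.FullDocument) > 0 {
		// Insert and replace events, and update events when the stream
		// is configured to include the current document, land here.
		return len(ev.FullDocument)
	}
	// Delete events (and any event without a document) get the estimate.
	return smallDocBytes
}

func main() {
	insert := changeEvent{OperationType: "insert", FullDocument: make([]byte, 2048)}
	del := changeEvent{OperationType: "delete"}
	fmt.Println(estimatedDocSize(insert), estimatedDocSize(del))
}
```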
FGasper added a commit that referenced this pull request Nov 15, 2024
PR #34 errantly omitted some error checks. This changeset restores those.