-
Notifications
You must be signed in to change notification settings - Fork 15
REP-5230 Check documents while they’re being fetched. #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
FGasper
merged 17 commits into
mongodb-labs:main
from
FGasper:REP-5230-check-docs-as-fetched
Nov 7, 2024
Merged
REP-5230 Check documents while they’re being fetched. #34
FGasper
merged 17 commits into
mongodb-labs:main
from
FGasper:REP-5230-check-docs-as-fetched
Nov 7, 2024
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
jsflax
approved these changes
Nov 6, 2024
jsflax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, LGTM.
tdq45gj
approved these changes
Nov 7, 2024
Collaborator
tdq45gj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This was referenced Nov 14, 2024
FGasper
added a commit
that referenced
this pull request
Nov 15, 2024
Previously all change events’ document sizes were recorded “pessimistically” as 16 MiB. This helped to avoid OOMs. It came at a cost, though: when the recheck queue is converted to recheck tasks, those tasks are sized so as to approximate the configured partition size. Thus, if the partition size was 400 MiB (the default), only 25 change events could fit into a recheck task. If there are 250,000 pending rechecks—not unfeasible for a large, busy data set after generation 0—that’s 10,000 tasks to create and perform, which is inefficient. PR #34 all but eliminates the OOMs, which undercuts that “pessimism”’s benefit. It makes more sense now to allow for the possibility of large recheck tasks in order to minimize the number of tasks. Moreover, we can get pretty good confidence about document sizes from change events anyway: - Insert & replace events always include the `fullDocument`. - Update events can be configured to include the current `fullDocument`. - Delete events refer to a document that probably no longer exists, so we can safely estimate its size to be “small”. This changeset, then: 1. configures the change stream to include `fullDocument` in update events, and 2. records document sizes from the change event.
FGasper
added a commit
that referenced
this pull request
Nov 15, 2024
PR #34 errantly omitted some error checks. This changeset restores those.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously this tool would, for each partition, fetch all documents for src & dst then compare them. The problem here is that, if any partition happens to be oversize, the verifier might consume so much memory that it crashes.
This changeset solves that problem by making the verifier compare documents as they’re fetched (i.e., in a thread concurrent to the document-fetching threads) and sorting the query results.
For example, assume the source has documents [A B C0 D], and the destination has [A C1 D]. (C0 and C1 mismatch.) One thread will fetch from the source, and another will fetch from the destination. Before this changeset, all 7 documents would be cached in memory before they’re compared. With this change, a third thread will receive documents via channels from the other two threads. The checker thread compares each incoming document against its “peer” cache and deletes the cache entry. For example, it may work thus:
At the end, the only cached document will be the source’s B. Since this document remains in the source’s document cache, we know it’s missing on the destination. Note that we only cached at most 3 documents, and that we reduced memory consumption as the fetch proceeded.
This may slightly affect performance because the document-checker thread now constrains the reader threads. At the same time, the CPU-intensive work of comparing the documents now happens alongside the I/O-intensive work of fetching them, which may offset that or even effect a net improvement. (In testing no effect was visible.)
A test against a large dataset showed that this changeset reduced memory thus: