[MongoDB Storage] Only compact changed buckets / Indexed bucket_state #375
#343 introduced pre-computing of bucket checksums as the final step of initial replication.
The problem is that the query for buckets that need checksum calculations (on the bucket_state collection) was not indexed, and could not be indexed effectively, which caused timeouts on instances with a large number of buckets. We've seen this step time out on an instance with 100 million buckets.
This tweaks the query to use a new partial index on bucket_state, which fixes these timeouts.
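For readers unfamiliar with partial indexes, here is a minimal sketch of the general idea, not the actual change in this PR. The field names (`_id.g`, `changes_since_compact.count`), the index name `dirty_buckets`, and the `IndexTarget` type and helper functions are all assumptions made for illustration. `IndexTarget` is a small structural stand-in so the snippet runs without the driver; the MongoDB Node.js driver's `Collection.createIndex(keys, options)` should accept the same arguments.

```typescript
// Minimal structural type so the sketch runs without the MongoDB driver.
// Assumption: the real collection object exposes a compatible createIndex().
interface IndexTarget {
  createIndex(
    keys: Record<string, 1 | -1>,
    options: { name: string; partialFilterExpression: Record<string, unknown> }
  ): Promise<string>;
}

// Hypothetical index definition: only buckets with changes since the last
// compact are indexed, so the index stays small even with many buckets.
const DIRTY_BUCKETS_INDEX: {
  keys: Record<string, 1 | -1>;
  options: { name: string; partialFilterExpression: Record<string, unknown> };
} = {
  keys: { '_id.g': 1, 'changes_since_compact.count': -1 },
  options: {
    name: 'dirty_buckets',
    partialFilterExpression: { 'changes_since_compact.count': { $gt: 0 } }
  }
};

// Creates the partial index on the given collection-like object.
async function ensureDirtyBucketsIndex(collection: IndexTarget): Promise<string> {
  return collection.createIndex(DIRTY_BUCKETS_INDEX.keys, DIRTY_BUCKETS_INDEX.options);
}

// Builds a filter that repeats the partial filter condition. MongoDB only
// considers a partial index when the query condition implies that filter.
function dirtyBucketsFilter(groupId: number): Record<string, unknown> {
  return { '_id.g': groupId, 'changes_since_compact.count': { $gt: 0 } };
}
```

The key design point is that the query filter and the `partialFilterExpression` must agree: a query that omits the `$gt: 0` condition cannot use this index and would fall back to a scan.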
This also changes the normal compact process to only compact buckets that have changed since the last compact. Compacting 100k+ modified buckets this way is still slow, since we run around 3 MongoDB queries/operations per bucket, but it should be no slower than before, and skipping unmodified buckets should make a significant difference.
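The "only compact changed buckets" behavior can be illustrated with a small in-memory sketch. This is not the code in this PR: the `BucketState` shape, the `opsSinceCompact` counter, and the `compactChanged` function are hypothetical names, and the real per-bucket work (the MongoDB operations) is represented by a callback.

```typescript
// Hypothetical per-bucket state; the counter name is an assumption.
interface BucketState {
  bucket: string;
  // Number of operations written to the bucket since it was last compacted.
  opsSinceCompact: number;
}

// Compacts only buckets with changes since the last compact, then resets
// their counters. Returns the number of buckets that were compacted.
function compactChanged(
  states: BucketState[],
  compactBucket: (bucket: string) => void
): number {
  let compacted = 0;
  for (const state of states) {
    if (state.opsSinceCompact === 0) {
      continue; // Unmodified bucket: skip the per-bucket work entirely.
    }
    compactBucket(state.bucket); // Stands in for the per-bucket MongoDB operations.
    state.opsSinceCompact = 0;
    compacted++;
  }
  return compacted;
}
```

With this shape, the cost of a compact run scales with the number of modified buckets rather than the total number of buckets, and a second run with no new writes does no per-bucket work.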