MB-62182: Making the merge planner aware of segment file size #2134
Merged
Thejas-bhat merged 4 commits into master on Mar 27, 2025
FloorSegmentFileSize into the merge_planner
Member
@Thejas-bhat can you confirm that we're considering this for 2.5.0 and that it is ready for review?
Member, Author
Yep, we should consider going with this along with #2100, since the perf tests indicated better control of the file count and the latency with this PR.
CascadingRadium previously approved these changes on Mar 25, 2025
The base branch was changed.
CascadingRadium approved these changes on Mar 27, 2025
Likith101 approved these changes on Mar 27, 2025
ns-codereview pushed a commit to couchbase/cbft that referenced this pull request on Mar 27, 2025
friendly
- blevesearch/bleve#2100
- blevesearch/bleve#2134

Change-Id: I8e3e6e8f60d16094d3d4ddee44115c07a872fcfb
Reviewed-on: https://review.couchbase.org/c/cbft/+/225100
Reviewed-by: Abhi Dangeti <abhinav@couchbase.com>
Tested-by: <thejas.orkombu@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
project-mirrors-bot-tu bot pushed a commit to project-mirrors/forgejo-as-gitea-fork that referenced this pull request on Apr 6, 2025
…-gitea#7468)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [github.com/blevesearch/bleve/v2](https://github.com/blevesearch/bleve) | require | minor | `v2.4.4` -> `v2.5.0` |

---

### Release Notes

<details>
<summary>blevesearch/bleve (github.com/blevesearch/bleve/v2)</summary>

### [`v2.5.0`](https://github.com/blevesearch/bleve/releases/tag/v2.5.0)

[Compare Source](blevesearch/bleve@v2.4.4...v2.5.0)

##### Bug Fixes

- Exact hits to score higher than fuzzy hits, with blevesearch/bleve#2056
- Fix boosting during hybrid search that involves text + nearest neighbor, with blevesearch/bleve#2127
- Addressed bug in IP field handling while highlighting, with blevesearch/bleve#2142
- Graceful error handling within registry, with blevesearch/bleve#2151
- `http/` package (meant for demo purposes) removed from repository to remove vulnerability - [CVE-2022-31022](GHSA-9w9f-6mg8-jp7w), relocated to within https://github.com/blevesearch/bleve-explorer
- Geo radius queries will now advertise distances (within sort values) in readable format, with blevesearch/bleve#2137

##### Improvements

- Vector search requires `faiss` dynamic library to be built from [blevesearch/faiss@352484e](https://github.com/blevesearch/faiss/tree/352484e0fc9d1f8f46737841efe5f26e0f383f71) which is a modified version of [v1.10.0](https://github.com/facebookresearch/faiss/releases/tag/v1.10.0)
- Support for **BM25 scoring**, see: [scoring.md](https://github.com/blevesearch/bleve/blob/v2.5.0/docs/scoring.md#bm25)
- Support for **synonyms' search**, see: [synonyms.md](https://github.com/blevesearch/bleve/blob/v2.5.0/docs/synonyms.md)
- **Significant performance improvements in pre-filtered vector search**, with blevesearch/bleve#2169 + dependent changes
- `auto` fuzziness detection with blevesearch/bleve#2060
- Ability to affect ingestion/drain rate by tuning persister workers with blevesearch/bleve#2100
- Additional config in merge policy for improved merger behavior, with blevesearch/bleve#2134
- Geo improvements: footprint reduction for polygons, better validation and graceful error handling, with blevesearch/bleve#2162 + blevesearch/bleve#2158 + blevesearch/bleve#2165
- Upgrade to RoaringBitmap/roaring@v2.4.5, etcd.io/bbolt@v1.4.0
- More metrics

##### Milestone

- [v2.5.0](https://github.com/blevesearch/bleve/milestone/24)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "* 0-3 * * *" (UTC), Automerge - "* 0-3 * * *" (UTC).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: Gusted <postmaster@gusted.xyz>
Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7468
Reviewed-by: Gusted <gusted@noreply.codeberg.org>
Reviewed-by: Shiny Nematoda <snematoda@noreply.codeberg.org>
Co-authored-by: Renovate Bot <forgejo-renovate-action@forgejo.org>
Co-committed-by: Renovate Bot <forgejo-renovate-action@forgejo.org>
This PR introduces a new option, `FloorSegmentFileSize`, into the merge planner's algorithm, where it is used in the budget calculation. The budget calculation drives the merge planner to keep the actual number of file segments within a budget.

Currently the budget calculation depends on the number of documents. However, this doesn't translate well into resource utilization when the per-doc size, which directly translates to the segment file size, varies significantly across use cases.

The scoring mechanism is kept the same for now, because the factors involved in scoring a set of segments end up canceling out the per-doc size factor. For example, the nonDelRatio, built from `totAfterSize` and `totBeforeSize`, would be computed as `(numLiveDocs * perDocSize) / (totalDocs * perDocSize)`, where the per-doc size cancels out anyway. Similar reasoning applies to the other scoring factors as well; however, this can be subject to future experimentation.

A single persister worker tries to merge `MaxSizeInMemoryMergePerWorker` worth of data, and the corresponding file would also be roughly that size. On the merger side, `FloorSegmentFileSize` is like the lowest value on the logarithmic staircase (the first tier size), from which we come up with the ideal number of segments. So the idea here is to set the first tier size equal to the max size of a fresh file that's been flushed out, which is approximately `MaxSizeInMemoryMergePerWorker`. This makes the merger more aware of how much data is being flushed out to disk, so it can make its decisions accordingly.
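To make the staircase idea concrete, here is a minimal sketch of a tiered budget calculation in which the first tier size plays the role of `FloorSegmentFileSize`. It is an illustration under stated assumptions, not the code in this PR: the function `budgetSegments` and its parameters `maxSegmentsPerTier` and `tierGrowth` are hypothetical names chosen for the example, and sizes are in arbitrary units.

```go
package main

import (
	"fmt"
	"math"
)

// budgetSegments is a hypothetical helper (not bleve's API). It walks a
// logarithmic staircase of tiers: each tier holds up to maxSegmentsPerTier
// segments of tierSize, and each next tier is tierGrowth times larger.
// firstTierSize is the floor of the staircase.
func budgetSegments(totalSize, firstTierSize int64, maxSegmentsPerTier int, tierGrowth float64) int {
	// Clamp inputs so the loop always makes progress.
	tierSize := firstTierSize
	if tierSize < 1 {
		tierSize = 1
	}
	if maxSegmentsPerTier < 1 {
		maxSegmentsPerTier = 1
	}
	if tierGrowth < 1.0 {
		tierGrowth = 1.0
	}

	budget := 0
	for totalSize > 0 {
		segmentsInTier := float64(totalSize) / float64(tierSize)
		if segmentsInTier < float64(maxSegmentsPerTier) {
			// The remaining data fits in this tier without filling it.
			budget += int(math.Ceil(segmentsInTier))
			break
		}
		// This tier is full; account for it and move up the staircase.
		budget += maxSegmentsPerTier
		totalSize -= int64(maxSegmentsPerTier) * tierSize
		tierSize = int64(float64(tierSize) * tierGrowth)
	}
	return budget
}

func main() {
	// Same total size, two different floors.
	fmt.Println(budgetSegments(1000, 10, 10, 10.0))  // small floor: 19
	fmt.Println(budgetSegments(1000, 100, 10, 10.0)) // larger floor: 10
}
```

In this sketch, raising the floor from 10 to 100 units shrinks the budget from 19 to 10 segments for the same total size, which is the kind of tighter control over file count that a floor matched to the flushed file size is meant to give.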