MB-62182: Making the merge planner aware of segment file size #2134

Merged
Thejas-bhat merged 4 commits into master from merge_plan
Mar 27, 2025

Conversation

@Thejas-bhat
Member

@Thejas-bhat Thejas-bhat commented Jan 28, 2025

This PR introduces a new option, FloorSegmentFileSize, into the merge planner's algorithm, where it is used in the budget calculation. The budget calculation drives the merge planner to keep the actual number of segment files within a budget.

Currently the budget calculation depends on the number of documents. However, this doesn't translate well into resource utilization when the per-doc size, which directly determines the segment file size, varies significantly across use cases. The scoring mechanism is kept the same for now because the factors involved in scoring a set of segments end up canceling out the per-doc size factor. For example, the nonDelRatio, built from totAfterSize and totBeforeSize, would be computed as (numLiveDocs * perDocSize) / (totalDocs * perDocSize), so perDocSize cancels out anyway. Similar reasoning applies to the other scoring factors as well; however, this can be subjected to future experimentation.

A single persister worker tries to merge MaxSizeInMemoryMergePerWorker worth of data, and the corresponding file would also be roughly that size. On the merger side, FloorSegmentFileSize is like the lowest step on the logarithmic staircase (the first tier size), which we use to come up with the ideal number of segments. So the idea here is to set the first tier size equal to the max size of a freshly flushed file, which is approximately MaxSizeInMemoryMergePerWorker. This makes the merger more aware of how much data is being flushed out to disk, so it can make its decisions accordingly.

@Thejas-bhat Thejas-bhat force-pushed the merge_plan branch 2 times, most recently from c539248 to 7b97b52 Compare February 6, 2025 06:02
@Thejas-bhat Thejas-bhat force-pushed the merge_plan branch 2 times, most recently from de1239e to 259ff3d Compare March 3, 2025 05:31
@Thejas-bhat Thejas-bhat force-pushed the merge_plan branch 5 times, most recently from 7dc5c51 to 4d18332 Compare March 13, 2025 04:55
@Thejas-bhat Thejas-bhat changed the title from "WIP: Using fileSize for budget calculation" to "MB-62182: Introducing FloorSegmentFileSize into the merge_planner" Mar 19, 2025
@Thejas-bhat Thejas-bhat changed the title from "MB-62182: Introducing FloorSegmentFileSize into the merge_planner" to "MB-62182: Making the merge planner aware of segment file size" Mar 19, 2025
@abhinavdangeti abhinavdangeti modified the milestones: v2.5.1, v2.5.0 Mar 20, 2025
@abhinavdangeti
Member

@Thejas-bhat can you confirm that we're considering this for 2.5.0 and that it is ready for review?

@Thejas-bhat Thejas-bhat marked this pull request as ready for review March 24, 2025 03:57
@Thejas-bhat
Member Author

Yep, we should consider going with this along with #2100, since the perf tests indicated better control of the file count and the latency with this PR.

@Thejas-bhat Thejas-bhat changed the base branch from batchFlush to master March 27, 2025 05:44
@Thejas-bhat Thejas-bhat dismissed CascadingRadium’s stale review March 27, 2025 05:44

The base branch was changed.

@Thejas-bhat Thejas-bhat merged commit 47f95bc into master Mar 27, 2025
9 checks passed
@CascadingRadium CascadingRadium deleted the merge_plan branch March 27, 2025 10:54
ns-codereview pushed a commit to couchbase/cbft that referenced this pull request Mar 27, 2025
			friendly

- blevesearch/bleve#2100
- blevesearch/bleve#2134

Change-Id: I8e3e6e8f60d16094d3d4ddee44115c07a872fcfb
Reviewed-on: https://review.couchbase.org/c/cbft/+/225100
Reviewed-by: Abhi Dangeti <abhinav@couchbase.com>
Tested-by: <thejas.orkombu@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
project-mirrors-bot-tu bot pushed a commit to project-mirrors/forgejo-as-gitea-fork that referenced this pull request Apr 6, 2025
…-gitea#7468)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [github.com/blevesearch/bleve/v2](https://github.com/blevesearch/bleve) | require | minor | `v2.4.4` -> `v2.5.0` |

---

### Release Notes

<details>
<summary>blevesearch/bleve (github.com/blevesearch/bleve/v2)</summary>

### [`v2.5.0`](https://github.com/blevesearch/bleve/releases/tag/v2.5.0)

[Compare Source](blevesearch/bleve@v2.4.4...v2.5.0)

##### Bug Fixes

-   Exact hits to score higher than fuzzy hits, with blevesearch/bleve#2056
-   Fix boosting during hybrid search that involves text + nearest neighbor, with blevesearch/bleve#2127
-   Addressed bug in IP field handling while highlighting, with blevesearch/bleve#2142
-   Graceful error handling within registry, with blevesearch/bleve#2151
-   `http/` package (meant for demo purposes) removed from repository to remove vulnerability - [CVE-2022-31022](GHSA-9w9f-6mg8-jp7w), relocated to within https://github.com/blevesearch/bleve-explorer
-   Geo radius queries will now advertise distances (within sort values) in readable format, with blevesearch/bleve#2137

##### Improvements

-   Vector search requires `faiss` dynamic library to be built from [blevesearch/faiss@352484e](https://github.com/blevesearch/faiss/tree/352484e0fc9d1f8f46737841efe5f26e0f383f71) which is a modified version of [v1.10.0](https://github.com/facebookresearch/faiss/releases/tag/v1.10.0)
-   Support for **BM25 scoring**, see: [scoring.md](https://github.com/blevesearch/bleve/blob/v2.5.0/docs/scoring.md#bm25)
-   Support for **synonyms' search**, see: [synonyms.md](https://github.com/blevesearch/bleve/blob/v2.5.0/docs/synonyms.md)
-   **Significant performance improvements in pre-filtered vector search**, with blevesearch/bleve#2169 + dependent changes
-   `auto` fuzziness detection with blevesearch/bleve#2060
-   Ability to affect ingestion/drain rate by tuning persister workers with blevesearch/bleve#2100
-   Additional config in merge policy for improved merger behavior, with blevesearch/bleve#2134
-   Geo improvements: footprint reduction for polygons, better validation and graceful error handling, with blevesearch/bleve#2162 + blevesearch/bleve#2158 + blevesearch/bleve#2165
-   Upgrade to RoaringBitmap/roaring@v2.4.5, etcd.io/bbolt@v1.4.0
-   More metrics

##### Milestone

-   [v2.5.0](https://github.com/blevesearch/bleve/milestone/24)

</details>


Co-authored-by: Gusted <postmaster@gusted.xyz>
Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7468
Reviewed-by: Gusted <gusted@noreply.codeberg.org>
Reviewed-by: Shiny Nematoda <snematoda@noreply.codeberg.org>
Co-authored-by: Renovate Bot <forgejo-renovate-action@forgejo.org>
Co-committed-by: Renovate Bot <forgejo-renovate-action@forgejo.org>