Fixed per-chunk overlap calculation for float context_size #512
anaslimem wants to merge 1 commit into chonkie-inc:main
Conversation
Activity
Pull request overview
This PR fixes how OverlapRefinery interprets a float context_size by computing overlap size per source chunk (instead of using a single overlap size derived from the maximum chunk size), aligning overlap proportion with each chunk’s own token count.
Changes:
- Update `_refine_prefix()` to compute `effective_context_size` per-chunk when `context_size` is a float.
- Update `_refine_suffix()` to compute `effective_context_size` per-chunk when `context_size` is a float.
```python
# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
The new per-chunk float context sizing behavior isn’t directly asserted in tests right now (existing tests mainly check that context exists). Add a focused test that uses uneven token_counts and asserts the computed overlap length is proportional per chunk (including a prefix+merge case to guard against token_count mutation affecting later iterations).
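A focused test along those lines might look like the following sketch. It uses a stand-in helper rather than the real `OverlapRefinery` API (the helper name and the direct arithmetic are illustrative, not the library's code):

```python
# Hypothetical sketch of a proportionality test. A real test would build
# Chunk objects and run OverlapRefinery; here we only check the per-chunk
# sizing arithmetic that the PR introduces.

def effective_context_size(context_size: float, token_count: int) -> int:
    """Per-chunk overlap size for a float context_size (mirrors the PR's fix)."""
    return int(context_size * token_count)

def test_float_context_size_is_proportional_per_chunk():
    token_counts = [100, 200, 300]  # deliberately uneven
    sizes = [effective_context_size(0.25, tc) for tc in token_counts]
    # Each chunk's overlap is 25% of its OWN size, not 25% of the max (75).
    assert sizes == [25, 50, 75]

test_float_context_size_is_proportional_per_chunk()
```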
In _refine_prefix, effective_context_size is derived from prev_chunk.token_count, but in prefix+merge mode each chunk’s token_count is increased when it receives context. On the next iteration that mutated token_count is used as the basis for overlap, which can inflate overlap sizes and potentially cause context “cascade” across chunks. Consider snapshotting the original token counts before mutating chunks (e.g., original_token_counts = [c.token_count for c in chunks]) and using that snapshot for float calculations, so per-chunk overlap is based on the original chunk size.
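The snapshot idea can be sketched as follows. The `Chunk` dataclass and the loop are a minimal stand-in for the real refinery code, not the library's implementation:

```python
# Sketch: snapshot original token counts before in-place mutation, so float
# overlap sizing is always based on each chunk's ORIGINAL size.
from dataclasses import dataclass

@dataclass
class Chunk:  # minimal stand-in for chonkie's Chunk
    text: str
    token_count: int

def overlap_sizes(chunks, context_size: float):
    original_token_counts = [c.token_count for c in chunks]  # snapshot first
    sizes = []
    for i, count in enumerate(original_token_counts):
        sizes.append(int(context_size * count))
        # Simulate prefix+merge mode: the chunk grows when context is added,
        # but the snapshot keeps later iterations unaffected.
        chunks[i].token_count += sizes[-1]
    return sizes

chunks = [Chunk("a", 100), Chunk("b", 100), Chunk("c", 100)]
print(overlap_sizes(chunks, 0.25))  # [25, 25, 25] -- no cascade
```

Without the snapshot, the second chunk's overlap would be computed from the first chunk's inflated count (125), and the inflation would compound across the list.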
The float -> int conversion can yield effective_context_size == 0 for small chunks (e.g., 0.25 * 1 -> 0). In prefix token overlap, tokens[-0:] returns the full token list, so a 0 overlap becomes “entire chunk” overlap; in recursive mode an effective_context_size of 0 can also break _split_text when it falls back to token splitting (range step 0). Clamp the computed value to at least 1 (or use math.ceil with a minimum of 1) when prev_chunk.token_count > 0.
Suggested change:

```python
if prev_chunk.token_count > 0:
    # Ensure that for non-empty chunks the effective context size is at least 1.
    # This avoids cases where int(self.context_size * token_count) becomes 0,
    # which would otherwise lead to unintended "entire chunk" overlap or
    # downstream errors in recursive splitting logic.
    effective_context_size = max(
        1, int(self.context_size * prev_chunk.token_count)
    )
else:
    # No tokens to overlap; keep context size at 0.
    effective_context_size = 0
```
```python
# Get the previous chunk
prev_chunk = chunks[i + 1]

# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)

# Calculate the overlap context
context = self._get_suffix_overlap_context(prev_chunk, effective_context_size)
```
In _refine_suffix, prev_chunk = chunks[i + 1] is actually the next chunk (the source of the suffix context), so both the variable name and the comment “Get the previous chunk” are misleading. Renaming to next_chunk (and updating the comment accordingly) would make the control flow easier to follow and reduce mistakes when modifying this logic.
Suggested change:

```python
# Get the next chunk (source of the suffix context)
next_chunk = chunks[i + 1]

# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * next_chunk.token_count)

# Calculate the overlap context
context = self._get_suffix_overlap_context(next_chunk, effective_context_size)
```
Code Review
This pull request aims to correctly address the issue of calculating overlap for a float context_size on a per-chunk basis. However, it introduces a potential Denial of Service (DoS) vulnerability in the _refine_prefix method. This vulnerability arises because the calculation of overlap size for a chunk depends on the modified token_count of the previous chunk, leading to cumulative growth of added context and potential excessive memory consumption. Additionally, there is duplicated code in _refine_prefix and _refine_suffix that should be refactored for better maintainability.
```python
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
The calculation of effective_context_size in _refine_prefix for a float context_size is based on the token_count of the previous chunk, which is modified in-place during iteration. This creates a cumulative growth of added context, potentially leading to a Denial of Service (DoS) due to excessive memory and CPU usage, especially with large context_size values and many chunks. To mitigate this, consider pre-calculating effective_context_size for all chunks based on their original sizes before the refinement loop. Additionally, this logic is duplicated in _refine_suffix (lines 399-400); extracting it into a shared helper method would improve maintainability and adhere to the DRY principle.
Fix Float Context Size Calculation in OverlapRefinery
Summary
Fixed the float `context_size` calculation in `OverlapRefinery` to compute overlap per-chunk instead of using a fixed size for all chunks.

Problem

When using a float `context_size` (e.g., `0.25` for 25% overlap), the previous implementation calculated overlap using the maximum token count across all chunks. This meant every chunk received the same overlap amount (25% of the largest chunk), regardless of its actual size.
Example of the Bug
- Chunks: `[100 tokens, 200 tokens, 300 tokens]`
- `context_size = 0.25` (intended: 25% per chunk)

Smaller chunks were getting proportionally more overlap than intended.
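The numbers can be checked directly with a stand-alone sketch of the two sizing strategies (plain arithmetic, not the library code):

```python
# Old behavior: one overlap size derived from the largest chunk, applied to all.
token_counts = [100, 200, 300]
context_size = 0.25

old_overlap = int(context_size * max(token_counts))
print([old_overlap] * len(token_counts))  # [75, 75, 75] -> 75% of the 100-token chunk!

# New behavior: overlap proportional to each chunk's own token count.
new_overlaps = [int(context_size * tc) for tc in token_counts]
print(new_overlaps)  # [25, 50, 75] -> 25% of each chunk
```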
Solution
Modified `_refine_prefix()` and `_refine_suffix()` to calculate the effective context size per-chunk when `context_size` is a float.

Files Changed

- `src/chonkie/refinery/overlap.py`
  - `_refine_prefix()` method (lines 298-301)
  - `_refine_suffix()` method (lines 395-398)

Testing
All 40 existing tests pass:
Backwards Compatibility
This fix maintains backwards compatibility for integer `context_size` values. Only float values benefit from the corrected per-chunk calculation.