
Fixed per-chunk overlap calculation for float context_size#512

Open
anaslimem wants to merge 1 commit into chonkie-inc:main from anaslimem:fix/float-context-size-overlap

Conversation

@anaslimem
Contributor

Fix Float Context Size Calculation in OverlapRefinery

Summary

Fixed the float context_size calculation in OverlapRefinery to compute overlap per-chunk instead of using a fixed size for all chunks.

Problem

When using a float context_size (e.g., 0.25 for 25% overlap), the previous implementation calculated overlap using the maximum token count across all chunks:

```python
# Old implementation (lines 422-423)
if isinstance(self.context_size, float):
    return int(self.context_size * max(chunk.token_count for chunk in chunks))
```

This meant every chunk received the same overlap amount (25% of the largest chunk), regardless of its actual size.

Example of the Bug

  • Chunks: [100 tokens, 200 tokens, 300 tokens]
  • context_size = 0.25 (intended: 25% per chunk)
| Chunk | Old Behavior | Expected |
| --- | --- | --- |
| Chunk 1 (100 tokens) | 75 tokens (75%) | 25 tokens (25%) |
| Chunk 2 (200 tokens) | 75 tokens (37.5%) | 50 tokens (25%) |
| Chunk 3 (300 tokens) | 75 tokens (25%) | 75 tokens (25%) |

Smaller chunks were getting proportionally more overlap than intended.
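The arithmetic behind the table above can be checked standalone (plain Python, no chonkie imports needed):

```python
# Chunk token counts from the example above
token_counts = [100, 200, 300]
context_size = 0.25  # float => intended fraction of each chunk

# Old behavior: one overlap size for all chunks, derived from the largest chunk
old_overlap = int(context_size * max(token_counts))  # 75 tokens for every chunk

# New behavior: overlap computed per chunk
new_overlaps = [int(context_size * n) for n in token_counts]

print(old_overlap)   # 75
print(new_overlaps)  # [25, 50, 75]
```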

Solution

Modified _refine_prefix() and _refine_suffix() to calculate effective context size per-chunk when context_size is a float:

```python
# New implementation
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```

Files Changed

  • src/chonkie/refinery/overlap.py
    • _refine_prefix() method (lines 298-301)
    • _refine_suffix() method (lines 395-398)

Testing

All 40 existing tests pass:

```
tests/refinery/test_overlap_refinery.py ............ 40 passed
```

Backwards Compatibility

This fix maintains backwards compatibility for integer context_size values. Only float values benefit from the corrected per-chunk calculation.

Copilot AI review requested due to automatic review settings March 1, 2026 17:36
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the OverlapRefinery where a float context_size was being misapplied uniformly across all chunks based on the largest chunk's token count. The fix ensures that the overlap is now calculated proportionally for each individual chunk, leading to accurate and consistent context handling, particularly for datasets with varying chunk sizes.

Highlights

  • Float Context Size Calculation: Corrected the OverlapRefinery to calculate overlap per-chunk when context_size is a float, addressing an issue where smaller chunks received disproportionately more overlap than intended.


Changelog
  • src/chonkie/refinery/overlap.py
    • Modified _refine_prefix() to calculate effective_context_size per-chunk for float context_size values.
    • Modified _refine_suffix() to calculate effective_context_size per-chunk for float context_size values.
Activity
  • All 40 existing tests for test_overlap_refinery.py passed, confirming the fix did not introduce regressions and maintains backwards compatibility for integer context_size values.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes how OverlapRefinery interprets a float context_size by computing overlap size per source chunk (instead of using a single overlap size derived from the maximum chunk size), aligning overlap proportion with each chunk’s own token count.

Changes:

  • Update _refine_prefix() to compute effective_context_size per-chunk when context_size is a float.
  • Update _refine_suffix() to compute effective_context_size per-chunk when context_size is a float.


Comment on lines +301 to +304
```python
# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```


Copilot AI Mar 1, 2026


The new per-chunk float context sizing behavior isn’t directly asserted in tests right now (existing tests mainly check that context exists). Add a focused test that uses uneven token_counts and asserts the computed overlap length is proportional per chunk (including a prefix+merge case to guard against token_count mutation affecting later iterations).
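One possible shape for the proportionality assertion, sketched here against the bare formula rather than chonkie's real API (a full test would build chunks through OverlapRefinery itself):

```python
def per_chunk_overlap(context_size: float, token_count: int) -> int:
    # Mirrors the PR's per-chunk calculation
    return int(context_size * token_count)

# Uneven token counts: overlap must stay proportional to each chunk's own size
uneven_counts = [37, 118, 503]
for n in uneven_counts:
    overlap = per_chunk_overlap(0.25, n)
    # int() truncates, so the overlap is at most 25% of the chunk
    # and within one token of exactly 25%
    assert overlap <= 0.25 * n
    assert 0.25 * n - overlap < 1
```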

Copilot uses AI. Check for mistakes.
Comment on lines 300 to +304

```python
# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```


Copilot AI Mar 1, 2026


In _refine_prefix, effective_context_size is derived from prev_chunk.token_count, but in prefix+merge mode each chunk’s token_count is increased when it receives context. On the next iteration that mutated token_count is used as the basis for overlap, which can inflate overlap sizes and potentially cause context “cascade” across chunks. Consider snapshotting the original token counts before mutating chunks (e.g., original_token_counts = [c.token_count for c in chunks]) and using that snapshot for float calculations, so per-chunk overlap is based on the original chunk size.
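A toy illustration of the snapshot idea (the Chunk dataclass and loop below are a simplified stand-in for _refine_prefix, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    token_count: int

def overlap_sizes_with_snapshot(chunks, context_size):
    # Snapshot token counts BEFORE any chunk is mutated, so later
    # iterations compute overlap from the original sizes.
    original_token_counts = [c.token_count for c in chunks]
    sizes = []
    for i in range(1, len(chunks)):
        effective = int(context_size * original_token_counts[i - 1])
        sizes.append(effective)
        # Simulate merge mode growing the receiving chunk's token_count;
        # thanks to the snapshot, this mutation cannot inflate later overlaps.
        chunks[i].token_count += effective
    return sizes

chunks = [Chunk("a", 100), Chunk("b", 200), Chunk("c", 300)]
print(overlap_sizes_with_snapshot(chunks, 0.25))  # [25, 50]
```

Without the snapshot, the second overlap would be computed from the mutated count 225 (giving 56 tokens) instead of the original 200.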


```python
# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```

Copilot AI Mar 1, 2026


The float -> int conversion can yield effective_context_size == 0 for small chunks (e.g., 0.25 * 1 -> 0). In prefix token overlap, tokens[-0:] returns the full token list, so a 0 overlap becomes “entire chunk” overlap; in recursive mode an effective_context_size of 0 can also break _split_text when it falls back to token splitting (range step 0). Clamp the computed value to at least 1 (or use math.ceil with a minimum of 1) when prev_chunk.token_count > 0.

Suggested change

```diff
-effective_context_size = int(self.context_size * prev_chunk.token_count)
+# Ensure that for non-empty chunks the effective context size is at least 1.
+# This avoids cases where int(self.context_size * token_count) becomes 0,
+# which would otherwise lead to unintended "entire chunk" overlap or
+# downstream errors in recursive splitting logic.
+if prev_chunk.token_count > 0:
+    effective_context_size = max(
+        1, int(self.context_size * prev_chunk.token_count)
+    )
+else:
+    # No tokens to overlap; keep context size at 0.
+    effective_context_size = 0
```

Comment on lines 395 to 403
```python
# Get the previous chunk
prev_chunk = chunks[i + 1]

# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)

# Calculate the overlap context
context = self._get_suffix_overlap_context(prev_chunk, effective_context_size)
```

Copilot AI Mar 1, 2026


In _refine_suffix, prev_chunk = chunks[i + 1] is actually the next chunk (the source of the suffix context), so both the variable name and the comment “Get the previous chunk” are misleading. Renaming to next_chunk (and updating the comment accordingly) would make the control flow easier to follow and reduce mistakes when modifying this logic.

Suggested change

```diff
-# Get the previous chunk
-prev_chunk = chunks[i + 1]
-
-# Calculate effective context size per chunk if context_size is a float
-if isinstance(self.context_size, float):
-    effective_context_size = int(self.context_size * prev_chunk.token_count)
-
-# Calculate the overlap context
-context = self._get_suffix_overlap_context(prev_chunk, effective_context_size)
+# Get the next chunk (source of the suffix context)
+next_chunk = chunks[i + 1]
+
+# Calculate effective context size per chunk if context_size is a float
+if isinstance(self.context_size, float):
+    effective_context_size = int(self.context_size * next_chunk.token_count)
+
+# Calculate the overlap context
+context = self._get_suffix_overlap_context(next_chunk, effective_context_size)
```

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to correctly address the issue of calculating overlap for a float context_size on a per-chunk basis. However, it introduces a potential Denial of Service (DoS) vulnerability in the _refine_prefix method. This vulnerability arises because the calculation of overlap size for a chunk depends on the modified token_count of the previous chunk, leading to cumulative growth of added context and potential excessive memory consumption. Additionally, there is duplicated code in _refine_prefix and _refine_suffix that should be refactored for better maintainability.

Comment on lines +302 to +303
```python
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
Contributor


Severity: medium (security)

The calculation of effective_context_size in _refine_prefix for a float context_size is based on the token_count of the previous chunk, which is modified in-place during iteration. This creates a cumulative growth of added context, potentially leading to a Denial of Service (DoS) due to excessive memory and CPU usage, especially with large context_size values and many chunks. To mitigate this, consider pre-calculating effective_context_size for all chunks based on their original sizes before the refinement loop. Additionally, this logic is duplicated in _refine_suffix (lines 399-400); extracting it into a shared helper method would improve maintainability and adhere to the DRY principle.

Comment on lines +399 to +400
```python
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
Contributor


Severity: medium

This logic is a duplicate of the code in the _refine_prefix method (lines 302-303). To improve maintainability, this calculation should be extracted into a shared private helper method.
