Fixed per-chunk overlap calculation for float context_size #512
anaslimem wants to merge 1 commit into chonkie-inc:main
Conversation
Activity
Pull request overview
This PR fixes how OverlapRefinery interprets a float context_size by computing overlap size per source chunk (instead of using a single overlap size derived from the maximum chunk size), aligning overlap proportion with each chunk’s own token count.
Changes:
- Update `_refine_prefix()` to compute `effective_context_size` per-chunk when `context_size` is a float.
- Update `_refine_suffix()` to compute `effective_context_size` per-chunk when `context_size` is a float.
```python
# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
The new per-chunk float context sizing behavior isn’t directly asserted in tests right now (existing tests mainly check that context exists). Add a focused test that uses uneven token_counts and asserts the computed overlap length is proportional per chunk (including a prefix+merge case to guard against token_count mutation affecting later iterations).
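A focused test along those lines might look like the following sketch. It uses a stand-in helper rather than the real `OverlapRefinery` API (the helper name and the direct arithmetic are illustrative, not the library's code):

```python
# Hypothetical sketch of a proportionality test. A real test would build
# Chunk objects and run OverlapRefinery; here we only check the per-chunk
# sizing arithmetic that the PR introduces.

def effective_context_size(context_size: float, token_count: int) -> int:
    """Per-chunk overlap size for a float context_size (mirrors the PR's fix)."""
    return int(context_size * token_count)

def test_float_context_size_is_proportional_per_chunk():
    token_counts = [100, 200, 300]  # deliberately uneven
    sizes = [effective_context_size(0.25, tc) for tc in token_counts]
    # Each chunk's overlap is 25% of its OWN size, not 25% of the max (75).
    assert sizes == [25, 50, 75]

test_float_context_size_is_proportional_per_chunk()
```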
In _refine_prefix, effective_context_size is derived from prev_chunk.token_count, but in prefix+merge mode each chunk’s token_count is increased when it receives context. On the next iteration that mutated token_count is used as the basis for overlap, which can inflate overlap sizes and potentially cause context “cascade” across chunks. Consider snapshotting the original token counts before mutating chunks (e.g., original_token_counts = [c.token_count for c in chunks]) and using that snapshot for float calculations, so per-chunk overlap is based on the original chunk size.
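The snapshot idea can be sketched as follows. The `Chunk` dataclass and the loop are a minimal stand-in for the real refinery code, not the library's implementation:

```python
# Sketch: snapshot original token counts before in-place mutation, so float
# overlap sizing is always based on each chunk's ORIGINAL size.
from dataclasses import dataclass

@dataclass
class Chunk:  # minimal stand-in for chonkie's Chunk
    text: str
    token_count: int

def overlap_sizes(chunks, context_size: float):
    original_token_counts = [c.token_count for c in chunks]  # snapshot first
    sizes = []
    for i, count in enumerate(original_token_counts):
        sizes.append(int(context_size * count))
        # Simulate prefix+merge mode: the chunk grows when context is added,
        # but the snapshot keeps later iterations unaffected.
        chunks[i].token_count += sizes[-1]
    return sizes

chunks = [Chunk("a", 100), Chunk("b", 100), Chunk("c", 100)]
print(overlap_sizes(chunks, 0.25))  # [25, 25, 25] -- no cascade
```

Without the snapshot, the second chunk's overlap would be computed from the first chunk's inflated count (125), and the inflation would compound across the list.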
The float -> int conversion can yield effective_context_size == 0 for small chunks (e.g., 0.25 * 1 -> 0). In prefix token overlap, tokens[-0:] returns the full token list, so a 0 overlap becomes “entire chunk” overlap; in recursive mode an effective_context_size of 0 can also break _split_text when it falls back to token splitting (range step 0). Clamp the computed value to at least 1 (or use math.ceil with a minimum of 1) when prev_chunk.token_count > 0.
Suggested change:

```python
if prev_chunk.token_count > 0:
    # Ensure that for non-empty chunks the effective context size is at least 1.
    # This avoids cases where int(self.context_size * token_count) becomes 0,
    # which would otherwise lead to unintended "entire chunk" overlap or
    # downstream errors in recursive splitting logic.
    effective_context_size = max(
        1, int(self.context_size * prev_chunk.token_count)
    )
else:
    # No tokens to overlap; keep context size at 0.
    effective_context_size = 0
```
```python
# Get the previous chunk
prev_chunk = chunks[i + 1]

# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)

# Calculate the overlap context
context = self._get_suffix_overlap_context(prev_chunk, effective_context_size)
```
In _refine_suffix, prev_chunk = chunks[i + 1] is actually the next chunk (the source of the suffix context), so both the variable name and the comment “Get the previous chunk” are misleading. Renaming to next_chunk (and updating the comment accordingly) would make the control flow easier to follow and reduce mistakes when modifying this logic.
Suggested change:

```python
# Get the next chunk (source of the suffix context)
next_chunk = chunks[i + 1]

# Calculate effective context size per chunk if context_size is a float
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * next_chunk.token_count)

# Calculate the overlap context
context = self._get_suffix_overlap_context(next_chunk, effective_context_size)
```
Code Review
This pull request aims to correctly address the issue of calculating overlap for a float context_size on a per-chunk basis. However, it introduces a potential Denial of Service (DoS) vulnerability in the _refine_prefix method. This vulnerability arises because the calculation of overlap size for a chunk depends on the modified token_count of the previous chunk, leading to cumulative growth of added context and potential excessive memory consumption. Additionally, there is duplicated code in _refine_prefix and _refine_suffix that should be refactored for better maintainability.
```python
if isinstance(self.context_size, float):
    effective_context_size = int(self.context_size * prev_chunk.token_count)
```
The calculation of effective_context_size in _refine_prefix for a float context_size is based on the token_count of the previous chunk, which is modified in-place during iteration. This creates a cumulative growth of added context, potentially leading to a Denial of Service (DoS) due to excessive memory and CPU usage, especially with large context_size values and many chunks. To mitigate this, consider pre-calculating effective_context_size for all chunks based on their original sizes before the refinement loop. Additionally, this logic is duplicated in _refine_suffix (lines 399-400); extracting it into a shared helper method would improve maintainability and adhere to the DRY principle.
Fix Float Context Size Calculation in OverlapRefinery
Summary
Fixed the float `context_size` calculation in `OverlapRefinery` to compute overlap per-chunk instead of using a fixed size for all chunks.

Problem

When using a float `context_size` (e.g., `0.25` for 25% overlap), the previous implementation calculated overlap using the maximum token count across all chunks. This meant every chunk received the same overlap amount (25% of the largest chunk), regardless of its actual size.
Example of the Bug
- Chunks: `[100 tokens, 200 tokens, 300 tokens]`
- `context_size = 0.25` (intended: 25% per chunk)

Smaller chunks were getting proportionally more overlap than intended.
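The numbers can be checked directly with a stand-alone sketch of the two sizing strategies (plain arithmetic, not the library code):

```python
# Old behavior: one overlap size derived from the largest chunk, applied to all.
token_counts = [100, 200, 300]
context_size = 0.25

old_overlap = int(context_size * max(token_counts))
print([old_overlap] * len(token_counts))  # [75, 75, 75] -> 75% of the 100-token chunk!

# New behavior: overlap proportional to each chunk's own token count.
new_overlaps = [int(context_size * tc) for tc in token_counts]
print(new_overlaps)  # [25, 50, 75] -> 25% of each chunk
```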
Solution
Modified `_refine_prefix()` and `_refine_suffix()` to calculate the effective context size per-chunk when `context_size` is a float.

Files Changed

- `src/chonkie/refinery/overlap.py`
  - `_refine_prefix()` method (lines 298-301)
  - `_refine_suffix()` method (lines 395-398)

Testing
All 40 existing tests pass:
Backwards Compatibility
This fix maintains backwards compatibility for integer `context_size` values. Only float values benefit from the corrected per-chunk calculation.