forked from cline/cline
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat: add token-budget based file reading with intelligent preview #8789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Implements a simple, token-budget based file reading system that prevents context window overflow and tokenizer crashes. Problem: - Files could fill entire context window causing issues - tiktoken crashes with 'unreachable' error on files >5MB - PR #6667's approach was too complex with magic numbers Solution - Multi-Layer Defense: 1. Fast path: Files <100KB skip validation (no overhead) 2. Token validation: 100KB-5MB files use real token counting - Budget: (contextWindow - currentTokens) * 0.6 - Smart truncation if exceeds budget 3. Preview mode: Files >5MB get 100KB preview (prevents crashes) 4. Error recovery: Catch tokenizer 'unreachable' errors gracefully Key Features: - No magic numbers - dynamic based on actual context - Real token counting using existing tokenizer - 100KB previews for large files (perfect size for structure visibility) - Graceful error handling prevents conversation crashes - Simple implementation (~160 lines vs complex heuristics) Testing: - 17 comprehensive tests covering all scenarios - All tests passing including edge cases and error conditions Files: - src/core/tools/helpers/fileTokenBudget.ts: Core validation logic - src/core/tools/helpers/__tests__/fileTokenBudget.spec.ts: Test suite - src/core/tools/readFileTool.ts: Integration into read file tool
|
Starting my review of this PR—comments incoming soon! 🚀 Follow Along on Roo Code Cloud Issues to AddressAll previously flagged issues have been resolved:
Latest Review: No new issues found. All previous concerns have been properly addressed. |
Improvements: - Preview files (>5MB) now use token counting to respect budget - Read only 100KB preview initially, then validate with tokenizer - If preview exceeds budget, truncate accordingly - Better error handling with conservative character-based estimation - All 17 tests passing
- Added getTokenUsage mock to createMockCline for readFileTool tests - Added contextWindow to model info mock - Updated fileTokenBudget test expectations for error handling - All 59 tests now passing (42 readFileTool + 17 fileTokenBudget)
…ncation - Previously used original file totalLines, causing mismatch after truncation - Now computes displayedLines from truncated content and sets lines="1-N" - Prevents LLM referencing non-existent line numbers - All tests passing (59/59)
…ation - Count all lines (including empty) when computing lines="1-N" - Prevents under-reporting when truncated preview contains blank lines - Tests remain green (42/42 readFileTool, 17/17 fileTokenBudget)
jr
approved these changes
Oct 23, 2025
Integrated countFileLinesAndTokens into validateFileTokenBudget: - Streams file once with chunked token estimation (256-line chunks) - Early exits when budget exceeded (saves I/O and memory) - Preserves all existing safety checks: - Fast path for <100KB files - Preview mode for >5MB files - Error handling for tokenizer crashes - Fallback to full read if streaming fails Benefits: - Single file pass with early exit vs full read + tokenize - Prevents loading large files into memory unnecessarily - Conservative fallback on tokenizer errors (2 chars = 1 token) - All existing tests passing (59/59) Files: - src/integrations/misc/line-counter.ts: Added countFileLinesAndTokens() - src/core/tools/helpers/fileTokenBudget.ts: Integrated streaming - src/integrations/misc/__tests__/line-counter.spec.ts: Basic tests
Two fixes: 1. Line counting off-by-one: Files ending with \n now count correctly - "line1\nline2\n" now correctly shows lines="1-2" not lines="1-3" - Consistent with countFileLines() behavior - Prevents LLM confusion about line numbers 2. Fixed line-counter.spec.ts mocking: - Use proper Readable stream instead of mock object - Properly mock fs.createReadStream with stream interface - All 63 tests passing (42 readFileTool + 17 fileTokenBudget + 4 line-counter) Files changed: - src/core/tools/readFileTool.ts: Handle trailing newline in line count - src/integrations/misc/__tests__/line-counter.spec.ts: Fix stream mocking
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
mrubens
approved these changes
Oct 23, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
enhancement
New feature or request
lgtm
This PR has been approved by a maintainer
PR - Needs Review
size:XL
This PR changes 500-999 lines, ignoring generated files.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Files can fill the entire context window, and tiktoken crashes with
RuntimeError: unreachableon files >5MB.Solution
Token-budget based file reading with multi-layer protection against context overflow and tokenizer crashes.
How It Works
Implementation
Layer 1: Fast Path (< 100KB)
Layer 2: Token Validation (100KB - 5MB)
(contextWindow - currentTokens) * 0.6Layer 3: Preview Mode (> 5MB)
line_rangefor targeted reading 👁️Layer 4: Error Recovery
unreachableerrors gracefullyBenefits
✅ Dynamic budget based on actual context (no magic numbers)
✅ Real token counting using existing tokenizer
✅ 100KB previews for large files
✅ Graceful error handling prevents conversation crashes
✅ Simple (~160 lines) vs complex heuristics
✅ 17 comprehensive tests covering all scenarios
Testing
All 17 tests passing: fast path, budget validation, preview mode, error recovery, edge cases, unicode support.
Related
Closes #6667
Important
Introduces token-budget based file reading in
readFileTool.tsto handle large files efficiently, with new functions for budget validation and content truncation, and comprehensive tests.readFileTool.tsto prevent context overflow and tokenizer crashes.validateFileTokenBudgetandtruncateFileContentinfileTokenBudget.tsfor handling large files.fileTokenBudget.spec.tswith 17 tests covering scenarios like small files, large files, budget validation, and error handling.line-counter.spec.tsto test line and token counting with error handling.line-counter.tsto include token estimation alongside line counting.readFileTool.spec.tsto test new file reading behavior.This description was created by
for 169fb35. You can customize this summary. It will automatically update as commits are pushed.