feat: add token-budget based file reading with intelligent preview #8789

daniel-lxs · 2025-10-23T14:15:11Z

Problem

Files can fill the entire context window, and tiktoken crashes with RuntimeError: unreachable on files >5MB.

Solution

Token-budget based file reading with multi-layer protection against context overflow and tokenizer crashes.

How It Works

File Read Request
       ↓
Check file size
       ↓
   < 100KB? → Read normally (fast path)
       ↓
   > 5MB? → Preview first 100KB (crash prevention)
       ↓
Count tokens → Within budget? → Full content
                     ↓
                Exceeds budget? → Truncate to fit
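
The size checks in the diagram above can be sketched as a small routing function. This is an illustrative sketch only: `chooseReadStrategy`, `ReadStrategy`, and the constant names are hypothetical and are not taken from the PR's code, and it assumes 1KB = 1024 bytes.

```typescript
// Hypothetical names; only the 100KB and 5MB thresholds come from the PR description.
const FAST_PATH_BYTES = 100 * 1024 // below this: read normally
const PREVIEW_THRESHOLD_BYTES = 5 * 1024 * 1024 // above this: preview only

type ReadStrategy = "full" | "preview" | "count-tokens"

function chooseReadStrategy(fileSizeBytes: number): ReadStrategy {
    if (fileSizeBytes < FAST_PATH_BYTES) {
        return "full" // fast path, no validation
    }
    if (fileSizeBytes > PREVIEW_THRESHOLD_BYTES) {
        return "preview" // avoid feeding the whole file to the tokenizer
    }
    return "count-tokens" // mid-sized file: count tokens and compare with the budget
}
```

A caller would branch on the returned strategy before touching the tokenizer, so files over 5MB are never tokenized in full.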

Implementation

Layer 1: Fast Path (< 100KB)

Skip validation for small files - zero overhead ⚡

Layer 2: Token Validation (100KB - 5MB)

Dynamic budget: (contextWindow - currentTokens) * 0.6
Real token counting, smart truncation if needed 🎯
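
A minimal sketch of the budget formula above, assuming `contextWindow` and `currentTokens` are plain token counts. The function name `computeTokenBudget`, the clamp to zero, and the rounding down are assumptions for illustration, not details confirmed by the PR.

```typescript
// Hypothetical helper; the 0.6 factor is the one quoted in the description.
const FILE_BUDGET_FRACTION = 0.6

function computeTokenBudget(contextWindow: number, currentTokens: number): number {
    // Assumption: an already-full context yields a budget of 0 rather than a negative number.
    const remaining = Math.max(0, contextWindow - currentTokens)
    // Assumption: round down so the budget never overstates the available space.
    return Math.floor(remaining * FILE_BUDGET_FRACTION)
}
```

For example, with a 200,000-token window and 50,000 tokens already used, 150,000 tokens remain and the file may use up to 90,000 of them.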

Layer 3: Preview Mode (> 5MB)

Return 100KB preview to prevent crashes
Suggests using line_range for targeted reading 👁️

Layer 4: Error Recovery

Catch tokenizer unreachable errors gracefully
Fallback to 100KB preview instead of crashing 🛡️
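
The error-recovery layer could look roughly like the sketch below. Everything here is hypothetical except the "unreachable" error text and the 100KB preview size: the tokenizer is passed in as a plain function, the preview is cut by character count (which only approximates 100KB for mostly ASCII text), and unrelated errors are rethrown.

```typescript
// Hypothetical names for illustration; not the PR's actual helpers.
const PREVIEW_CHARS = 100 * 1024

type TokenCounter = (text: string) => number

interface CountOutcome {
    content: string
    tokens: number | null // null when the tokenizer failed and a preview was returned
    isPreview: boolean
}

function countTokensOrPreview(content: string, countTokens: TokenCounter): CountOutcome {
    try {
        return { content: content, tokens: countTokens(content), isPreview: false }
    } catch (error) {
        const message = error instanceof Error ? error.message : String(error)
        // Only the tokenizer's "unreachable" failure is treated as recoverable here.
        if (message.indexOf("unreachable") === -1) {
            throw error
        }
        return { content: content.slice(0, PREVIEW_CHARS), tokens: null, isPreview: true }
    }
}
```

Returning a preview instead of propagating the error keeps the conversation going, and the `isPreview` flag lets the caller tell the model that the content is partial.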

Benefits

✅ Dynamic budget based on actual context (no magic numbers)
✅ Real token counting using existing tokenizer
✅ 100KB previews for large files
✅ Graceful error handling prevents conversation crashes
✅ Simple (~160 lines) vs complex heuristics
✅ 17 comprehensive tests covering all scenarios

Testing

All 17 tests passing: fast path, budget validation, preview mode, error recovery, edge cases, unicode support.

Behavior:
- Implements token-budget based file reading in readFileTool.ts to prevent context overflow and tokenizer crashes.
- Introduces validateFileTokenBudget and truncateFileContent in fileTokenBudget.ts for handling large files.
- Handles files <100KB normally, previews first 100KB for files >5MB, and truncates content if it exceeds token budget.
Testing:
- Adds fileTokenBudget.spec.ts with 17 tests covering scenarios like small files, large files, budget validation, and error handling.
- Updates line-counter.spec.ts to test line and token counting with error handling.
Misc:
- Updates line-counter.ts to include token estimation alongside line counting.
- Modifies readFileTool.spec.ts to test new file reading behavior.

This description was created automatically for 169fb35. You can customize this summary. It will automatically update as commits are pushed.

Implements a simple, token-budget based file reading system that prevents context window overflow and tokenizer crashes.

Problem:
- Files could fill entire context window causing issues
- tiktoken crashes with 'unreachable' error on files >5MB
- PR #6667's approach was too complex with magic numbers

Solution - Multi-Layer Defense:
1. Fast path: Files <100KB skip validation (no overhead)
2. Token validation: 100KB-5MB files use real token counting
   - Budget: (contextWindow - currentTokens) * 0.6
   - Smart truncation if exceeds budget
3. Preview mode: Files >5MB get 100KB preview (prevents crashes)
4. Error recovery: Catch tokenizer 'unreachable' errors gracefully

Key Features:
- No magic numbers - dynamic based on actual context
- Real token counting using existing tokenizer
- 100KB previews for large files (perfect size for structure visibility)
- Graceful error handling prevents conversation crashes
- Simple implementation (~160 lines vs complex heuristics)

Testing:
- 17 comprehensive tests covering all scenarios
- All tests passing including edge cases and error conditions

Files:
- src/core/tools/helpers/fileTokenBudget.ts: Core validation logic
- src/core/tools/helpers/__tests__/fileTokenBudget.spec.ts: Test suite
- src/core/tools/readFileTool.ts: Integration into read file tool

roomote · 2025-10-23T14:15:34Z

Starting my review of this PR—comments incoming soon! 🚀

Issues to Address

All previously flagged issues have been resolved:

- The lines attribute shows totalLines from the original file even after content is truncated (see inline comment on readFileTool.ts:619)
- Empty lines are being filtered out when counting displayed lines, causing incorrect line count in the lines attribute (see inline comment on readFileTool.ts:620-621)
- Line counting logic has off-by-one error for files ending with newline characters (see inline comment on readFileTool.ts:620)

Latest Review: No new issues found. All previous concerns have been properly addressed.

Improvements:
- Preview files (>5MB) now use token counting to respect budget
- Read only 100KB preview initially, then validate with tokenizer
- If preview exceeds budget, truncate accordingly
- Better error handling with conservative character-based estimation
- All 17 tests passing

src/core/tools/readFileTool.ts

- Added getTokenUsage mock to createMockCline for readFileTool tests
- Added contextWindow to model info mock
- Updated fileTokenBudget test expectations for error handling
- All 59 tests now passing (42 readFileTool + 17 fileTokenBudget)

…ncation
- Previously used original file totalLines, causing mismatch after truncation
- Now computes displayedLines from truncated content and sets lines="1-N"
- Prevents LLM referencing non-existent line numbers
- All tests passing (59/59)

src/core/tools/readFileTool.ts

…ation
- Count all lines (including empty) when computing lines="1-N"
- Prevents under-reporting when truncated preview contains blank lines
- Tests remain green (42/42 readFileTool, 17/17 fileTokenBudget)

Integrated countFileLinesAndTokens into validateFileTokenBudget:
- Streams file once with chunked token estimation (256-line chunks)
- Early exits when budget exceeded (saves I/O and memory)
- Preserves all existing safety checks:
  - Fast path for <100KB files
  - Preview mode for >5MB files
  - Error handling for tokenizer crashes
  - Fallback to full read if streaming fails

Benefits:
- Single file pass with early exit vs full read + tokenize
- Prevents loading large files into memory unnecessarily
- Conservative fallback on tokenizer errors (2 chars = 1 token)
- All existing tests passing (59/59)

Files:
- src/integrations/misc/line-counter.ts: Added countFileLinesAndTokens()
- src/core/tools/helpers/fileTokenBudget.ts: Integrated streaming
- src/integrations/misc/__tests__/line-counter.spec.ts: Basic tests
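
The chunked counting with early exit described in this commit can be sketched as follows. This is not the PR's `countFileLinesAndTokens`: the name `countLinesAndTokens`, the pull-style `LineSource` (standing in for a file stream), and the result shape are assumptions for illustration. The 256-line chunk size and the 2 chars = 1 token fallback are taken from the commit message.

```typescript
// Hypothetical sketch; a LineSource returns the next line, or undefined at end of input.
type TokenCounter = (text: string) => number
type LineSource = () => string | undefined

interface CountResult {
    lineCount: number
    tokenEstimate: number
    scannedAll: boolean // false when counting stopped early because the budget was exceeded
}

function countLinesAndTokens(
    nextLine: LineSource,
    countTokens: TokenCounter,
    budget: number,
    chunkSize: number = 256,
): CountResult {
    let lineCount = 0
    let tokenEstimate = 0
    let chunk: string[] = []

    // Tokenize the buffered lines; on tokenizer failure, assume 2 chars = 1 token.
    const flush = (): void => {
        if (chunk.length === 0) return
        const text = chunk.join("\n")
        try {
            tokenEstimate += countTokens(text)
        } catch (_error) {
            tokenEstimate += Math.ceil(text.length / 2)
        }
        chunk = []
    }

    let line = nextLine()
    while (line !== undefined) {
        lineCount++
        chunk.push(line)
        if (chunk.length >= chunkSize) {
            flush()
            if (tokenEstimate > budget) {
                // Early exit: stop pulling lines once the budget is exceeded.
                return { lineCount: lineCount, tokenEstimate: tokenEstimate, scannedAll: false }
            }
        }
        line = nextLine()
    }
    flush()
    return { lineCount: lineCount, tokenEstimate: tokenEstimate, scannedAll: true }
}
```

Because the budget is checked after each chunk, an oversized file stops being read shortly after the budget is crossed instead of being loaded and tokenized in full.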

src/core/tools/helpers/fileTokenBudget.ts

…tive estimate

src/core/tools/readFileTool.ts

Two fixes:
1. Line counting off-by-one: Files ending with \n now count correctly
   - "line1\nline2\n" now correctly shows lines="1-2" not lines="1-3"
   - Consistent with countFileLines() behavior
   - Prevents LLM confusion about line numbers
2. Fixed line-counter.spec.ts mocking:
   - Use proper Readable stream instead of mock object
   - Properly mock fs.createReadStream with stream interface
   - All 63 tests passing (42 readFileTool + 17 fileTokenBudget + 4 line-counter)

Files changed:
- src/core/tools/readFileTool.ts: Handle trailing newline in line count
- src/integrations/misc/__tests__/line-counter.spec.ts: Fix stream mocking
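
The line-count rule from the first fix (count blank lines, but do not count an extra line for a trailing newline) can be illustrated with a short sketch. `countDisplayedLines` and `formatLineRange` are hypothetical names, and the handling of empty content is an assumption; only the `"line1\nline2\n"` to `lines="1-2"` behavior comes from the commit message.

```typescript
// Hypothetical helpers for illustration; not the code in readFileTool.ts.
function countDisplayedLines(content: string): number {
    if (content.length === 0) {
        return 0 // assumption: empty content displays no lines
    }
    const parts = content.split("\n") // blank lines are kept, so they are counted
    // A trailing newline terminates the last line; it does not start a new one.
    const endsWithNewline = content.charAt(content.length - 1) === "\n"
    return endsWithNewline ? parts.length - 1 : parts.length
}

function formatLineRange(content: string): string {
    const count = countDisplayedLines(content)
    return count === 0 ? "" : `1-${count}`
}
```

With this rule, truncated content that contains blank lines still reports every displayed line, and a file that ends in a newline does not report a phantom final line.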

src/core/tools/readFileTool.ts

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

daniel-lxs requested review from cte, jr and mrubens as code owners October 23, 2025 14:15

github-project-automation bot added this to Roo Code Roadmap and Roo Code Roadmap Oct 23, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Oct 23, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Oct 23, 2025

dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels Oct 23, 2025

hannesrudolph added the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Oct 23, 2025

roomote bot reviewed Oct 23, 2025

src/core/tools/readFileTool.ts (outdated, resolved)

daniel-lxs added 2 commits October 23, 2025 09:42

daniel-lxs moved this from Triage to PR [Needs Review] in Roo Code Roadmap Oct 23, 2025

roomote bot reviewed Oct 23, 2025

src/core/tools/readFileTool.ts (outdated, resolved)

fix(read_file): count empty lines in displayed line range after trunc…

f9ede9a

…ation
- Count all lines (including empty) when computing lines="1-N"
- Prevents under-reporting when truncated preview contains blank lines
- Tests remain green (42/42 readFileTool, 17/17 fileTokenBudget)

RooCodeInc deleted a comment from roomote bot Oct 23, 2025

hannesrudolph added the PR - Needs Review label and removed the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Oct 23, 2025

roomote bot approved these changes Oct 23, 2025

jr approved these changes Oct 23, 2025

dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 23, 2025

ellipsis-dev bot reviewed Oct 23, 2025

src/core/tools/helpers/fileTokenBudget.ts (resolved)

fix: update token estimation logic on tokenizer error to use conserva…

100afdf

…tive estimate

roomote bot reviewed Oct 23, 2025

src/core/tools/readFileTool.ts (outdated, resolved)

ellipsis-dev bot reviewed Oct 23, 2025

src/core/tools/readFileTool.ts (outdated, resolved)

Update src/core/tools/readFileTool.ts

b6b7587

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

roomote bot approved these changes Oct 23, 2025

mrubens approved these changes Oct 23, 2025

mrubens merged commit 93c13e2 into main Oct 23, 2025
9 checks passed

github-project-automation bot moved this from New to Done in Roo Code Roadmap Oct 23, 2025

github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Oct 23, 2025

mrubens deleted the feat/token-budget-file-reading branch October 23, 2025 19:35

daniel-lxs mentioned this pull request Oct 23, 2025

[ENHANCEMENT] Provider‑aware large file reads to prevent context overload #8038

Closed
