@daniel-lxs daniel-lxs commented Oct 23, 2025

Problem

Files can fill the entire context window, and tiktoken crashes with RuntimeError: unreachable on files >5MB.

Solution

Token-budget based file reading with multi-layer protection against context overflow and tokenizer crashes.

How It Works

File Read Request
       ↓
Check file size
       ↓
   < 100KB? → Read normally (fast path)
       ↓
   > 5MB? → Preview first 100KB (crash prevention)
       ↓
Count tokens → Within budget? → Full content
                     ↓
                Exceeds budget? → Truncate to fit
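
The size-based routing in the diagram above can be sketched roughly as follows. This is only an illustration: `chooseReadStrategy`, `ReadStrategy`, and the constant names are hypothetical, and sending files of exactly 100KB or exactly 5MB to token validation is an assumption drawn from the `<` and `>` comparisons in the diagram.

```typescript
// Thresholds come from the PR description; all names here are hypothetical.
const FAST_PATH_MAX_BYTES = 100 * 1024; // 100KB
const PREVIEW_MIN_BYTES = 5 * 1024 * 1024; // 5MB

type ReadStrategy = "fast-path" | "token-validation" | "preview";

// Pick a read strategy from the file size alone, before any content is tokenized.
function chooseReadStrategy(fileSizeBytes: number): ReadStrategy {
  if (fileSizeBytes < FAST_PATH_MAX_BYTES) {
    return "fast-path"; // small file: read normally, no validation
  }
  if (fileSizeBytes > PREVIEW_MIN_BYTES) {
    return "preview"; // very large file: only the first 100KB is read
  }
  return "token-validation"; // everything in between gets counted against the budget
}
```

Deciding on file size first means the tokenizer is never handed a whole file larger than 5MB, which is the case the PR reports as crashing.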

Implementation

Layer 1: Fast Path (< 100KB)

  • Skip validation for small files - zero overhead ⚡

Layer 2: Token Validation (100KB - 5MB)

  • Dynamic budget: (contextWindow - currentTokens) * 0.6
  • Real token counting, smart truncation if needed 🎯
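
As a worked example of the budget formula, here is a minimal sketch. `computeTokenBudget` and `BUDGET_FRACTION` are hypothetical names, and clamping at zero and rounding down are assumptions rather than documented behavior.

```typescript
// The 0.6 factor is from the PR description; the names are hypothetical.
const BUDGET_FRACTION = 0.6;

// Budget = 60% of whatever is left of the context window.
function computeTokenBudget(contextWindow: number, currentTokens: number): number {
  // Assumption: never report a negative budget if the context is already over-full.
  const remaining = Math.max(0, contextWindow - currentTokens);
  // Assumption: round down so the budget is a whole number of tokens.
  return Math.floor(remaining * BUDGET_FRACTION);
}

// With a 200,000-token window and 50,000 tokens already used,
// 150,000 tokens remain, so a file may use up to 90,000 tokens.
const exampleBudget = computeTokenBudget(200000, 50000);
```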

Layer 3: Preview Mode (> 5MB)

  • Return 100KB preview to prevent crashes
  • Suggests using line_range for targeted reading 👁️

Layer 4: Error Recovery

  • Catch tokenizer unreachable errors gracefully
  • Fallback to 100KB preview instead of crashing 🛡️
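
A rough sketch of the Layer 4 recovery path, assuming the tokenizer failure surfaces as a thrown error whose message contains `unreachable`. `checkWithRecovery`, `BudgetCheck`, and `PREVIEW_CHARS` are hypothetical names, the 100KB preview is approximated as 100K characters, and rethrowing unrelated errors is an assumption.

```typescript
// Only the 100KB preview size and the "unreachable" text come from the PR.
const PREVIEW_CHARS = 100 * 1024; // assumption: 100KB approximated as characters

type CountTokens = (text: string) => number;

interface BudgetCheck {
  content: string;
  isPreview: boolean;
  tokenCount?: number; // absent when the tokenizer failed
}

// Try to count tokens; if the tokenizer hits its "unreachable" failure,
// fall back to a preview instead of letting the error end the conversation.
function checkWithRecovery(content: string, countTokens: CountTokens): BudgetCheck {
  try {
    return { content, isPreview: false, tokenCount: countTokens(content) };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    if (message.indexOf("unreachable") !== -1) {
      return { content: content.slice(0, PREVIEW_CHARS), isPreview: true };
    }
    throw error; // assumption: unrelated errors are not swallowed
  }
}
```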

Benefits

✅ Dynamic budget based on actual context (no magic numbers)
✅ Real token counting using existing tokenizer
✅ 100KB previews for large files
✅ Graceful error handling prevents conversation crashes
✅ Simple (~160 lines) vs complex heuristics
✅ 17 comprehensive tests covering all scenarios

Testing

All 17 tests passing: fast path, budget validation, preview mode, error recovery, edge cases, unicode support.

Related

Closes #6667


Important

Introduces token-budget based file reading in readFileTool.ts to handle large files efficiently, with new functions for budget validation and content truncation, and comprehensive tests.

  • Behavior:
    • Implements token-budget based file reading in readFileTool.ts to prevent context overflow and tokenizer crashes.
    • Introduces validateFileTokenBudget and truncateFileContent in fileTokenBudget.ts for handling large files.
    • Handles files <100KB normally, previews first 100KB for files >5MB, and truncates content if it exceeds token budget.
  • Testing:
    • Adds fileTokenBudget.spec.ts with 17 tests covering scenarios like small files, large files, budget validation, and error handling.
    • Updates line-counter.spec.ts to test line and token counting with error handling.
  • Misc:
    • Updates line-counter.ts to include token estimation alongside line counting.
    • Modifies readFileTool.spec.ts to test new file reading behavior.

This description was created by Ellipsis for 169fb35.

Implements a simple, token-budget based file reading system that prevents
context window overflow and tokenizer crashes.

Problem:
- Files could fill the entire context window, causing context overflow
- tiktoken crashes with 'unreachable' error on files >5MB
- PR #6667's approach was too complex with magic numbers

Solution - Multi-Layer Defense:
1. Fast path: Files <100KB skip validation (no overhead)
2. Token validation: 100KB-5MB files use real token counting
   - Budget: (contextWindow - currentTokens) * 0.6
   - Smart truncation if exceeds budget
3. Preview mode: Files >5MB get 100KB preview (prevents crashes)
4. Error recovery: Catch tokenizer 'unreachable' errors gracefully

Key Features:
- No magic numbers - dynamic based on actual context
- Real token counting using existing tokenizer
- 100KB previews for large files (perfect size for structure visibility)
- Graceful error handling prevents conversation crashes
- Simple implementation (~160 lines vs complex heuristics)

Testing:
- 17 comprehensive tests covering all scenarios
- All tests passing including edge cases and error conditions

Files:
- src/core/tools/helpers/fileTokenBudget.ts: Core validation logic
- src/core/tools/helpers/__tests__/fileTokenBudget.spec.ts: Test suite
- src/core/tools/readFileTool.ts: Integration into read file tool
@daniel-lxs daniel-lxs requested review from cte, jr and mrubens as code owners October 23, 2025 14:15
@dosubot dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels Oct 23, 2025
roomote bot commented Oct 23, 2025

Starting my review of this PR—comments incoming soon! 🚀


Issues to Address

All previously flagged issues have been resolved:

  • The lines attribute showed totalLines from the original file even after content was truncated (see inline comment on readFileTool.ts:619)
  • Empty lines were filtered out when counting displayed lines, causing an incorrect line count in the lines attribute (see inline comment on readFileTool.ts:620-621)
  • Line counting logic had an off-by-one error for files ending with a newline character (see inline comment on readFileTool.ts:620)

Latest Review: No new issues found. All previous concerns have been properly addressed.

@hannesrudolph hannesrudolph added the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Oct 23, 2025
Improvements:
- Preview files (>5MB) now use token counting to respect budget
- Read only 100KB preview initially, then validate with tokenizer
- If preview exceeds budget, truncate accordingly
- Better error handling with conservative character-based estimation
- All 17 tests passing
- Added getTokenUsage mock to createMockCline for readFileTool tests
- Added contextWindow to model info mock
- Updated fileTokenBudget test expectations for error handling
- All 59 tests now passing (42 readFileTool + 17 fileTokenBudget)
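
The "truncate accordingly" step above could look roughly like this. It is a sketch under stated assumptions, not the PR's `truncateFileContent`: the name `truncateToBudgetSketch`, the proportional first cut, and the 10% shrink loop are all illustrative choices.

```typescript
type CountTokens = (text: string) => number;

interface Truncation {
  content: string;
  truncated: boolean;
}

// Cut content to a prefix whose token count fits the budget.
function truncateToBudgetSketch(content: string, budget: number, countTokens: CountTokens): Truncation {
  const total = countTokens(content);
  if (total <= budget) {
    return { content, truncated: false };
  }
  if (budget <= 0) {
    return { content: "", truncated: true };
  }
  // First guess: keep the same fraction of characters as budget / total tokens.
  let end = Math.floor(content.length * (budget / total));
  let slice = content.slice(0, end);
  // Token density varies through a file, so re-check and shrink by 10% until it fits.
  while (slice.length > 0 && countTokens(slice) > budget) {
    end = Math.floor(end * 0.9);
    slice = content.slice(0, end);
  }
  return { content: slice, truncated: true };
}
```

Re-counting after each cut costs extra tokenizer calls, but it guarantees the returned prefix is within budget as measured by the tokenizer that was passed in.
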
…ncation

- Previously used original file totalLines, causing mismatch after truncation
- Now computes displayedLines from truncated content and sets lines="1-N"
- Prevents the LLM from referencing nonexistent line numbers
- All tests passing (59/59)
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Review] in Roo Code Roadmap Oct 23, 2025
…ation

- Count all lines (including empty) when computing lines="1-N"
- Prevents under-reporting when truncated preview contains blank lines
- Tests remain green (42/42 readFileTool, 17/17 fileTokenBudget)
@RooCodeInc RooCodeInc deleted a comment from roomote bot Oct 23, 2025
@hannesrudolph hannesrudolph added the PR - Needs Review label and removed the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Oct 23, 2025
@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 23, 2025
Integrated countFileLinesAndTokens into validateFileTokenBudget:
- Streams file once with chunked token estimation (256-line chunks)
- Early exits when budget exceeded (saves I/O and memory)
- Preserves all existing safety checks:
  - Fast path for <100KB files
  - Preview mode for >5MB files
  - Error handling for tokenizer crashes
  - Fallback to full read if streaming fails
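
The single-pass, chunked counting with early exit can be sketched as below. This is not the actual `countFileLinesAndTokens`: to stay self-contained it pulls lines from a hypothetical `nextLine` callback standing in for a file stream, and `countLinesAndTokensSketch`, `StreamCount`, and `CHUNK_LINES` are made-up names. Only the 256-line chunk size and the early-exit idea come from the commit message.

```typescript
const CHUNK_LINES = 256; // chunk size from the commit message

type CountTokens = (text: string) => number;
type NextLine = () => string | undefined; // returns undefined at end of input

interface StreamCount {
  lineCount: number;
  tokenEstimate: number;
  exitedEarly: boolean;
}

// Read lines one at a time, tokenize them in 256-line chunks, and stop
// pulling more input as soon as the running estimate exceeds the budget.
function countLinesAndTokensSketch(nextLine: NextLine, budget: number, countTokens: CountTokens): StreamCount {
  let lineCount = 0;
  let tokenEstimate = 0;
  let chunk: string[] = [];

  for (let line = nextLine(); line !== undefined; line = nextLine()) {
    lineCount++;
    chunk.push(line);
    if (chunk.length === CHUNK_LINES) {
      tokenEstimate += countTokens(chunk.join("\n"));
      chunk = [];
      if (tokenEstimate > budget) {
        return { lineCount, tokenEstimate, exitedEarly: true };
      }
    }
  }
  // Tokenize whatever is left in the final partial chunk.
  if (chunk.length > 0) {
    tokenEstimate += countTokens(chunk.join("\n"));
  }
  return { lineCount, tokenEstimate, exitedEarly: false };
}
```

Because the budget is checked only at chunk boundaries, the estimate can overshoot by up to one chunk before the loop stops; in exchange, the tokenizer only ever sees small inputs.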

Benefits:
- Single file pass with early exit vs full read + tokenize
- Prevents loading large files into memory unnecessarily
- Conservative fallback on tokenizer errors (2 chars = 1 token)
- All existing tests passing (59/59)
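
The conservative fallback (2 chars = 1 token) amounts to something like the following sketch. `estimateTokensConservatively` and `safeCountTokens` are hypothetical names, and rounding up is an assumption.

```typescript
const FALLBACK_CHARS_PER_TOKEN = 2; // ratio from the commit message

// Character-based estimate used only when the real tokenizer fails.
function estimateTokensConservatively(text: string): number {
  // Assumption: round up so a partial token still counts against the budget.
  return Math.ceil(text.length / FALLBACK_CHARS_PER_TOKEN);
}

// Prefer the real tokenizer; fall back to the estimate if it throws.
function safeCountTokens(text: string, countTokens: (text: string) => number): number {
  try {
    return countTokens(text);
  } catch {
    return estimateTokensConservatively(text);
  }
}
```

Assuming few characters per token makes the estimate err high, so a file is more likely to be truncated than to overflow the context when the tokenizer is unavailable.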

Files:
- src/integrations/misc/line-counter.ts: Added countFileLinesAndTokens()
- src/core/tools/helpers/fileTokenBudget.ts: Integrated streaming
- src/integrations/misc/__tests__/line-counter.spec.ts: Basic tests
Two fixes:
1. Line counting off-by-one: Files ending with \n now count correctly
   - "line1\nline2\n" now correctly shows lines="1-2" not lines="1-3"
   - Consistent with countFileLines() behavior
   - Prevents LLM confusion about line numbers

2. Fixed line-counter.spec.ts mocking:
   - Use proper Readable stream instead of mock object
   - Properly mock fs.createReadStream with stream interface
   - All 63 tests passing (42 readFileTool + 17 fileTokenBudget + 4 line-counter)

Files changed:
- src/core/tools/readFileTool.ts: Handle trailing newline in line count
- src/integrations/misc/__tests__/line-counter.spec.ts: Fix stream mocking
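
The line-count behavior described in the first fix (blank lines counted, trailing newline not counted as an extra line) can be sketched like this. `countDisplayedLines` and `formatLinesAttribute` are hypothetical names, and treating empty content as zero lines is an assumption the PR does not state.

```typescript
// Count the lines a reader would see, including blank ones.
function countDisplayedLines(content: string): number {
  if (content.length === 0) {
    return 0; // assumption: empty content has no lines
  }
  const newlineCount = content.split("\n").length - 1;
  // A trailing "\n" terminates the last line rather than starting a new one.
  const endsWithNewline = content.charAt(content.length - 1) === "\n";
  return endsWithNewline ? newlineCount : newlineCount + 1;
}

// Build the lines="1-N" attribute from the (possibly truncated) content.
function formatLinesAttribute(content: string): string {
  return `lines="1-${countDisplayedLines(content)}"`;
}
```

Computing N from the content that is actually returned, instead of from the original file, keeps the reported range in step with what the model can see.
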
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
@mrubens mrubens merged commit 93c13e2 into main Oct 23, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Oct 23, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Oct 23, 2025
@mrubens mrubens deleted the feat/token-budget-file-reading branch October 23, 2025 19:35