Conversation

Contributor

@liwilliam2021 liwilliam2021 commented Aug 4, 2025

Implemented a simple version of: #6319

We now always stop reading after a limited read. We also use simpler heuristics that validate without calling the tokenizer, while remaining conservative.

Goal: prevent infinite hanging on large files when partial reads are off. This may limit the model's ability to read large files; that limitation is addressed in the next PR.


Important

Introduces file size validation in readFileTool.ts to prevent crashes from large files by limiting reads based on context size.

  • Behavior:
    • Introduces file size validation in readFileTool.ts using validateFileSizeForContext from new contextValidator.ts.
    • Stops reading files when they exceed a safe content limit to prevent context overflow.
    • Adds notices for partial reads due to context limitations.
  • Tests:
    • Updates readFileTool.spec.ts to test new validation logic and partial read notices.
    • Adds tests for maxChars parameter in read-lines.spec.ts to ensure character limit handling.
  • Localization:
    • Adds showingOnlyLines translation in multiple tools.json files for different languages.
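
The validation flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `validateFileSizeForContext` name and the 40% buffer constant appear in the diff, but the 4 chars/token estimate and the result shape are assumptions.

```typescript
// Sketch of context-based file size validation; illustrative, not the real code.
// FILE_READ_BUFFER_PERCENTAGE (40%) appears in the diff; the chars/token
// estimate and ValidationResult shape are assumptions.
const FILE_READ_BUFFER_PERCENTAGE = 0.4
const CHARS_PER_TOKEN_ESTIMATE = 4 // rough heuristic that avoids calling the tokenizer

interface ValidationResult {
  shouldLimit: boolean
  safeCharLimit: number
}

function validateFileSizeForContext(
  fileSizeBytes: number,
  contextWindow: number,
  currentTokens: number,
): ValidationResult {
  const remainingTokens = contextWindow - currentTokens
  const usableTokens = Math.floor(remainingTokens * (1 - FILE_READ_BUFFER_PERCENTAGE))
  const safeCharLimit = usableTokens * CHARS_PER_TOKEN_ESTIMATE
  return { shouldLimit: fileSizeBytes > safeCharLimit, safeCharLimit }
}
```

For example, with an 8K context window already 6K tokens full, a 1MB file would be cut to roughly 5K characters, which is what triggers the partial-read notice.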

This description was created by Ellipsis for 75fb09a.

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. bug Something isn't working labels Aug 4, 2025
@liwilliam2021 liwilliam2021 changed the title from "Will/max read fix" to "fix: stop reading big files that crash context" Aug 4, 2025
Contributor

@roomote roomote bot left a comment

Thank you for your contribution! I've reviewed the changes and the implementation looks solid overall. The approach to prevent infinite hanging on large files is well thought out. I've left some suggestions inline that could improve performance and robustness.

/**
 * Conservative buffer percentage for file reading.
 * We use a very conservative estimate to ensure files fit in context.
 */
const FILE_READ_BUFFER_PERCENTAGE = 0.4 // 40% buffer for safety
Contributor

Is the 40% buffer intentionally this conservative? It might be worth making this configurable or adjusting based on model capabilities. Some models might handle closer-to-limit content better than others.

Contributor Author

For now yes.

Collaborator

It seems like we shouldn’t need to be so conservative here if the rest of the logic is working right

Contributor Author

yeah sorry-- I think I just picked a big number for the simple version

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 4, 2025
- Fix Hindi translation punctuation
- Fix race condition by checking stream.destroyed
- Optimize newline counting with regex
- Performance improvements for large file handling
- Defensive programming for end parameter already in place
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 5, 2025
@hannesrudolph hannesrudolph removed the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 5, 2025
Member

@daniel-lxs daniel-lxs left a comment

Thank you @liwilliam2021, this finally defeated my file.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 5, 2025
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Needs Review] in Roo Code Roadmap Aug 5, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Aug 6, 2025
const remainingTokens = contextWindow - currentlyUsed
const usableTokens = Math.floor(remainingTokens * (1 - FILE_READ_BUFFER_PERCENTAGE))

// Reserve space for response (use 25% of remaining or 4096, whichever is smaller)
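
Completing the snippet per its own comment, the reservation might look something like this. It is a sketch under assumptions: `computeReadBudget` and `responseReserve` are illustrative names, not the actual code; only the budget math and the 25%/4096 rule come from the snippet above.

```typescript
// Sketch of the response-space reservation described in the snippet's comment.
// Function and variable names are assumed for illustration.
const FILE_READ_BUFFER_PERCENTAGE = 0.4

function computeReadBudget(contextWindow: number, currentlyUsed: number): number {
  const remainingTokens = contextWindow - currentlyUsed
  const usableTokens = Math.floor(remainingTokens * (1 - FILE_READ_BUFFER_PERCENTAGE))
  // Reserve space for response (use 25% of remaining or 4096, whichever is smaller)
  const responseReserve = Math.min(Math.floor(remainingTokens * 0.25), 4096)
  return Math.max(usableTokens - responseReserve, 0)
}
```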
Collaborator

We should use the common logic for this

// For large files or when approaching limits, always limit
if (fileSizeBytes > safeCharLimit || fileSizeBytes > LARGE_FILE_SIZE) {
// Use a very conservative limit
const finalLimit = Math.min(safeCharLimit, 100000) // Cap at 100K chars
Collaborator

I think this might annoy people who are trying to use a model with a large context window to read large files

Contributor Author

I think the plan for this PR was to do something stupid to fix the temporary error-- basically never reading big files

The follow-up PR has the full implementation and doesn't have that limit: #6319

@daniel-lxs daniel-lxs moved this from PR [Needs Review] to PR [Changes Requested] in Roo Code Roadmap Aug 23, 2025
@hannesrudolph hannesrudolph moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Sep 23, 2025
@hannesrudolph
Collaborator

Let's get this across the finish line

daniel-lxs added a commit that referenced this pull request Oct 23, 2025
Implements a simple, token-budget based file reading system that prevents
context window overflow and tokenizer crashes.

Problem:
- Files could fill entire context window causing issues
- tiktoken crashes with 'unreachable' error on files >5MB
- PR #6667's approach was too complex with magic numbers

Solution - Multi-Layer Defense:
1. Fast path: Files <100KB skip validation (no overhead)
2. Token validation: 100KB-5MB files use real token counting
   - Budget: (contextWindow - currentTokens) * 0.6
   - Smart truncation if exceeds budget
3. Preview mode: Files >5MB get 100KB preview (prevents crashes)
4. Error recovery: Catch tokenizer 'unreachable' errors gracefully

Key Features:
- No magic numbers - dynamic based on actual context
- Real token counting using existing tokenizer
- 100KB previews for large files (perfect size for structure visibility)
- Graceful error handling prevents conversation crashes
- Simple implementation (~160 lines vs complex heuristics)

Testing:
- 17 comprehensive tests covering all scenarios
- All tests passing including edge cases and error conditions

Files:
- src/core/tools/helpers/fileTokenBudget.ts: Core validation logic
- src/core/tools/helpers/__tests__/fileTokenBudget.spec.ts: Test suite
- src/core/tools/readFileTool.ts: Integration into read file tool
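
The layered flow from this commit message could be sketched as follows. The thresholds (100KB fast path, 5MB preview cutoff, 0.6 budget factor) come from the message above; `decideRead` and the decision shape are hypothetical, not the actual `fileTokenBudget.ts` API.

```typescript
// Sketch of the multi-layer defense described in the commit message.
// Thresholds are from the message; names and shapes are illustrative.
const FAST_PATH_BYTES = 100 * 1024            // layer 1: <100KB skips validation
const PREVIEW_CUTOFF_BYTES = 5 * 1024 * 1024  // layer 3: >5MB avoids tokenizer crash
const PREVIEW_BYTES = 100 * 1024
const BUDGET_FACTOR = 0.6

type ReadDecision =
  | { mode: "full" }
  | { mode: "preview"; bytes: number }
  | { mode: "validate"; tokenBudget: number }

function decideRead(fileSizeBytes: number, contextWindow: number, currentTokens: number): ReadDecision {
  if (fileSizeBytes < FAST_PATH_BYTES) return { mode: "full" } // fast path, no overhead
  if (fileSizeBytes > PREVIEW_CUTOFF_BYTES) return { mode: "preview", bytes: PREVIEW_BYTES }
  // layer 2: real token counting against a dynamic budget
  const tokenBudget = Math.floor((contextWindow - currentTokens) * BUDGET_FACTOR)
  return { mode: "validate", tokenBudget }
}
```

The budget is derived from the live context state rather than a fixed constant, which is what the message means by "no magic numbers".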
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Oct 23, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Oct 23, 2025
Labels

bug Something isn't working lgtm This PR has been approved by a maintainer PR - Needs Preliminary Review size:L This PR changes 100-499 lines, ignoring generated files.

Projects

Status: Done
