-
Notifications
You must be signed in to change notification settings - Fork 2.6k
feat: add safeguard for large files in readFileTool when maxReadFileLine is -1 #6174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ine is -1 - Add token counting check using tiktoken for files over 1000 lines - Automatically switch to partial read (first 2000 lines) when token count exceeds 50k - Add fallback safeguard for very large files (>5000 lines) when token counting fails - Include informative notice explaining why partial read is being used - Add comprehensive test coverage for all safeguard scenarios This prevents consuming the entire context window when reading very large files.
|
@roomote-agent This looks promising, but can we get access to the actual context window size of the current model via the reference to the Task object and use that instead of assuming a 100K window size? Additionally, 1,000 lines is a bit low; let's bump that up to 10K lines. |
|
Hi @cte! I see your feedback. Ill work on:
Let me check out the branch and make these changes. |
src/core/tools/readFileTool.ts
Outdated
| const lineRangeAttr = ` lines="1-${totalLines}"` | ||
| // Handle normal file read with safeguard for large files | ||
| // Define thresholds for the safeguard | ||
| const LARGE_FILE_LINE_THRESHOLD = 1000 // Consider files with more than 1000 lines as "large" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving the safeguard threshold constants (LARGE_FILE_LINE_THRESHOLD, MAX_TOKEN_THRESHOLD, FALLBACK_MAX_LINES) to a shared configuration or constants module to avoid magic numbers and improve maintainability.
This comment was generated because it violated a code review rule: irule_tTqpIuNs8DV0QFGj.
src/core/tools/readFileTool.ts
Outdated
| if (tokenCount > MAX_TOKEN_THRESHOLD) { | ||
| shouldApplySafeguard = true | ||
| linesToRead = FALLBACK_MAX_LINES | ||
| safeguardNotice = `<notice>This file contains ${totalLines} lines and approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safeguard notice message is hardcoded in English. For internationalization, consider using the i18n function (e.g., t()) so that the message can be translated for all supported languages.
This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.
|
@roomote-agent Instead of line count let's use file size to detect files that potentially contain a lot of tokens; this should handle the case where there's a file with a huge amount of content on a single line. |
- Access actual context window size from Task object via cline.api.getModel().info.contextWindow - Increase LARGE_FILE_LINE_THRESHOLD from 1,000 to 10,000 lines - Calculate MAX_TOKEN_THRESHOLD as 50% of actual context window instead of hardcoded 50K - Update tests to reflect new thresholds and mock api.getModel() properly
|
Thanks for the feedback @cte! I understand - switching from line count to file size makes sense to better handle files with large amounts of content on single lines. I'll update the implementation to use file size as the primary detection method for potentially token-heavy files. Working on this now. |
|
Hi @cte! Ive successfully implemented both requested changes: ✅ Changes made:
📝 Implementation details:
🔍 CI Status:
The changes ensure that the safeguard adapts to different model context windows while being less aggressive about triggering for moderately-sized files. |
- Replace line count threshold with file size threshold (100KB) - Files larger than 100KB trigger token count check - Files larger than 1MB automatically apply safeguard if token counting fails - Update tests to reflect new file size-based approach - This better handles files with large amounts of content on single lines As requested by @cte in PR comment
|
Hi @cte! I've successfully implemented the requested change to use file size instead of line count for detecting large files. Summary of changes:
This approach better handles edge cases where files might have a huge amount of content on single lines, as you mentioned. The implementation still respects the dynamic context window calculation that was added in the previous commit. All tests are passing locally, and CI checks are currently running. The changes have been pushed to this PR branch. |
src/core/tools/readFileTool.ts
Outdated
| if (tokenCount > MAX_TOKEN_THRESHOLD) { | ||
| shouldApplySafeguard = true | ||
| linesToRead = FALLBACK_MAX_LINES | ||
| safeguardNotice = `<notice>This file is ${fileSizeKB}KB and contains approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using the i18n translation function for the safeguard notice message (e.g. on line 550) instead of hardcoded English text, to ensure consistency with other user‐facing messages.
This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.
…d i18n - Replace line-based limiting with character-based limiting to handle files with very long lines - Move threshold constants to packages/types/src/file-limits.ts for better organization - Add readLinesWithCharLimit function that truncates at complete line boundaries - Optimize file reading to avoid double reads when checking token count - Add i18n support for safeguard notice messages in all 18 supported languages - Update tests to match new character-based implementation - Safeguard now limits by character count (200KB default) instead of line count - Ensures files are never truncated in the middle of a line
|
Closing in favor of #6319 |
Fixes #6155
This PR implements a safeguard to prevent consuming the entire context window when reading large files with
maxReadFileLineset to -1.Problem
When
maxReadFileLineis set to -1, thereadFileToolreads the full contents of a file. This can be problematic if the file is huge since it could consume the entire context window.Solution
Added a basic safeguard that:
maxReadFileLineis -1Changes
src/core/tools/readFileTool.tsto add the safeguard logicsrc/core/tools/__tests__/readFileTool.spec.tsTesting
maxReadFileLineis not -1Performance Considerations
Important
Adds a safeguard in
readFileToolto handle large files by limiting lines read whenmaxReadFileLineis -1, with new utility and tests.readFileTool.tsto limit lines read for large files whenmaxReadFileLineis -1.tiktokento count tokens and applies safeguard if token count exceeds 50,000.readLinesWithCharLimitinread-lines-char-limit.tsto read lines up to a character limit.readFileTool.spec.tsfor various scenarios including large files, token count checks, and line range handling.read-lines-char-limit.spec.tsfor the new utility function.This description was created by
for f52f374. You can customize this summary. It will automatically update as commits are pushed.