feat: Add file read caching to prevent redundant reads in conversation history #4501
Conversation
attempt 2.
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
…ve error handling
src/core/tools/readFileTool.ts
Outdated
```typescript
// Track file read
await cline.fileContextTracker.trackFileContext(relPath, "read_tool" as RecordSource)

const stats = fs.statSync(fullPath)
```
Avoid using synchronous fs.statSync inside an async function to prevent blocking the event loop. Consider using fs.promises.stat for a non-blocking alternative.
Suggested change:
```diff
- const stats = fs.statSync(fullPath)
+ const stats = await fs.promises.stat(fullPath)
```
- Implement MemoryAwareCache with 100MB limit and LRU eviction
- Fix syntax error in processAndFilterReadRequest function
- Add proper error handling for file permissions (EACCES, EPERM)
- Handle file deletion scenarios by removing from cache
- Add logging for cache evictions and errors
- Update imports to use fs/promises for test compatibility

All tests passing (12/12)
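For context, a minimal sketch of what a memory-aware LRU cache along these lines could look like; the class shape, method names, and byte accounting are assumptions based on this commit message, not the PR's actual code:

```typescript
// Sketch only: a Map preserves insertion order, so re-inserting on access
// keeps the least recently used entry at the front for eviction.
class MemoryAwareCache<V> {
	private map = new Map<string, { value: V; bytes: number }>()
	private totalBytes = 0

	constructor(private maxBytes = 100 * 1024 * 1024) {} // 100MB limit

	get(key: string): V | undefined {
		const entry = this.map.get(key)
		if (!entry) return undefined
		// Re-insert to mark this key as most recently used.
		this.map.delete(key)
		this.map.set(key, entry)
		return entry.value
	}

	set(key: string, value: V, bytes: number): void {
		this.delete(key)
		this.map.set(key, { value, bytes })
		this.totalBytes += bytes
		// Evict least recently used entries until back under the memory limit.
		while (this.totalBytes > this.maxBytes && this.map.size > 1) {
			const oldest = this.map.keys().next().value as string
			console.warn(`Cache eviction: ${oldest}`)
			this.delete(oldest)
		}
	}

	delete(key: string): void {
		const entry = this.map.get(key)
		if (entry) {
			this.totalBytes -= entry.bytes
			this.map.delete(key)
		}
	}
}
```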
daniel-lxs left a comment
Hey @Mnehmos, thank you for taking this issue.
Overall, I think it's worth figuring out whether this issue actually needs a complex implementation like this to prevent reads of files that were recently read.
Keeping the content of the files in a cache just to determine whether a file read should be rejected might be a bit of overkill.
I'd like to hear your thoughts on this.
```typescript
const isMultipleReadsEnabled = maxConcurrentReads > 1

return `## read_file
Description: Request to read the contents of ${isMultipleReadsEnabled ? "one or more files" : "a file"}. The tool outputs line-numbered content (e.g. "1 | const x = 1") for easy reference when creating diffs or discussing code.${args.partialReadsEnabled ? " Use line ranges to efficiently read specific portions of large files." : ""} Supports text extraction from PDF and DOCX files, but may not handle other binary files properly.
```
I couldn't find any mention of why partial reads are being permanently enabled. Is this change intentional, given that partial reads can be disabled in the settings?
This change is intentional and addresses Issue #4009. Let me clarify what's actually happening.

The Problem Being Solved:
The "Always read entire file" setting (maxReadFileLine = -1) prohibited line-range reads entirely, forcing users to read complete files even when they had specific line numbers from:
- git grep -n results
- Compiler/linter error messages
- search_files output
- Manual diffs with line references

What This Change Does:
- Preserves existing behavior: when no <line_range> is specified, entire files are still read
- Adds intelligent choice: the model can now choose line ranges when contextually appropriate
- Maintains the setting's intent: "Always read entire file" becomes the default, not an absolute restriction

Technical Detail (sketched below):
- Previously, partialReadsEnabled = maxReadFileLine !== -1 meant unlimited readers couldn't see line-range options
- Now, line ranges are always available in the tool interface, letting the model make smart decisions based on context

This turns a rigid limitation into flexible behavior: the model gets entire files by default but can target specific lines when it has line numbers to work with.
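A rough illustration of the flag change described above (the flag and setting names come from this thread; the surrounding code is assumed, not taken from the PR):

```typescript
// Before (as described in this thread): line ranges were hidden whenever
// "Always read entire file" (maxReadFileLine = -1) was active:
//   const partialReadsEnabled = maxReadFileLine !== -1
//
// After: line ranges are always offered in the tool description; reading the
// entire file remains the default when no <line_range> is given:
const partialReadsEnabled = true
```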
Is there a simpler way of checking that a recently read file hasn't changed? In codebase indexing we use hashes to verify that the content of a file hasn't changed. If the file read is being rejected, do we need to keep a cache of the whole file, or would keeping a hash be a better option?
After reviewing the code, I need to clarify the current implementation.

Current Implementation Reality (sketched below):
- The cache does NOT store full file content
- It only stores metadata: { mtime: string, size: number }
- Cache decisions are based on conversation history analysis plus mtime comparison
- Memory tracking is for metadata size limits, not content storage
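A minimal sketch of that metadata check; isCacheHit and CacheEntry are hypothetical names for illustration, not the PR's actual API:

```typescript
import { stat } from "node:fs/promises"

interface CacheEntry {
	mtime: string // file mtime recorded when it was last read
	size: number // file size in bytes at that time
}

// Hypothetical helper: a repeated read is only filtered out when the file's
// current metadata still matches the cached entry.
async function isCacheHit(cache: Map<string, CacheEntry>, fullPath: string): Promise<boolean> {
	const entry = cache.get(fullPath)
	if (!entry) return false
	try {
		const stats = await stat(fullPath)
		return stats.mtime.toISOString() === entry.mtime && stats.size === entry.size
	} catch {
		// File deleted or unreadable (e.g. EACCES/EPERM): drop the stale entry.
		cache.delete(fullPath)
		return false
	}
}
```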
Your Hash Suggestion Benefits:
- More reliable: hashes detect actual content changes, whereas mtime can be manipulated
- Already proven: works well in your codebase indexing
- Potentially simpler: could replace the mtime + conversation history analysis

Current Approach Issues:
- mtime comparison can miss cases where file content changes but the timestamp is preserved
- Conversation history parsing is complex
- It still requires file stat calls
Hash-Based Alternative:

```typescript
interface HashCacheEntry {
	hash: string // Content hash
	lastRead: number // Timestamp of last read
	lineRanges: LineRange[] // Ranges read at this hash
}
```
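Generating the hash could be as simple as the following sketch (Node's built-in crypto is assumed, as is the choice of SHA-256; hashFile is a hypothetical helper):

```typescript
import { createHash } from "node:crypto"
import { readFile } from "node:fs/promises"

// Hypothetical helper: hash a file's contents so a later read can be compared
// against the cached HashCacheEntry.hash instead of relying on mtime.
async function hashFile(fullPath: string): Promise<string> {
	const data = await readFile(fullPath)
	return createHash("sha256").update(data).digest("hex")
}
```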
Trade-off Question:
The hash approach requires reading files to generate hashes, which adds I/O cost. However, it provides stronger content change detection than mtime alone.
Would the hash generation cost be acceptable given the improved reliability and potential to simplify the conversation history analysis logic?
PR for feature/4009-pr2-file-read-caching
Related GitHub Issue
Closes: #4009
Description
This PR introduces a file read caching service to improve performance by caching the content of files that have been read. This helps to avoid re-reading the same file multiple times during a conversation.
This PR specifically addresses the following feedback from the original implementation:
Test Procedure
Run pnpm test to execute all tests and ensure that the new tests for the caching service pass and that there are no regressions.
Type of Change
Pre-Submission Checklist
Screenshots / Videos
N/A
Documentation Updates
None required.
Additional Notes
This is the second of three PRs to address the work in issue #4009.
Important
Introduces a file read caching service with LRU cache, integrates it with existing tools, and adds tests for improved performance and error handling.
- Adds fileReadCacheService.ts for caching file contents with an LRU cache.
- Adds processAndFilterReadRequest() to manage cache hits and misses.
- Integrates fileReadCacheService.ts into readFileTool.ts, applyDiffTool.ts, and writeToFileTool.ts.
- Updates readFileTool.ts to use cache results for reading files.
- Adds fileReadCacheService.spec.ts and readFileTool.test.ts for the caching logic.
- Adds an lruCache.ts utility for cache management.
- Updates esbuild.mjs to clean the assets directory.

This description was created by Ellipsis for 2071612. You can customize this summary. It will automatically update as commits are pushed.