feat: prevent large text file read context pollution #17468
Implement a configurable byte threshold (512 KiB default) for text file reads. When exceeded without an explicit limit parameter, the read fails with guidance to use chunked reading or surgical tools.
- Add TEXT_FILE_READ_TOO_BROAD error type
- Allow limit=-1 to explicitly read entire file
- Any explicit limit bypasses threshold check
- Remove 2000-char line truncation (threshold gates large files instead)
- Remove 2000-line default limit
- Rename formatMemoryUsage to formatBytes (core package only)
- Configurable via GEMINI_TEXT_FILE_READ_THRESHOLD_BYTES env var

Closes google-gemini#14991
Update documentation for the new text file read threshold feature:
- file-system.md: Update read_file tool with large file handling section, remove outdated 2000-line default references
- configuration.md: Add GEMINI_TEXT_FILE_READ_THRESHOLD_BYTES env var
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes
Hello @Nubebuster, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the agent's ability to manage context and prevent performance degradation when interacting with large text files. By introducing a configurable size threshold for text file reads, the system now proactively guides the agent towards more efficient and targeted data access methods, such as reading in chunks or using specialized tools. This change also streamlines file processing by removing previous automatic truncation mechanisms, ensuring that when a file is read, its content is fully preserved up to the defined threshold or explicit limits.
Update test to reflect that files under the byte threshold are read in full without automatic line truncation.
77ddc52 to 281e383
Removed Claude Code as co-author, which was causing the CLA failure.
Code Review
This pull request introduces a valuable feature to prevent context pollution by adding a configurable size threshold for reading text files. It also correctly removes the previous, less effective line-length truncation logic. The implementation is generally solid, and the documentation updates are clear. My review identified two high-severity issues, which are detailed in the comments. First, the parsing of the new environment variable for the threshold is not fully robust and could lead to confusing behavior if misconfigured. Second, the read_many_files tool does not appear to handle cases where a file exceeds the new size threshold, which could cause the tool to fail unexpectedly. I have provided detailed comments and suggestions to address these points.
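The first concern in the review (robust parsing of the threshold environment variable) could be handled along the following lines. This is only a minimal sketch, not the PR's actual code: `parseThresholdBytes` and `DEFAULT_THRESHOLD_BYTES` are hypothetical names, and falling back to the default on any malformed value is an assumption. The source does not say how a value of 0 should be treated, so the sketch simply returns it unchanged.

```typescript
// 512 KiB default, as stated in the PR description.
const DEFAULT_THRESHOLD_BYTES = 512 * 1024;

// Hypothetical helper (assumption, not from the PR): turn the raw value of
// GEMINI_TEXT_FILE_READ_THRESHOLD_BYTES into a byte count, falling back to
// the default whenever the value is missing or malformed.
function parseThresholdBytes(
  raw: string | undefined,
  fallback: number = DEFAULT_THRESHOLD_BYTES,
): number {
  if (raw === undefined) return fallback;
  const trimmed = raw.trim();
  // Accept plain decimal digits only; this rejects "", "512KB", "-1",
  // "1e6" and "0x10", which looser parsing could silently misread.
  if (!/^\d+$/.test(trimmed)) return fallback;
  const value = Number(trimmed);
  // Guard against digit strings too large to represent exactly.
  return Number.isSafeInteger(value) ? value : fallback;
}
```

A caller would pass in the raw environment value, for example `parseThresholdBytes(process.env["GEMINI_TEXT_FILE_READ_THRESHOLD_BYTES"])`. Rejecting malformed values outright, rather than accepting a numeric prefix, avoids a typo such as `512KB` silently becoming a 512-byte threshold.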
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Hi there! Thank you for your contribution to Gemini CLI. We really appreciate the time and effort you've put into this pull request. To keep our backlog manageable and ensure we're focusing on current priorities, we are closing pull requests that haven't seen maintainer activity for 30 days. Currently, the team is prioritizing work associated with 🔒 maintainer only or help wanted issues. If you believe this change is still critical, please feel free to comment with updated details. Otherwise, we encourage contributors to focus on open issues labeled as help wanted. Thank you for your understanding!
Resubmitting #15175 as a new pull request, since I cannot re-open the original.
Update: This is still an important feature; please review. It already received an LGTM:
Originally posted by @LIHUA919 in #15175 (review)
Summary
Add a configurable text file read threshold (512 KiB default) that fails overly broad text file reads with guidance, preventing context pollution from massive file reads. Also removes the old automatic line-count and line-length truncation logic.
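The gating rule in this summary can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's implementation: `decideTextRead`, `ReadDecision`, and the error message wording are hypothetical, while the `TEXT_FILE_READ_TOO_BROAD` error type, the 512 KiB default, the `limit=-1` convention, and the rule that any explicit limit bypasses the check come from the commit message.

```typescript
// Hypothetical result type (assumption): either proceed with the read or
// fail with the error type named in the PR.
type ReadDecision =
  | { kind: "read"; limit: number | undefined }
  | { kind: "error"; type: "TEXT_FILE_READ_TOO_BROAD"; message: string };

// Hypothetical helper (assumption): decide whether a text file read may
// proceed, given the file size and the caller's optional `limit`.
function decideTextRead(
  fileSizeBytes: number,
  limit: number | undefined,
  thresholdBytes: number = 512 * 1024,
): ReadDecision {
  // Any explicit limit bypasses the threshold check; limit=-1 means
  // "read the entire file on purpose".
  if (limit !== undefined) {
    return { kind: "read", limit: limit === -1 ? undefined : limit };
  }
  // No explicit limit: fail reads of files that exceed the threshold.
  if (fileSizeBytes > thresholdBytes) {
    return {
      kind: "error",
      type: "TEXT_FILE_READ_TOO_BROAD",
      message:
        `File is ${fileSizeBytes} bytes, over the ${thresholdBytes}-byte threshold. ` +
        "Use offset/limit to read in chunks, or limit=-1 to read the whole file.",
    };
  }
  return { kind: "read", limit: undefined };
}
```

Deciding from the file size before reading any content is what lets the check fail early, before the content reaches the model's context.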
Details
Problem: The agent often reads large text files (logs, JSON, JSONL, CSV, etc.) that consume excessive context tokens, degrading attention and increasing latency. The current 2k-lines/2k-chars truncation doesn't adequately address this: truncated content often strips key information while retaining useless metadata.
Solution: Fail suspected overly-broad reads before they happen, guiding the agent to adapt:
- Chunked reading with offset/limit
- Surgical tools (grep, head, tail, jq)
- limit=-1 when a full read is intentional

Implementation notes:
- formatMemoryUsage → formatBytes rename is partial (core only); PR "refactor: rename formatMemoryUsage to formatBytes" #14997 completes the refactor project-wide

Open questions for reviewers:
- settings.json instead of env var?

Related Issues
Closes #14991
How to Validate
Create a large text file (>512 KiB):
Start Gemini CLI and ask it to read the file:
Expected: Read fails with an error guiding to use offset/limit or limit=-1
Ask with explicit limit:
Expected: Read succeeds, full content returned
Test threshold override:
Expected: Read succeeds (threshold disabled)
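For the first validation step, one way to produce a file above the default threshold is to generate the content programmatically. The sketch below is illustrative only: `buildLargeText` is a hypothetical helper, and the line format is arbitrary. It builds ASCII-only text, so the string length equals the UTF-8 byte count.

```typescript
// 512 KiB default threshold from the PR description.
const THRESHOLD_BYTES = 512 * 1024;

// Hypothetical helper (assumption): build log-like ASCII text whose size is
// strictly greater than minBytes.
function buildLargeText(minBytes: number): string {
  const lines: string[] = [];
  let size = 0;
  for (let i = 0; size <= minBytes; i++) {
    const line = `line ${i}: sample log entry for threshold testing\n`;
    lines.push(line);
    size += line.length; // ASCII only, so characters == bytes in UTF-8
  }
  return lines.join("");
}

const content = buildLargeText(THRESHOLD_BYTES);
```

The resulting string can then be written to disk, for instance with Node's `fs.writeFileSync` and a file name of your choosing, before asking the CLI to read it.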