Feature Proposal: Implement read_file history deduplication to increase context quality and longevity #6279

@hannesrudolph

What specific problem does this solve?

When large language models work on tasks that require re-reading files multiple times, the read_file results accumulate in the conversation history (apiConversationHistory). The accumulated copies consume context and can confuse the model, as the sections below describe.

Who is affected: All users working with Roo on tasks that involve multiple file reads, especially those working with large codebases or long-running tasks.

When this happens:

  • During iterative development where the AI needs to reference the same files repeatedly
  • When debugging issues that require checking file contents multiple times
  • During refactoring tasks that involve reading and modifying the same files

Current behavior: Each time a file is read, the complete file content is added to the conversation history, even if the same file was read recently. This leads to:

  • Rapid consumption of the context window (e.g., reading a 500-line file 5 times uses 2,500 lines of context)
  • Potential confusion when the AI sees multiple versions of the same file
  • Reduced effectiveness as important context gets pushed out by redundant file reads

Current behavior

Currently, every read_file operation adds its complete result to the conversation history, regardless of whether that exact file was read before. There is no deduplication mechanism in place.

Proposed solution

Implement a deduplication mechanism that keeps only the most recent read of each file in the conversation history. When the experimental feature READ_FILE_DEDUPLICATION is enabled, the system will automatically remove older reads of the same file whenever a new read occurs.

Key aspects:

  • Deduplication happens immediately after each file read
  • Only the most recent read of each file is preserved
  • The feature is opt-in via experimental settings
  • Works with both single and multi-file read operations

Impact

Who benefits: All Roo users, especially those working on:

  • Large codebases where files are frequently re-read
  • Long-running tasks that reference the same files multiple times
  • Iterative development workflows
  • Debugging sessions that require checking file state repeatedly

How it helps:

  • Increased context efficiency: Reduces redundant information in conversation history
  • Extended conversation longevity: Tasks can run longer before hitting context limits
  • Improved accuracy: AI sees only the most recent file state, reducing confusion
  • Better performance: A smaller context means fewer tokens to send and process on each request

Technical Context

Based on codebase analysis:

  • The apiConversationHistory stores all tool results including file reads
  • File reads are stored as text blocks with format: [read_file for 'path'] Result:
  • The experimental feature system is already in place
  • Existing tests expect a deduplicateReadFileHistory method, but it is not implemented yet
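
As an illustration of the text-block format noted above, a read result can be recognised by its header. This is a minimal sketch, assuming the header appears at the very start of the block exactly as [read_file for 'path'] Result: and that the path contains no single quote. The helper name extractReadFilePath is hypothetical and not part of the codebase.

```typescript
// Hypothetical helper: pulls the file path out of a read_file result header.
// Assumes the header looks like: [read_file for 'src/app.ts'] Result:
const READ_FILE_HEADER = /^\[read_file for '([^']+)'\] Result:/

function extractReadFilePath(blockText: string): string | undefined {
	const match = READ_FILE_HEADER.exec(blockText)
	// Blocks that are not read_file results yield undefined.
	return match ? match[1] : undefined
}
```

Blocks from other tools, or read results whose header deviates from this format, would simply be ignored by such a check.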

Implementation approach:

  1. Add READ_FILE_DEDUPLICATION to experimental features
  2. Implement deduplicateReadFileHistory method in Task class
  3. Call deduplication after successful file reads in readFileTool

🔍 Comprehensive Issue Scoping

Root Cause / Implementation Target

When large language models re-read files multiple times during a conversation, the read_file results accumulate in the apiConversationHistory, causing excessive context consumption and potential confusion. The system currently lacks a mechanism to deduplicate these redundant file reads.

Affected Components

  • Primary Files:

    • src/shared/experiments.ts: Add new experimental feature flag
    • src/core/task/Task.ts (lines ~330-350): Add deduplicateReadFileHistory method
    • src/core/tools/readFileTool.ts (lines ~610-615): Integrate deduplication call
  • Secondary Impact:

    • Test files that mock or use Task class
    • Any tools that rely on conversation history structure
    • Experimental settings UI

Current Implementation Analysis

The system uses apiConversationHistory to store all API messages. When readFileTool executes, it adds results using pushToolResult, which creates text blocks in user messages with format: [read_file ...] Result: followed by XML-structured file content. These accumulate without any deduplication.
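
For multi-file reads, the paths would have to come from the XML body rather than the header. The sketch below assumes each file entry carries a <path> element; the issue only states that the content is XML-structured, so the exact element names are an assumption, and extractPathsFromReadResult is a hypothetical helper.

```typescript
// Assumed result shape (not confirmed by the issue):
// <files><file><path>src/a.ts</path><content>...</content></file></files>
function extractPathsFromReadResult(resultBody: string): string[] {
	const pathTag = /<path>([^<]+)<\/path>/g
	const paths: string[] = []
	let match: RegExpExecArray | null
	// Collect every <path> element in document order.
	while ((match = pathTag.exec(resultBody)) !== null) {
		paths.push(match[1].trim())
	}
	return paths
}
```

A regular expression is enough for a sketch, but it would also match a literal <path> tag inside unescaped file content, so a real implementation may need stricter parsing.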

Proposed Implementation

Step 1: Add experimental feature flag

  • File: src/shared/experiments.ts
  • Changes: Add READ_FILE_DEDUPLICATION to EXPERIMENT_IDS and experimentConfigsMap
  • Rationale: Allows safe opt-in testing without affecting all users
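
A rough sketch of what Step 1 could look like follows. It is a simplified stand-in for src/shared/experiments.ts: the names EXPERIMENT_IDS, experimentConfigsMap, POWER_STEERING and READ_FILE_DEDUPLICATION come from this proposal, while the string values, the config shape and the isExperimentEnabled helper are assumptions made for illustration.

```typescript
// Simplified stand-in for src/shared/experiments.ts; the real shapes may differ.
const EXPERIMENT_IDS = {
	POWER_STEERING: "powerSteering", // assumed id string
	READ_FILE_DEDUPLICATION: "readFileDeduplication", // proposed addition, assumed id string
} as const

type ExperimentKey = keyof typeof EXPERIMENT_IDS
type ExperimentId = (typeof EXPERIMENT_IDS)[ExperimentKey]

const experimentConfigsMap: Record<ExperimentKey, { enabled: boolean }> = {
	POWER_STEERING: { enabled: false },
	READ_FILE_DEDUPLICATION: { enabled: false }, // opt-in: off by default
}

// Hypothetical lookup: an explicit user setting wins, otherwise the default applies.
function isExperimentEnabled(userSettings: Partial<Record<ExperimentId, boolean>>, key: ExperimentKey): boolean {
	const id = EXPERIMENT_IDS[key]
	return userSettings[id] ?? experimentConfigsMap[key].enabled
}
```

Keeping the default at false means existing users see no change until they enable the setting.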

Step 2: Add deduplicateReadFileHistory method to Task class

  • File: src/core/task/Task.ts
  • Changes: Add a public method that checks the experimental flag and iterates through apiConversationHistory in reverse, removing older reads of any file that appears more than once
  • Rationale: Keeps only the most recent read of each file while preserving message structure
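
The reverse traversal in Step 2 could be sketched as below. This is not the actual Task method: it takes the history and the flag as parameters so it can run on its own, uses simplified message types, and assumes the [read_file for 'path'] Result: header described under Technical Context. Replacing an emptied message with a short note is also an assumption, chosen here so that the alternation of user and assistant messages stays intact.

```typescript
// Simplified message shapes; the real apiConversationHistory types may differ.
type HistoryTextBlock = { type: "text"; text: string }
type HistoryMessage = { role: "user" | "assistant"; content: HistoryTextBlock[] }

// Header format taken from the issue text: [read_file for 'path'] Result:
const READ_RESULT_HEADER = /^\[read_file for '([^']+)'\] Result:/

// Hypothetical placeholder so a user message never ends up empty.
const REMOVED_NOTE: HistoryTextBlock = { type: "text", text: "[older read_file result removed]" }

function deduplicateReadFileHistory(messages: HistoryMessage[], enabled: boolean): HistoryMessage[] {
	if (!enabled) return messages // experimental flag off: history is left untouched

	const seenPaths = new Set<string>()
	const newestFirst: HistoryMessage[] = []

	// Walk newest to oldest so the first read seen for a path is the one to keep.
	for (const message of messages.slice().reverse()) {
		if (message.role !== "user") {
			newestFirst.push(message)
			continue
		}
		const keptBlocks = message.content
			.slice()
			.reverse()
			.filter((block) => {
				const path = READ_RESULT_HEADER.exec(block.text)?.[1]
				if (path === undefined) return true // not a read_file result
				if (seenPaths.has(path)) return false // older read of a file that was read again later
				seenPaths.add(path)
				return true
			})
			.reverse()
		newestFirst.push({ ...message, content: keptBlocks.length > 0 ? keptBlocks : [REMOVED_NOTE] })
	}
	return newestFirst.reverse()
}
```

The function returns a new array instead of mutating its input, which would fit a pattern where the result is written back through overwriteApiConversationHistory.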

Step 3: Integrate deduplication into readFileTool

  • File: src/core/tools/readFileTool.ts
  • Changes: Call cline.deduplicateReadFileHistory() after successful file reads
  • Rationale: Ensures deduplication happens immediately after new reads are added
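
The call site in Step 3 might look roughly like this. The sketch is a synchronous, heavily simplified stand-in for the end of readFileTool: the TaskLike interface, the finishReadFile function and the readResult shape are hypothetical; only cline.deduplicateReadFileHistory() and pushToolResult are named in this proposal.

```typescript
// Hypothetical minimal surface of the Task instance ("cline") used by the tool.
interface TaskLike {
	deduplicateReadFileHistory(): void
}

// Simplified stand-in for the tail of readFileTool; the real signature differs.
function finishReadFile(
	cline: TaskLike,
	pushToolResult: (result: string) => void,
	readResult: { ok: boolean; xml: string },
): void {
	pushToolResult(readResult.xml)
	// Deduplicate only after a successful read, once the new result has been pushed.
	if (readResult.ok) {
		cline.deduplicateReadFileHistory()
	}
}
```

Since the method itself checks the experimental flag, the tool would not need a second flag check at the call site.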

Code Architecture Considerations

  • Follow existing patterns for experimental features (see POWER_STEERING implementation)
  • Follow existing patterns for message manipulation (see overwriteApiConversationHistory)
  • Preserve message structure integrity
  • Handle both single and multi-file read operations
  • Ensure deduplication works across all file read patterns

Testing Requirements

  • Unit Tests:
    • Test experimental flag enables/disables feature
    • Test basic deduplication with duplicate file reads
    • Test multi-file read handling
    • Test message structure preservation
    • Test edge cases (empty messages, malformed content)
    • Test deduplication with mixed file operations
  • Integration Tests:
    • Test readFileTool integration
    • Test performance with large conversation histories

Performance Impact

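  • Expected performance change: Minimal (one O(n) pass over the messages in the conversation history per read)
  • Optimization: Stop processing once an older read has been found for every newly read file; this is only safe if deduplication ran after every earlier read, so that at most one older copy of each file remains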
  • Memory impact: Reduced due to smaller conversation history
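
One way to read the early-exit optimization is sketched below. It assumes the history has already been reduced to one entry per block (the path that block read, or undefined) and that deduplication ran after every earlier read, so each path has at most one older copy. The function name findStaleReadIndexes and its parameters are hypothetical.

```typescript
// blockPaths[i] is the path read by history block i, or undefined for other blocks.
// freshPaths are the files just read; firstFreshIndex is where their blocks start.
function findStaleReadIndexes(
	blockPaths: Array<string | undefined>,
	freshPaths: string[],
	firstFreshIndex: number,
): number[] {
	const pending = new Set(freshPaths)
	const stale: number[] = []
	// Walk backwards and stop as soon as every fresh path has been matched.
	for (let i = firstFreshIndex - 1; i >= 0 && pending.size > 0; i--) {
		const path = blockPaths[i]
		if (path !== undefined && pending.has(path)) {
			stale.push(i)
			pending.delete(path) // at most one older copy is expected
		}
	}
	return stale
}
```

If a newly read file has no older copy, the loop still walks the whole history, so the worst case remains O(n).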

Security Considerations

  • No security implications - only removes redundant copies of data already in the conversation history
  • No external data exposure
  • No authentication/authorization changes

Migration Strategy

Not applicable - feature is backwards compatible and handles existing conversation formats.

Rollback Plan

Feature can be disabled by toggling off the experimental setting in the UI.

Dependencies and Breaking Changes

  • No external dependencies affected
  • No API contract changes
  • No breaking changes for users

Metadata

Labels

  • Issue/PR - Triage: New issue. Needs quick review to confirm validity and assign labels.
  • enhancement: New feature or request
  • proposal

Status

Done
