@samhvw8 samhvw8 commented Jun 3, 2025

Related GitHub Issue

Closes: #4287

Description

  • Allow specific binary formats (.pdf, .docx, .ipynb) to be processed by extractTextFromFile
  • Block unsupported binary files with existing "Binary file" notice
  • Update tests to cover both supported and unsupported binary file scenarios
  • Refactor test mocks for better maintainability and coverage
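
The first two bullets can be sketched as a small decision helper. This is only an illustration under stated assumptions: `SUPPORTED_BINARY_EXTENSIONS`, `BinaryDecision`, and `decideBinaryHandling` are hypothetical names, not the identifiers used in the PR, and the real tool hands supported files to extractTextFromFile rather than returning a tag.

```typescript
import * as path from "node:path";

// Hypothetical allowlist: the three formats named in this PR description.
const SUPPORTED_BINARY_EXTENSIONS = new Set([".pdf", ".docx", ".ipynb"]);

// Hypothetical result type: either extract text or block with a notice.
type BinaryDecision =
	| { action: "extract" }
	| { action: "block"; notice: string };

// Decide what to do with a file that has already been detected as binary.
function decideBinaryHandling(filePath: string): BinaryDecision {
	// Compare extensions case-insensitively so "REPORT.PDF" is also accepted.
	const ext = path.extname(filePath).toLowerCase();
	if (SUPPORTED_BINARY_EXTENSIONS.has(ext)) {
		return { action: "extract" };
	}
	// Everything else keeps the existing "Binary file" notice.
	return { action: "block", notice: "Binary file" };
}
```

A caller would run this only after binary detection, so plain-text files never reach the allowlist at all.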

Test Procedure

Type of Change

  • 🐛 Bug Fix: Non-breaking change that fixes an issue.
  • New Feature: Non-breaking change that adds functionality.
  • 💥 Breaking Change: Fix or feature that would cause existing functionality to not work as expected.
  • ♻️ Refactor: Code change that neither fixes a bug nor adds a feature.
  • 💅 Style: Changes that do not affect the meaning of the code (white-space, formatting, etc.).
  • 📚 Documentation: Updates to documentation files.
  • ⚙️ Build/CI: Changes to the build process or CI configuration.
  • 🧹 Chore: Other changes that don't modify src or test files.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Code Quality:
    • My code adheres to the project's style guidelines.
    • There are no new linting errors or warnings (npm run lint).
    • All debug code (e.g., console.log) has been removed.
  • Testing:
    • New and/or updated tests have been added to cover my changes.
    • All tests pass locally (npm test).
    • The application builds successfully with my changes.
  • Branch Hygiene: My branch is up-to-date (rebased) with the main branch.
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Changeset: A changeset has been created using npm run changeset if this PR includes user-facing changes or dependency updates.
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

Documentation Updates

Additional Notes

Get in Touch


Important

Add support for reading .pdf, .docx, and .ipynb files in readFileTool, updating tests accordingly.

  • Behavior:
    • readFileTool in readFileTool.ts now supports reading .pdf, .docx, and .ipynb files using extractTextFromFile.
    • Unsupported binary files are blocked with a "Binary file" notice.
  • Tests:
    • Updated tests in readFileTool.test.ts to cover supported and unsupported binary file scenarios.
    • Refactored test mocks for better maintainability and coverage.

This description was created by Ellipsis for 8389e00d0942b1fbf0c4d84ed723fae7df53c25b. You can customize this summary. It will automatically update as commits are pushed.

@samhvw8 samhvw8 requested review from cte and mrubens as code owners June 3, 2025 12:57
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels Jun 3, 2025
User‐facing text is hardcoded (e.g. 'Binary file') when blocking unsupported binary files. Use the translation function t(...) so that the UI messages can be localized.

Suggested change
- notice: "Binary file",
+ notice: t("tools:readFile.binaryFile"),
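
To illustrate what the suggestion is asking for, here is a minimal sketch of a key-based lookup. It assumes nothing about the project's actual `t` implementation or signature: the `t` below is a hypothetical stand-in, the catalogs are invented for the example, and the French string is illustrative only. Only the key `tools:readFile.binaryFile` comes from the suggestion itself.

```typescript
// Hypothetical message catalogs; the real project loads its own translations.
const EN: Record<string, string> = {
	"tools:readFile.binaryFile": "Binary file",
};
const FR: Record<string, string> = {
	"tools:readFile.binaryFile": "Fichier binaire", // illustrative translation
};
const CATALOGS: Record<string, Record<string, string>> = { en: EN, fr: FR };

// Stand-in translation function: try the requested locale, fall back to
// English, and finally return the key itself so a missing entry is visible.
function t(key: string, locale: string = "en"): string {
	const localized = CATALOGS[locale]?.[key];
	return localized ?? EN[key] ?? key;
}

// The notice is then looked up instead of hardcoded.
const notice = t("tools:readFile.binaryFile");
```

With a lookup like this, the English output is unchanged while other locales can supply their own string.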

This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.

daniel-lxs commented Jun 3, 2025

This PR fixes a regression:

  1. August 31, 2024 (commit 1d87bcf7): Initial PDF/DOCX support was added

    • Created src/utils/extract-text.ts with PDF and DOCX reading capabilities
    • Added dependencies: pdf-parse and mammoth
  2. August 31, 2024 (commit 6cbd2320): Quick fix applied to the initial implementation

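  3. May 31, 2025 (commit 9ba0cd5c): Regression introduced here. The check below short-circuits on every binary file, so PDF and DOCX files never reach extractTextFromFile: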

    // Handle binary files
    if (isBinary) {
        updateFileResult(relPath, {
            notice: "Binary file",
            xmlContent: `<file><path>${relPath}</path>\n<notice>Binary file</notice>\n</file>`,
        })
        continue
    }

This PR adds another check for supported extensions; however, extractTextFromFile already handles this.

Looks good to me.
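
The regression and the extra extension check described above can be sketched as a loop over several files. This is a hedged illustration, not the project's code: `FileEntry`, `EXTRACTABLE`, `extractText`, and `readFiles` are hypothetical names, the extractor is a placeholder, and the `<content>` wrapper is an assumed output shape. Only the `<notice>Binary file</notice>` XML is taken from the snippet above.

```typescript
import * as path from "node:path";

// Hypothetical input shape: a relative path plus a binary-detection result.
interface FileEntry {
	relPath: string;
	isBinary: boolean;
}

// Hypothetical allowlist of binary formats that have a text extractor.
const EXTRACTABLE = new Set([".pdf", ".docx", ".ipynb"]);

// Placeholder extractor; the real one parses the document contents.
async function extractText(relPath: string): Promise<string> {
	return `text of ${relPath}`;
}

async function readFiles(entries: FileEntry[]): Promise<string[]> {
	const results: string[] = [];
	for (const { relPath, isBinary } of entries) {
		const ext = path.extname(relPath).toLowerCase();
		// With only `if (isBinary)` here, every binary file took this branch.
		// The added extension check lets extractable formats fall through.
		if (isBinary && !EXTRACTABLE.has(ext)) {
			results.push(`<file><path>${relPath}</path>\n<notice>Binary file</notice>\n</file>`);
			continue;
		}
		const content = await extractText(relPath);
		results.push(`<file><path>${relPath}</path>\n<content>${content}</content>\n</file>`);
	}
	return results;
}
```

Because the blocked branch uses `continue`, one unsupported file does not stop the remaining files in the same request from being read.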

@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Review] in Roo Code Roadmap Jun 3, 2025
@samhvw8 samhvw8 force-pushed the fix/read-pdf-docx-ipynb branch from 8389e00 to f24cac3 Compare June 3, 2025 13:30
@mrubens mrubens left a comment

Nice! Looks good to me if it still looks good to @daniel-lxs (er... once tests pass)

@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Jun 3, 2025
daniel-lxs commented Jun 3, 2025

> Nice! Looks good to me if it still looks good to @daniel-lxs (er... once tests pass)

Yeah, we do have the same check twice (once in read-file.ts and once in extract-text.ts), but it might help prevent future issues if extractTextFromFile changes for some reason.

@samhvw8 samhvw8 force-pushed the fix/read-pdf-docx-ipynb branch from f24cac3 to b564634 Compare June 3, 2025 13:47
samhvw8 commented Jun 3, 2025

@mrubens @daniel-lxs tests passed now haha 😆

@mrubens mrubens merged commit 6baf28c into RooCodeInc:main Jun 3, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jun 3, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Jun 3, 2025
@samhvw8 samhvw8 deleted the fix/read-pdf-docx-ipynb branch June 3, 2025 19:49
samhvw8 added a commit to samhvw8/Roo-Cline that referenced this pull request Jun 3, 2025
…ad_file tool (RooCodeInc#4288)

Labels

  • enhancement: New feature or request
  • lgtm: This PR has been approved by a maintainer
  • PR - Needs Review
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Read multiple pdf & docx file error

3 participants