Feature: Extend Support for Multiple File Types in OCR/Extraction Service #15

Kunal-Darekar · 2025-05-14T17:44:32Z

Overview

This pull request implements the enhancements outlined in issue #3: "Extend Support for Multiple File Types in OCR/Extraction Service". It adds nine file types (DOC, XLSX, XLS, ODT, ODS, ODP, TXT, RTF, and EPUB) to the existing PDF, DOCX, and PPTX support.

Changes

Added Support for New File Types:

Microsoft Office Formats:
- Added support for .DOC (older Word format) using docx2txt
- Added support for .XLSX (Excel) using openpyxl
- Added support for .XLS (older Excel format) using xlrd
OpenDocument Formats:
- Added support for .ODT (OpenDocument Text) using odfpy
- Added support for .ODS (OpenDocument Spreadsheet) using odfpy
- Added support for .ODP (OpenDocument Presentation) using odfpy
Text Formats:
- Added support for plain text .TXT files
- Added support for rich text .RTF files using striprtf
E-book Format:
- Added support for .EPUB format using ebooklib and BeautifulSoup4
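
For the spreadsheet formats, the extracted cells have to be flattened into plain text before chunking. The PR does not show how it does this, so the following is only a minimal sketch under assumptions: `rows_to_text` is a hypothetical helper, and the tab-separated layout with a `# sheet name` heading is an illustrative choice, not the PR's actual output format. With openpyxl, rows in this shape could come from `worksheet.iter_rows(values_only=True)`.

```python
from typing import Any, Iterable


def rows_to_text(sheet_name: str, rows: Iterable[Iterable[Any]]) -> str:
    """Flatten spreadsheet rows into tab-separated lines under a sheet heading.

    Hypothetical helper: the name and the output layout are assumptions.
    """
    lines = [f"# {sheet_name}"]
    for row in rows:
        # Empty cells arrive as None; render them as empty strings.
        cells = ["" if value is None else str(value) for value in row]
        if any(cells):  # skip rows that are entirely empty
            # Drop the trailing tabs left behind by empty cells at the end of a row.
            lines.append("\t".join(cells).rstrip())
    return "\n".join(lines)
```

Keeping the flattening separate from the library-specific reading code would let the XLSX, XLS, and ODS extractors share one text layout.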

Implementation Details:

New Extraction Functions:
- Added extract_text_from_doc() for DOC files
- Added extract_text_from_xlsx() and extract_text_from_xls() for Excel files
- Added extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
- Added extract_text_from_txt() and extract_text_from_rtf() for text formats
- Added extract_text_from_epub() for EPUB files
File Type Detection:
- Updated file type detection in __init__.py for both URL and local file paths
- Updated file type detection in chunking_routes.py for API uploads
Processing Logic:
- Updated process_document_chunking() in services/chunking.py to handle all new file types
Dependencies:
- Added necessary dependencies to requirements.txt:
  - docx2txt
  - openpyxl
  - xlrd
  - odfpy
  - ebooklib
  - striprtf
  - beautifulsoup4
- Updated setup.py with the same dependencies
Documentation:
- Updated README.md to reflect the new supported file types
- Added information about new dependencies
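
The file type detection described above covers both URLs and local paths. A minimal sketch of extension-based detection follows, assuming the type is decided from the file extension alone. `SUPPORTED_EXTENSIONS`, `detect_file_type`, and the type labels are hypothetical names, not taken from the PR's `__init__.py` or `chunking_routes.py`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from extension to a type label; the labels the PR
# actually uses are not shown in the description.
SUPPORTED_EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx",
    ".doc": "doc", ".xlsx": "xlsx", ".xls": "xls",
    ".odt": "odt", ".ods": "ods", ".odp": "odp",
    ".txt": "txt", ".rtf": "rtf", ".epub": "epub",
}


def detect_file_type(source: str) -> str:
    """Return a type label for a URL or local path, based on its extension."""
    parsed = urlparse(source)
    # For http(s) URLs, use only the path so query strings and fragments
    # do not hide the extension; otherwise treat the input as a local path.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    # Normalise Windows separators, then compare the suffix case-insensitively.
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    try:
        return SUPPORTED_EXTENSIONS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

A single lookup table like this would keep the URL, local path, and API upload code paths in agreement about which extensions are accepted.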

Testing

All new file type extraction functions have been implemented with proper error handling and logging. The implementation follows the same patterns as the existing extraction functions for PDF, DOCX, and PPTX.
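
As an illustration of the error handling and logging pattern mentioned above, here is a hedged sketch of a plain-text extractor. The function name `extract_text_from_txt` comes from this PR, but the body is an assumption: the UTF-8 then Latin-1 fallback and the choice to re-raise read errors are illustrative, not the PR's confirmed behaviour.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain-text file, trying UTF-8 first and falling back to Latin-1.

    Sketch only: the encoding fallback and error policy are assumptions.
    """
    try:
        data = Path(file_path).read_bytes()
    except OSError as exc:
        # Log the failure and let the caller decide how to respond.
        logger.error("Failed to read %s: %s", file_path, exc)
        raise
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        logger.warning("%s is not valid UTF-8; decoding as Latin-1", file_path)
        return data.decode("latin-1")
```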

Resolves

This PR resolves issue #3: "Extend Support for Multiple File Types in OCR/Extraction Service"
/claim #3

This commit implements two major enhancements to the Unsiloed-chunker, plus documentation updates:

1. Extended File Type Support:
- Added support for DOC files using docx2txt
- Added support for Excel formats (XLSX, XLS) using openpyxl and xlrd
- Added support for OpenDocument formats (ODT, ODS, ODP) using odfpy
- Added support for plain text (TXT) and rich text (RTF) using striprtf
- Added support for e-book format (EPUB) using ebooklib and BeautifulSoup4
- Updated file type detection in __init__.py and chunking_routes.py
- Updated process_document_chunking to handle all new file types
- Added necessary dependencies to requirements.txt and setup.py

2. Multiple OCR/LLM Model Support:
- Verified support for multiple model providers beyond OpenAI
- Confirmed implementation of Anthropic (Claude) models
- Confirmed implementation of Hugging Face models (including Mistral)
- Confirmed implementation of local models via llama.cpp
- Ensured proper configuration and model selection

3. Documentation:
- Updated README.md to reflect new supported file types
- Updated documentation for model provider options
- Added details about new dependencies

These changes let the Unsiloed-chunker handle a wider range of file formats and model providers.

mubashir-oss · 2025-05-18T07:15:05Z

@Kunal-Darekar please attach a video

Kunal-Darekar added 3 commits May 14, 2025 21:41

Add support for multiple OCR/LLM models beyond OpenAI

7de0d98

Add model provider infrastructure and utilities

a0fe5ee

algora-pbc bot added the 🙋 Bounty claim label May 14, 2025

algora-pbc bot mentioned this pull request May 14, 2025

Extend Support for Multiple File Types in OCR/Extraction Service #3

Open

5 tasks

Kunal-Darekar force-pushed the feature/extended-file-types-and-models branch 2 times, most recently from bb82dc9 to 968393a Compare May 14, 2025 17:53
