@Kunal-Darekar
Overview

This pull request implements the enhancements outlined in issue #3: "Extend Support for Multiple File Types in OCR/Extraction Service". It significantly expands the file type support beyond the current PDF, DOCX, and PPTX formats.

Changes

Added Support for New File Types:

  1. Microsoft Office Formats:

    • Added support for .DOC (older Word format) using docx2txt
    • Added support for .XLSX (Excel) using openpyxl
    • Added support for .XLS (older Excel format) using xlrd
  2. OpenDocument Formats:

    • Added support for .ODT (OpenDocument Text) using odfpy
    • Added support for .ODS (OpenDocument Spreadsheet) using odfpy
    • Added support for .ODP (OpenDocument Presentation) using odfpy
  3. Text Formats:

    • Added support for plain text .TXT files
    • Added support for rich text .RTF files using striprtf
  4. E-book Format:

    • Added support for .EPUB format using ebooklib and BeautifulSoup4
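Each format above depends on a third-party package, except plain text. As a rough illustration, the sketch below checks whether the packages for a given extension are importable before extraction is attempted. The names `IMPORT_NAMES`, `PACKAGES_BY_EXTENSION`, and `missing_packages` are hypothetical and not taken from this PR; the only facts assumed are the package names listed above and their usual import names (`odf` for odfpy, `bs4` for beautifulsoup4).

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> top-level import name (hypothetical helper table).
IMPORT_NAMES: Dict[str, str] = {
    "docx2txt": "docx2txt",
    "openpyxl": "openpyxl",
    "xlrd": "xlrd",
    "odfpy": "odf",
    "striprtf": "striprtf",
    "ebooklib": "ebooklib",
    "beautifulsoup4": "bs4",
}

# File extension -> pip packages the PR description names for that format.
PACKAGES_BY_EXTENSION: Dict[str, List[str]] = {
    "doc": ["docx2txt"],
    "xlsx": ["openpyxl"],
    "xls": ["xlrd"],
    "odt": ["odfpy"],
    "ods": ["odfpy"],
    "odp": ["odfpy"],
    "txt": [],  # plain text needs no third-party package
    "rtf": ["striprtf"],
    "epub": ["ebooklib", "beautifulsoup4"],
}


def missing_packages(extension: str) -> List[str]:
    """Return the pip packages for `extension` that are not importable."""
    ext = extension.lower().lstrip(".")
    if ext not in PACKAGES_BY_EXTENSION:
        raise ValueError(f"Unknown extension: {extension!r}")
    return [
        pkg
        for pkg in PACKAGES_BY_EXTENSION[ext]
        if find_spec(IMPORT_NAMES[pkg]) is None
    ]
```

A caller could use `missing_packages("epub")` to produce a clear "please install ..." message instead of an `ImportError` raised deep inside an extractor.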

Implementation Details:

  1. New Extraction Functions:

    • Added extract_text_from_doc() for DOC files
    • Added extract_text_from_xlsx() and extract_text_from_xls() for Excel files
    • Added extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
    • Added extract_text_from_txt() and extract_text_from_rtf() for text formats
    • Added extract_text_from_epub() for EPUB files
  2. File Type Detection:

    • Updated file type detection in __init__.py for both URL and local file paths
    • Updated file type detection in chunking_routes.py for API uploads
  3. Processing Logic:

    • Updated process_document_chunking() in services/chunking.py to handle all new file types
  4. Dependencies:

    • Added necessary dependencies to requirements.txt:
      • docx2txt
      • openpyxl
      • xlrd
      • odfpy
      • ebooklib
      • striprtf
      • beautifulsoup4
    • Updated setup.py with the same dependencies
  5. Documentation:

    • Updated README.md to reflect the new supported file types
    • Added information about new dependencies
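The detection and processing steps listed above can be sketched as follows. This is a minimal illustration, assuming detection is based on the file extension; the actual code in `__init__.py`, `chunking_routes.py`, and `process_document_chunking()` is not shown in this PR description, so `SUPPORTED_EXTENSIONS`, `detect_file_type`, and `extract_text` are hypothetical names.

```python
import os
from typing import Callable, Dict
from urllib.parse import urlparse

# Existing formats plus the ones this PR adds (assumed set, for illustration).
SUPPORTED_EXTENSIONS = {
    "pdf", "docx", "pptx",
    "doc", "xlsx", "xls",
    "odt", "ods", "odp",
    "txt", "rtf", "epub",
}


def detect_file_type(source: str) -> str:
    """Return the lowercase extension for a URL or a local file path."""
    parsed = urlparse(source)
    # For URLs, use only the path component so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return ext


def extract_text(source: str, extractors: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch to the extractor registered for the detected file type."""
    file_type = detect_file_type(source)
    if file_type not in extractors:
        raise ValueError(f"No extractor registered for: {file_type}")
    return extractors[file_type](source)
```

A dictionary of extractors keeps the dispatch in one place, so adding a format means registering one more function (for example `extract_text_from_epub`) rather than extending an `if`/`elif` chain.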

Testing

All new file type extraction functions have been implemented with proper error handling and logging. The implementation follows the same patterns as the existing extraction functions for PDF, DOCX, and PPTX.
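As an illustration of the error-handling and logging pattern described here, a plain-text extractor might look like the sketch below. Only the function name `extract_text_from_txt` comes from this PR; the body, the UTF-8 decoding with replacement of undecodable bytes, and the choice to log and re-raise are assumptions, not the PR's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_from_txt(file_path: str) -> str:
    """Read a plain-text file and return its contents as a string."""
    try:
        # errors="replace" keeps extraction going if a few bytes are not
        # valid UTF-8 (an assumed policy, chosen here for illustration).
        with open(file_path, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()
    except OSError as exc:
        # Log with context, then re-raise so the caller decides what to do.
        logger.error("Failed to extract text from %s: %s", file_path, exc)
        raise
```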

Resolves

This PR resolves issue #3: "Extend Support for Multiple File Types in OCR/Extraction Service"
/claim #3

This commit implements two major enhancements to the Unsiloed-chunker, along with the accompanying documentation updates:

1. Extended File Type Support:
   - Added support for DOC files using docx2txt
   - Added support for Excel formats (XLSX, XLS) using openpyxl and xlrd
   - Added support for OpenDocument formats (ODT, ODS, ODP) using odfpy
   - Added support for plain text (TXT) and rich text (RTF) using striprtf
   - Added support for e-book format (EPUB) using ebooklib and BeautifulSoup4
   - Updated file type detection in __init__.py and chunking_routes.py
   - Updated process_document_chunking to handle all new file types
   - Added necessary dependencies to requirements.txt and setup.py

2. Multiple OCR/LLM Model Support:
   - Verified support for multiple model providers beyond OpenAI
   - Confirmed implementation of Anthropic (Claude) models
   - Confirmed implementation of Hugging Face models (including Mistral)
   - Confirmed implementation of local models via llama.cpp
   - Ensured proper configuration and model selection

3. Documentation:
   - Updated README.md to reflect new supported file types
   - Updated documentation for model provider options
   - Added details about new dependencies

These changes significantly enhance the flexibility and utility of the
Unsiloed-chunker by supporting a wider range of file formats and model providers.
@Kunal-Darekar force-pushed the feature/extended-file-types-and-models branch 2 times, most recently from bb82dc9 to 968393a on May 14, 2025 17:53
@mubashir-oss
Contributor

@Kunal-Darekar please attach a video
