luffy-orf commented May 11, 2025

Description

Extends the document processing service to support additional file types beyond the original PDF, DOCX, and PPTX.

Resolves #3

Changes

  • Added support for:
    • Microsoft Office: DOC, XLSX, XLS
    • OpenDocument: ODT, ODS, ODP
    • Text: TXT, RTF
    • E-books: EPUB
  • Implemented text extraction handlers for each format
  • Added required dependencies
  • Updated file type validation
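The PR's actual handler code is not shown in this thread. As a rough illustration only, per-format extraction handlers and extension-based file type validation could be wired together as below. The names `HANDLERS`, `register`, `is_supported`, and `extract_text` are hypothetical, and only two of the listed formats are sketched; the RTF handler assumes the `striprtf` package named in the dependency list.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase extension to an extraction function.
HANDLERS: Dict[str, Callable[[Path], str]] = {}


def register(*extensions: str):
    """Decorator that registers one handler for one or more extensions."""
    def wrap(func):
        for ext in extensions:
            HANDLERS[ext.lower()] = func
        return func
    return wrap


@register(".txt")
def extract_txt(path: Path) -> str:
    # Plain text needs no third-party library; undecodable bytes are replaced.
    return path.read_text(encoding="utf-8", errors="replace")


@register(".rtf")
def extract_rtf(path: Path) -> str:
    # Imported lazily so this module loads even when striprtf is not installed.
    from striprtf.striprtf import rtf_to_text
    return rtf_to_text(path.read_text(encoding="utf-8", errors="replace"))


def is_supported(filename: str) -> bool:
    """File type validation: accept only extensions with a registered handler."""
    return Path(filename).suffix.lower() in HANDLERS


def extract_text(path: Path) -> str:
    """Dispatch to the handler for the file's extension, or raise ValueError."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return handler(path)
```

With a registry like this, validation and dispatch share one source of truth: adding a format means registering one more handler, and `is_supported` picks it up automatically.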

Dependencies Added

  • openpyxl, xlrd (Excel)
  • odfpy (OpenDocument)
  • striprtf (RTF)
  • ebooklib, beautifulsoup4 (EPUB)

Demo:

https://drive.google.com/file/d/1lXL6kiJuQFdLX2UrFl48x2zmYYzLza5z/view?usp=sharing
/claim #3

@luffy-orf
Author

@mubashir-oss Please review the PR and let me know if any changes are required.

@mubashir-oss
Contributor

@luffy-orf Please attach a short recording.

@luffy-orf
Author

@mubashir-oss Sure

@luffy-orf
Author

@mubashir-oss
Contributor

mubashir-oss commented May 20, 2025

@luffy-orf It is not working for XLSX (semantic chunking). Please check.

@luffy-orf
Author

luffy-orf commented May 20, 2025

@mubashir-oss Fixed the .xlsx errors for large files. Here's the working video; I have tested it with both small and large files: https://drive.google.com/file/d/1KMKc9qZA_5xloGdfkIYu-bNVobQ4l0mJ/view?usp=sharing

Latest full testing demo:
https://drive.google.com/file/d/1lXL6kiJuQFdLX2UrFl48x2zmYYzLza5z/view?usp=sharing
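The fix itself is not shown in this thread. One common way to keep large workbooks manageable, sketched here under assumptions, is to stream rows with openpyxl's read-only mode and group them into bounded text chunks before they reach the chunker. The helper names `row_to_text`, `chunk_rows`, and `iter_xlsx_rows` and the `max_chars` default are hypothetical, and the grouping is plain size-based batching, not the semantic chunking the service uses.

```python
from typing import Iterable, Iterator, List, Sequence


def row_to_text(row: Sequence[object]) -> str:
    """Join non-empty cells with tabs; empty cells (None) are skipped."""
    return "\t".join(str(cell) for cell in row if cell is not None)


def chunk_rows(rows: Iterable[Sequence[object]], max_chars: int = 2000) -> Iterator[str]:
    """Group consecutive rows into chunks no longer than max_chars.

    A single row longer than max_chars is emitted as its own chunk.
    """
    buffer: List[str] = []
    size = 0
    for row in rows:
        line = row_to_text(row)
        if not line:
            continue  # skip fully empty rows
        if buffer and size + len(line) + 1 > max_chars:
            yield "\n".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += len(line) + 1  # +1 accounts for the joining newline
    if buffer:
        yield "\n".join(buffer)


def iter_xlsx_rows(path: str) -> Iterator[Sequence[object]]:
    """Stream rows from every sheet without loading the whole workbook."""
    from openpyxl import load_workbook  # lazy import; requires openpyxl
    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        for sheet in workbook.worksheets:
            yield from sheet.iter_rows(values_only=True)
    finally:
        workbook.close()
```

Usage would look like `for chunk in chunk_rows(iter_xlsx_rows("big.xlsx")): ...`, where `big.xlsx` is a placeholder path. Because both functions are generators, memory use stays proportional to one chunk rather than the whole file.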

@luffy-orf
Author

@mubashir-oss @adnan-cto A review of the PR would be appreciated.
