extend document type support #10
Open
Extend Document Type Support
Closes: #3
Description
This PR extends the document processing capabilities of Unsiloed by adding support for additional file formats. The service can now handle a wider range of document types while maintaining the existing chunking strategies and processing pipeline.
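One way to add formats without touching the chunking pipeline is to route each file to an extractor by its extension. The sketch below is a hypothetical illustration, not the PR's actual code: the `EXTRACTORS` registry and the `extract_text()` entry point are assumed names, and only the plain-text extractor is implemented, using the standard library alone.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_text_from_txt(path: str) -> str:
    """Read a plain-text file, replacing undecodable bytes instead of failing."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a lowercase extension to its extractor.
# Only .txt is filled in here; the other extractors named in this PR
# (for .DOC, .XLSX, .ODT, .EPUB, ...) would be registered the same way.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": extract_text_from_txt,
}


def extract_text(path: str) -> str:
    """Select an extractor by file extension (case-insensitive) and run it."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

Under this assumption, downstream chunking only ever receives a string, so supporting a new format means adding one registry entry.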
New Supported File Types
- .DOC (Word documents)
- .XLSX and .XLS (Excel spreadsheets)
- .ODT (OpenDocument Text)
- .ODS (OpenDocument Spreadsheet)
- .ODP (OpenDocument Presentation)
- .TXT (Plain text)
- .RTF (Rich text format)
- .EPUB (Electronic publication)
Changes
- extract_text_from_doc() for .DOC files
- extract_text_from_xlsx() and extract_text_from_xls() for Excel files
- extract_text_from_odt(), extract_text_from_ods(), and extract_text_from_odp() for OpenDocument formats
- extract_text_from_txt() for plain text files
- extract_text_from_rtf() for rich text files
- extract_text_from_epub() for ebook files

- textract for .DOC and .RTF files
- pandas for .XLSX and .XLS files
- odfpy for OpenDocument formats
- ebooklib for .EPUB files
Testing
Please test the following scenarios:
Dependencies
New dependencies have been added to setup.py:
Documentation
/claim #3