Implement Multi-Format Document Content Extraction Framework by gsantoshkumar1999 · Pull Request #233 · souzatharsis/podcastfy

gsantoshkumar1999 · 2025-01-30T18:02:03Z

This PR introduces comprehensive multi-format content extraction capabilities, enhances error handling, and adds robust test coverage. The changes enable seamless extraction of text from PDFs, Microsoft Office documents (DOC/DOCX, XLS/XLSX/XLSM, PPT/PPTX), and plain text files, supporting both local and web-hosted sources.

Fixes #1
Fixes #138

Key Enhancements

1. Enhanced PDFExtractor

✨ Web & Local PDF Support: Extract text from both URL-hosted and local PDFs.
📄 Page-Wise Extraction: Improved parsing to handle large documents efficiently.
🔍 Text Normalization: Handle special characters and whitespace for cleaner output.
🛠️ Error Handling: Graceful failure and logging for invalid URLs, corrupted files, and extraction errors.
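
The text normalization step could look roughly like the sketch below. It is an illustration only, using the standard library; `normalize_text` is a hypothetical helper name and the exact cleanup rules are assumptions, not the code in this PR.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean text pulled from a PDF page (hypothetical helper, not the PR's code)."""
    # NFKC folds compatibility forms: ligatures such as U+FB01 become "fi",
    # and non-breaking spaces become ordinary spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control characters (form feeds, carriage returns, ...) except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Squeeze three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Applying this per page, before concatenating pages, keeps memory use bounded for large documents.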

2. Microsoft Office Suite Support

📄 Word Documents
- .doc and .docx support
- Paragraph-level extraction
- Format preservation options
📎 Excel Workbooks
- .xls, .xlsx, .xlsm handling
- Multi-sheet support
🎞️ PowerPoint Presentations
- .ppt and .pptx compatibility
- Slide-wise content extraction
📝 Text File Processing
- Auto-detect file encodings
- Local and URL-based sources
- UTF-8 and extended charset support
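
One way to auto-detect a text file's encoding with only the standard library is sketched below. The helper name `decode_text` and the fallback order (BOM check, then UTF-8, then cp1252, then latin-1) are assumptions made for illustration; the PR may rely on a detection library instead.

```python
import codecs

# Fallback order is an assumption for this sketch, not taken from the PR.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")


def decode_text(data: bytes) -> str:
    """Decode raw file bytes, guessing the encoding (hypothetical helper)."""
    # A byte-order mark identifies the encoding unambiguously.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises.
    return data.decode("latin-1")
```

The same function works for local and URL-based sources, since both end up as a `bytes` payload before decoding.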

3. Cloud-Hosted File Integration

☁️ Directly integrate cloud-hosted documents (e.g., Google Drive, GCP Cloud Storage buckets, AWS S3) into podcast generation pipelines.

4. Unified ContentExtractor

🤖 Smart Source Detection: Auto-identify file types (PDF, Office, Text) and sources (local vs. URL).
🔄 Modular Integration: Leverage PDFExtractor, OfficeExtractor, and TextExtractor for multi-format support.
📊 Main Method Updates: Tested with diverse inputs (e.g., web PDFs, local XLSX files, URL-hosted DOCX).
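
A minimal sketch of the source detection idea follows. `classify_source` and the `KINDS` table are hypothetical names; the extension groups mirror the formats listed in this description, and treating only `.txt` as plain text is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension groups mirror the formats listed in the PR description.
# Treating only ".txt" as plain text is an assumption of this sketch.
KINDS = {
    "pdf": {".pdf"},
    "office": {".doc", ".docx", ".xls", ".xlsx", ".xlsm", ".ppt", ".pptx"},
    "text": {".txt"},
}


def classify_source(source: str) -> tuple[str, str]:
    """Return (location, kind), e.g. ("url", "pdf"). Hypothetical helper."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    location = "url" if is_url else "local"
    # For URLs, look only at the path so query strings do not hide the suffix.
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    for kind, extensions in KINDS.items():
        if suffix in extensions:
            return location, kind
    return location, "unknown"
```

A dispatcher can then map the returned kind to PDFExtractor, OfficeExtractor, or TextExtractor. Extension sniffing is only a first pass: a URL without a file suffix would need a Content-Type check, which this sketch leaves out.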

5. Comprehensive Test Suite

✅ Expanded Coverage: Unit tests for edge cases (large files, malformed URLs, encoding issues).
🧪 Mocked Web Requests: Safely simulate web-hosted file interactions.
🛡️ Error Scenario Tests: Validate handling of timeouts, invalid paths, and unsupported formats.
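
Mocking web requests can be sketched with `unittest.mock` as shown below. `fetch_bytes` is a hypothetical stand-in for the extractors' download step, and it uses `urllib.request` to stay dependency-free; the PR's own tests may patch a different HTTP client.

```python
import urllib.request
from unittest import mock


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download a document; hypothetical stand-in for the extractors' fetch step."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # URLError and TimeoutError both derive from OSError
        raise RuntimeError(f"could not fetch {url}") from exc


def test_fetch_returns_body():
    with mock.patch("urllib.request.urlopen") as fake_urlopen:
        # MagicMock supports the context-manager protocol out of the box.
        fake_urlopen.return_value.__enter__.return_value.read.return_value = b"%PDF-1.7"
        assert fetch_bytes("https://example.com/a.pdf") == b"%PDF-1.7"
        fake_urlopen.assert_called_once_with("https://example.com/a.pdf", timeout=10.0)


def test_fetch_wraps_timeout():
    # No network access happens: the patched urlopen raises immediately.
    with mock.patch("urllib.request.urlopen", side_effect=TimeoutError):
        try:
            fetch_bytes("https://example.com/slow.pdf")
        except RuntimeError as exc:
            assert isinstance(exc.__cause__, TimeoutError)
        else:
            raise AssertionError("expected RuntimeError")
```

Both test functions follow pytest's naming convention, so they would be collected automatically if placed in a test module.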

gsantoshkumar1999 · 2025-01-31T06:57:56Z

Hi @souzatharsis,
Could you please review this PR and run the checks?
Thanks :)

souzatharsis · 2025-02-01T20:06:48Z

Hi Santosh, many thanks for your PR.

I think it enables useful features for users.
However, it adds complexity to the implementation.
Instead of implementing parsers per document type, I'd favor using solutions such as Docling, which offers (i) a unified way to parse multi-type documents, (ii) an implementation widely supported by the open source community, and (iii) advanced OCR capabilities.

I'd recommend taking a look at Docling, which could potentially deliver the same or a better outcome with considerably fewer lines of code.

What do you think?
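
For reference, a Docling-based extractor might be as small as the sketch below. `DocumentConverter`, `convert`, and `export_to_markdown` follow Docling's quickstart usage as far as I know; the `extract_markdown` wrapper and its injectable `converter` parameter are hypothetical, added so the function can be exercised without Docling installed.

```python
from typing import Any, Optional


def extract_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert a local path or URL to Markdown via Docling (sketch)."""
    if converter is None:
        # Imported lazily so this module loads even without docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    # One call covers every format the converter supports; no per-type branching.
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Passing a stub `converter` in tests avoids loading Docling's models, which keeps the suite fast.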

gsantoshkumar1999 · 2025-02-02T02:46:35Z

Sure @souzatharsis
I'll have a look and see if I can contribute using Docling 👍🏽

gsantoshkumar1999 added 6 commits January 30, 2025 22:50

Enhance PDF Extractor with URL and Robust Text Extraction Support

463fde0

Add TextExtractor for robust text file content extraction

28a03d9

Add OfficeExtractor for extracting text from Microsoft Office documents (.doc, .docx, .xls, .xlsx, xlsm, .ppt and .pptx) online and local files

d9410e9

Enhance ContentExtractor with multi-format content extraction support

e4648ef

Add comprehensive test suite for content extraction modules

0ff31d2

Update dev-requirements with additional document and data processing libraries

6a93327

gsantoshkumar1999 added 2 commits February 2, 2025 19:09

Merge branch 'souzatharsis:main' into multidoc-support

153b5b1

Refactor content extraction with unified extractor and markitdown library

6e3d2ec

gsantoshkumar1999 wants to merge 8 commits into souzatharsis:main from gsantoshkumar1999:multidoc-support
