-
Notifications
You must be signed in to change notification settings - Fork 122
feat: add PPTX document parser #693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Trivy scanning results. Report Summary ┌───────────────────┬──────┬─────────────────┬─────────┐
For OSS Maintainers: VEX NoticeIf you're an OSS maintainer and Trivy has detected vulnerabilities in your project that you believe are not actually exploitable, consider issuing a VEX (Vulnerability Exploitability eXchange) statement. To disable this notice, set the TRIVY_DISABLE_VEX_NOTICE environment variable. uv.lock (uv)Total: 21 (MEDIUM: 15, HIGH: 4, CRITICAL: 2) ┌──────────────┬────────────────┬──────────┬────────┬───────────────────┬───────────────┬──────────────────────────────────────────────────────────────┐ |
5538040
to
db36d39
Compare
Code Coverage Summary
Diff against main
Results for commit: a1ed241 Minimum allowed coverage is ♻️ This comment has been updated with latest results |
16cd131
to
948f935
Compare
- Reformatted authors and dependencies in pyproject.toml for consistency. - Added PptxDocumentParser to the list of exported components in the ingestion parsers. - Updated the router to use PptxDocumentParser for PPTX document types.
e0e7537
to
2596a48
Compare
…i/ragbits into 687-feat-pptx-parser
- Introduced a new script for creating dummy PPTX files to facilitate testing of the PPTX parser. - Updated the PptxDocumentParser to utilize DocumentMeta for improved document handling. - Refactored extractor classes to enhance clarity and maintainability, including renaming to follow a consistent naming convention. - Improved extraction methods for text, hyperlinks, images, shapes, metadata, and speaker notes.
- Updated the test script to cast shapes to the correct type before accessing text frames, ensuring proper functionality of the PPTX parser during testing.
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds support for parsing PPTX files by introducing a new parser and modular extractors, updating routing and dependencies.
- Introduce
PptxDocumentParser
and multiple extractors for text, images, hyperlinks, shapes, speaker notes, and metadata. - Update the router to use
PptxDocumentParser
forDocumentType.PPTX
. - Add the
python-pptx
dependency and clean up formatting across configuration and exports.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/router.py | Route DocumentType.PPTX to PptxDocumentParser |
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/parser.py | New asynchronous PPTX parser implementation |
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py | Add modular extractors for various PPTX content types |
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/init.py | Export extractor classes |
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/init.py | Export PptxDocumentParser in the parser module |
packages/ragbits-document-search/pyproject.toml | Include python-pptx dependency and format cleanup |
packages/ragbits-document-search/CHANGELOG.md | Document the new PPTX parser feature |
examples/document-search/test_pptx_parser.py | Temporary example script for manual PPTX parser testing |
Comments suppressed due to low confidence (2)
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/parser.py:37
- Add unit tests for
PptxDocumentParser.parse
and associated extractors in the test suite to ensure reliable parsing behavior.
async def parse(self, document: Document) -> list[Element]:
packages/ragbits-document-search/CHANGELOG.md:232
- Remove the stray
-
under### Changed
to clean up the changelog formatting.
- Add LiteLLM Reranker (#109).
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/parser.py
Show resolved
Hide resolved
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/parser.py
Outdated
Show resolved
Hide resolved
...-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a few things, overall concept is nice.
I would also suggestion using cursor or Claude 4 github copilot to write nice docstrings and handle formatting.
...-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py
Outdated
Show resolved
Hide resolved
…i/ragbits into 687-feat-pptx-parser
- Added custom exceptions for PPTX parsing errors, including PptxExtractionError, PptxExtractorError, PptxParserError, PptxPresentationError, and PptxSlideProcessingError. - Enhanced the PptxDocumentParser to raise appropriate exceptions during parsing failures and log detailed error messages. - Improved logging throughout the extraction process to track successful and failed extractions, including shape processing and metadata extraction. - Updated extractor classes to handle errors gracefully and provide informative logs for debugging.
- Deleted the test script for creating dummy PPTX files, which was used for development and PR testing purposes. This script will not be included in the final merge.
…ithub.com/deepsense-ai/ragbits into extension/source/googledrive/impersonator
4a27750
to
ea41ed3
Compare
merge #733 first |
…gleDriveSource - Introduced GoogleDriveExportFormat enum to standardize export MIME types. - Updated _GOOGLE_EXPORT_MIME_MAP and _EXPORT_EXTENSION_MAP to use the new enum. - Modified fetch method to accept an optional export_format parameter for overriding MIME types. - Enhanced _determine_file_extension method to support MIME type overrides. - Added unit test for verifying correct file extension determination with overridden MIME types.
…se-ai/ragbits into 687-feat-pptx-parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me.
…i/ragbits into 687-feat-pptx-parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice job on the override feature.
…ation - Removed unnecessary logging and commented code in the PptxDocumentParser and BasePptxExtractor classes. - Consolidated element creation methods for text and images, improving clarity and maintainability. - Updated extraction methods to handle elements more generically, allowing for better extensibility.
…i/ragbits into 687-feat-pptx-parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One big problem I see is that we create Element
objects with extractors, and if I understand correctly these Elements
are ment to be transformed with an Embedder as they has to_vector_db_entry
. But in this case objects you extract should first go through grouping and chunking process. If we take every even the smallest object from pptx and embed it we will end up with ton of useless chunks and our RAG system will fail.
I tested unstrutured and docling for this task and unstructured partition_pptx
didn't extract images from my slides. However, docling did great job.
from docling.document_converter import DocumentConverter
from docling_core.types.doc.document import PictureItem
converter = DocumentConverter()
doc = converter.convert(source="../testing_scripts/Slides.pptx").document
for item, idx in doc.iterate_items(with_groups=True):
if not isinstance(item, PictureItem):
print(item)
for idx, i in enumerate(doc.pictures):
img = i.image.pil_image
img.save(f"Slides_{idx}.png")
it loooks like it didn't extract presenter's notes, and some other stuff that we did.
But after we have it extracted with this form, we have docling grouping, serialization, chunking etc for free.
Also we can add more elements to the Docling's Document class
So what, I'd do:
- Utilise everything we can from docling directly - less code, less problems
- Then expand docling document with our custom implementation, our implementations should return docling compliant formats. What I'd personally do is implement callback system in the Docling parser. So you can add callbacks that will be run as we are parsing a pptx file. This callback takes something, and docling document, extract value from something and adds it to docling document
- Then I believe we can utilise Chunker from docling, or create a custom one.
But please discuss this with @mhordynski
...-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py
Outdated
Show resolved
Hide resolved
...-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py
Outdated
Show resolved
Hide resolved
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/__init__.py
Outdated
Show resolved
Hide resolved
packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/parser.py
Show resolved
Hide resolved
…o 687-feat-pptx-parser
…or names and remove unused imports
- Added new callback classes for extracting hyperlinks, metadata, and speaker notes from PPTX files. - Refactored the PPTX document parser to utilize these callbacks for enhancing the docling document. - Removed unused extractor classes and related imports to streamline the codebase. - Introduced a default callback list for easier configuration.
- Introduced a new integration test for the PPTX document parser to validate the extraction of hyperlinks, metadata, and speaker notes from a sample presentation. - Added a sample PPTX presentation asset for testing purposes.
Co-authored-by: Copilot <[email protected]> Co-authored-by: pocucan-ds <[email protected]> Co-authored-by: Mateusz Hordyński <[email protected]>
Uh oh!
There was an error while loading. Please reload this page.