Commit 4fb64e8

refactor(pptx): streamline PPTX document parsing and extraction process
- Removed unused parameters and methods from PptxDocumentParser to simplify the code.
- Updated the extract method to directly extract elements from each slide using the configured extractors.
- Adjusted the BaseExtractor class to ensure it returns a list of Element objects.
- Enhanced logging for better error handling during the extraction process.
1 parent 6cc1062 commit 4fb64e8

2 files changed: +9 -399 lines changed

File tree: packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx

packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/pptx/extractors/extractors.py

Lines changed: 3 additions & 1 deletion
@@ -9,6 +9,8 @@
 from pptx.presentation import Presentation
 from pptx.slide import Slide
 
+from ragbits.document_search.documents.element import Element
+
 from .dataclasses import (
     ExtractedHyperlink,
     ExtractedImage,
@@ -24,7 +26,7 @@ class BaseExtractor(ABC):
     """Base class for all PPTX content extractors."""
 
     @abstractmethod
-    def extract(self, presentation: Presentation, slide: Slide | None = None) -> list[Any]:
+    def extract(self, presentation: Presentation, slide: Slide | None = None) -> list[Element]:
         """Extract content from the presentation or specific slide."""
         pass
 
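To illustrate what the narrowed return type allows, here is a minimal sketch of a per-slide extraction loop over extractors that all return list[Element]. It is not the code from this commit: Element, Slide, and Presentation below are hypothetical stand-ins for the ragbits and python-pptx classes so the example runs on its own, and TitleExtractor and extract_all are names invented for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Hypothetical stand-ins so the sketch runs without python-pptx or ragbits.
@dataclass
class Element:
    text: str


@dataclass
class Slide:
    title: str


@dataclass
class Presentation:
    slides: list[Slide] = field(default_factory=list)


class BaseExtractor(ABC):
    """Mirrors the signature from the diff: extract() returns list[Element]."""

    @abstractmethod
    def extract(self, presentation: Presentation, slide: Slide | None = None) -> list[Element]:
        """Extract content from the presentation or a specific slide."""


class TitleExtractor(BaseExtractor):
    """Hypothetical extractor: one Element per non-empty slide title."""

    def extract(self, presentation: Presentation, slide: Slide | None = None) -> list[Element]:
        # Fall back to every slide when no specific slide is given.
        slides = [slide] if slide is not None else presentation.slides
        return [Element(text=s.title) for s in slides if s.title]


def extract_all(presentation: Presentation, extractors: list[BaseExtractor]) -> list[Element]:
    """Run every extractor on every slide and collect the results in slide order."""
    elements: list[Element] = []
    for slide in presentation.slides:
        for extractor in extractors:
            # Every extractor returns list[Element], so results can be
            # concatenated without per-extractor conversion.
            elements.extend(extractor.extract(presentation, slide))
    return elements
```

Because each extractor shares the same return type, the caller can merge results with a plain extend and a type checker can flag an extractor that returns something else.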
