@@ -6,16 +6,14 @@ pdf2table/
 ├── entities/
 │   └── table_entities.py
 ├── usecases/
-│   ├── dtos.py
 │   ├── services/
 │   │   └── table_services.py
+│   ├── interfaces/
 │   └── table_extraction_use_case.py
-├── adaptors/
-│   └── table_extraction_ports.py
 ├── frameworks/
 │   ├── ocr_service.py
 │   ├── pdf_image_extractor.py
-│   ├── table_extraction_factory.py
+│   ├── pipeline.py
 │   ├── table_structure_recognizer.py
 │   └── table_transformer_detector.py
 ```
@@ -35,57 +33,44 @@ pdf2table/
 - **table_extraction_use_case.py**: Application business logic
   - `TableExtractionUseCase`: Orchestrates table extraction workflow
     - `extract_tables(pdf_path, page_number=None)`: Main extraction method
+    - Returns list of `DetectedTable` objects
   - `TableGridBuilder`: Builds structured grids from detected cells
     - Contains the core algorithms for grouping rows/columns and building grids
 - **services/table_services.py**: Supporting services for use cases
   - `TableValidationService`: Validates detected table structures and cells
   - `CoordinateClusteringService`: Clusters coordinates for row/column grouping
-- **dtos.py**: Data transfer objects for use cases
-  - `TableExtractionResponse`: Response DTO for table extraction
+- **interfaces/**: Port interfaces for dependency inversion
 
-### 3. Interface Adapters Layer (`pdf2table/adaptors/`)
-- **table_extraction_adaptor.py**: Adapter for table extraction
-  - `TableExtractionAdapter`: Coordinates between use cases and external interfaces
-    - `extract_tables(pdf_path, page_number=None)`: Main adapter method
-    - Accepts `pdf_path` and optional `page_number`
-    - Returns `TableExtractionResponse`
-
-### 4. Frameworks & Drivers Layer (`pdf2table/frameworks/`)
+### 3. Frameworks & Drivers Layer (`pdf2table/frameworks/`)
 - **pdf_image_extractor.py**: PyMuPDF implementation
 - **table_transformer_detector.py**: Table detection using Transformer models
 - **table_structure_recognizer.py**: Structure recognition using Transformer models
 - **ocr_service.py**: TrOCR text extraction
-- **table_extraction_factory.py**: Dependency injection and configuration
-
-
-### Usage (Simple)
-```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionService
-
-service = TableExtractionService(device="cpu")
+- **pipeline.py**: Factory for creating configured pipelines
 
-# Extract from a specific page
-result = service.extract_tables_from_page(pdf_path, page_number=0)
-tables = result["tables"]
+## Usage
 
-# Or extract from all pages
-all_results = service.extract_tables_from_pdf(pdf_path)
-```
-
-### Usage (Advanced)
 ```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionFactory
+from pdf2table.frameworks.pipeline import create_pipeline
 
-# Create with custom configuration
-adapter = TableExtractionFactory.create_table_extraction_adapter(
-    device="cuda",
-    detection_threshold=0.95,
-    structure_threshold=0.7
+# Create the extraction pipeline
+use_case = create_pipeline(
+    device="cpu",
+    detection_threshold=0.9,
+    structure_threshold=0.6,
+    pdf_dpi=300,
+    load_ocr=False,
+    visualize=False
 )
 
 # Extract from a specific page
-response = adapter.extract_tables(pdf_path, page_number=0)
+tables = use_case.extract_tables(pdf_path, page_number=0)
 
 # Or extract from all pages
-response = adapter.extract_tables(pdf_path)
+all_tables = use_case.extract_tables(pdf_path)
+
+# Process the results
+for table in tables:
+    print(f"Found table with {table.grid.n_rows} rows and {table.grid.n_cols} columns")
+    table_dict = table.to_dict()
 ```
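
Migration note: this change removes `TableExtractionService.extract_tables_from_page`, whose result was read as `result["tables"]`, and replaces it with a use case whose `extract_tables` returns the tables directly. Callers that cannot switch at once might bridge the gap with a thin wrapper. The sketch below is not part of this change: `LegacyServiceShim` and `SupportsExtractTables` are hypothetical names, and it assumes old callers only read the `"tables"` key and can accept the `to_dict()` form of each table.

```python
from typing import Any, Dict, List, Optional, Protocol


class SupportsExtractTables(Protocol):
    """Hypothetical structural type: anything with the new use-case method."""

    def extract_tables(
        self, pdf_path: str, page_number: Optional[int] = None
    ) -> List[Any]:
        ...


class LegacyServiceShim:
    """Hypothetical wrapper exposing the removed method name on the new use case."""

    def __init__(self, use_case: SupportsExtractTables) -> None:
        self._use_case = use_case

    def extract_tables_from_page(self, pdf_path: str, page_number: int) -> Dict[str, Any]:
        # Delegate to the new entry point shown in the README diff.
        tables = self._use_case.extract_tables(pdf_path, page_number=page_number)
        # Assumption: the old result only needs a "tables" key holding
        # dictionary representations of the detected tables.
        return {"tables": [table.to_dict() for table in tables]}
```

With the pipeline from the updated README, this might be used as `LegacyServiceShim(create_pipeline(device="cpu")).extract_tables_from_page(pdf_path, page_number=0)`. The all-pages method `extract_tables_from_pdf` is left out because the shape of its old return value is not visible in this diff.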