
Commit 7a33115

refactor:
- removed the adaptor
- streamline the pipeline creation
1 parent e0965c5 commit 7a33115

8 files changed (+130, -260 lines)

README.md

Lines changed: 24 additions & 30 deletions
@@ -37,51 +37,45 @@ pip install -e .

 ### Usage
 ```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionFactory
-
-# Initialize the factory
-factory = TableExtractionFactory()
-adapter = factory.create_table_extraction_adapter()
+from pdf2table.frameworks.pipeline import create_pipeline
+
+# Create the extraction pipeline with configuration
+pipeline = create_pipeline(
+    device="cpu",
+    detection_threshold=0.9,
+    structure_threshold=0.6,
+    pdf_dpi=300,
+    load_ocr=False,
+    visualize=False
+)

 # Extract tables from a specific page
-response = adapter.extract_tables(pdf_path="document.pdf", page_number=0)
+tables = pipeline.extract_tables(pdf_path="document.pdf", page_number=0)

 # Or extract tables from all pages
-response = adapter.extract_tables(pdf_path="document.pdf")
+all_tables = pipeline.extract_tables(pdf_path="document.pdf")

 # Access extracted tables
-for table in response.tables:
+for table in tables:
     print(f"Table with {len(table.grid.cells)} cells")
-    print(f"Grid size: {table.grid.rows} x {table.grid.columns}")
+    print(f"Grid size: {table.grid.n_rows} x {table.grid.n_cols}")

     # Convert to structured format
     table_data = table.to_dict()
     print(table_data)
 ```

-### High-Level Usage
-
-For simpler integration, use the high-level `TableExtractionService`:
-
-```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionService
+### Configuration Options

-# Initialize the service
-service = TableExtractionService(device="cpu")
+The `create_pipeline()` method accepts the following parameters:

-# Extract tables from a single page
-page_result = service.extract_tables_from_page("document.pdf", page_number=0)
-print(f"Found {len(page_result['tables'])} tables on page 0")
-
-# Extract tables from entire PDF (all pages)
-all_results = service.extract_tables_from_pdf("document.pdf")
-tables = all_results.get('tables', [])
-print(f"Found {len(tables)} total tables across all pages")
-
-# Process each table
-for table_idx, table in enumerate(tables):
-    print(f" Table {table_idx + 1}: {table['metadata']}")
-```
+- `device` (str): Device for ML models - "cpu" or "cuda" (default: "cpu")
+- `detection_threshold` (float): Confidence threshold for table detection (default: 0.9)
+- `structure_threshold` (float): Confidence threshold for structure recognition (default: 0.6)
+- `pdf_dpi` (int): DPI for PDF page rendering (default: 300)
+- `load_ocr` (bool): Whether to load OCR service (default: False)
+- `visualize` (bool): Whether to enable visualization (default: False)
+- `visualization_save_dir` (str): Directory to save visualizations (default: "data/table_visualizations")

 ## 📋 Logging

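The configuration options listed in the README diff could be checked before any model is loaded. The following is a minimal sketch, assuming a hypothetical `validate_pipeline_config` helper that is not part of pdf2table. The bounds it enforces (thresholds in [0, 1], positive DPI, device limited to the two values the README names) are assumptions drawn from the parameter descriptions, not documented constraints.

```python
from typing import Any, Dict

# Hypothetical helper; pdf2table does not ship this function.
ALLOWED_DEVICES = {"cpu", "cuda"}  # the two values named in the README


def validate_pipeline_config(
    device: str = "cpu",
    detection_threshold: float = 0.9,
    structure_threshold: float = 0.6,
    pdf_dpi: int = 300,
) -> Dict[str, Any]:
    """Return the arguments as a dict, or raise ValueError if one looks wrong."""
    if device not in ALLOWED_DEVICES:
        raise ValueError(
            f"device must be one of {sorted(ALLOWED_DEVICES)}, got {device!r}"
        )
    for name, value in (
        ("detection_threshold", detection_threshold),
        ("structure_threshold", structure_threshold),
    ):
        # Assumption: confidence thresholds are probabilities in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if pdf_dpi <= 0:
        raise ValueError(f"pdf_dpi must be positive, got {pdf_dpi}")
    return {
        "device": device,
        "detection_threshold": detection_threshold,
        "structure_threshold": structure_threshold,
        "pdf_dpi": pdf_dpi,
    }
```

The returned dict could then be unpacked into the factory, for example `create_pipeline(**validate_pipeline_config(device="cuda"))`. Note that this sketch would reject device strings such as `"cuda:0"`, which the underlying models may well accept.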
docs/architecture_guide.md

Lines changed: 23 additions & 38 deletions
@@ -6,16 +6,14 @@ pdf2table/
 ├── entities/
 │   └── table_entities.py
 ├── usecases/
-│   ├── dtos.py
 │   ├── services/
 │   │   └── table_services.py
+│   ├── interfaces/
 │   └── table_extraction_use_case.py
-├── adaptors/
-│   └── table_extraction_ports.py
 ├── frameworks/
 │   ├── ocr_service.py
 │   ├── pdf_image_extractor.py
-│   ├── table_extraction_factory.py
+│   ├── pipeline.py
 │   ├── table_structure_recognizer.py
 │   └── table_transformer_detector.py
 ```
@@ -35,57 +33,44 @@ pdf2table/
 - **table_extraction_use_case.py**: Application business logic
   - `TableExtractionUseCase`: Orchestrates table extraction workflow
     - `extract_tables(pdf_path, page_number=None)`: Main extraction method
+      - Returns list of `DetectedTable` objects
   - `TableGridBuilder`: Builds structured grids from detected cells
     - Contains the core algorithms for grouping rows/columns and building grids
 - **services/table_services.py**: Supporting services for use cases
   - `TableValidationService`: Validates detected table structures and cells
   - `CoordinateClusteringService`: Clusters coordinates for row/column grouping
-- **dtos.py**: Data transfer objects for use cases
-  - `TableExtractionResponse`: Response DTO for table extraction
+- **interfaces/**: Port interfaces for dependency inversion

-### 3. Interface Adapters Layer (`pdf2table/adaptors/`)
-- **table_extraction_adaptor.py**: Adapter for table extraction
-  - `TableExtractionAdapter`: Coordinates between use cases and external interfaces
-    - `extract_tables(pdf_path, page_number=None)`: Main adapter method
-      - Accepts `pdf_path` and optional `page_number`
-      - Returns `TableExtractionResponse`
-
-### 4. Frameworks & Drivers Layer (`pdf2table/frameworks/`)
+### 3. Frameworks & Drivers Layer (`pdf2table/frameworks/`)
 - **pdf_image_extractor.py**: PyMuPDF implementation
 - **table_transformer_detector.py**: Table detection using Transformer models
 - **table_structure_recognizer.py**: Structure recognition using Transformer models
 - **ocr_service.py**: TrOCR text extraction
-- **table_extraction_factory.py**: Dependency injection and configuration
-
-
-### Usage (Simple)
-```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionService
-
-service = TableExtractionService(device="cpu")
+- **pipeline.py**: Factory for creating configured pipelines

-# Extract from a specific page
-result = service.extract_tables_from_page(pdf_path, page_number=0)
-tables = result["tables"]
+## Usage

-# Or extract from all pages
-all_results = service.extract_tables_from_pdf(pdf_path)
-```
-
-### Usage (Advanced)
 ```python
-from pdf2table.frameworks.table_extraction_factory import TableExtractionFactory
+from pdf2table.frameworks.pipeline import create_pipeline

-# Create with custom configuration
-adapter = TableExtractionFactory.create_table_extraction_adapter(
-    device="cuda",
-    detection_threshold=0.95,
-    structure_threshold=0.7
+# Create the extraction pipeline
+use_case = create_pipeline(
+    device="cpu",
+    detection_threshold=0.9,
+    structure_threshold=0.6,
+    pdf_dpi=300,
+    load_ocr=False,
+    visualize=False
 )

 # Extract from a specific page
-response = adapter.extract_tables(pdf_path, page_number=0)
+tables = use_case.extract_tables(pdf_path, page_number=0)

 # Or extract from all pages
-response = adapter.extract_tables(pdf_path)
+all_tables = use_case.extract_tables(pdf_path)
+
+# Process the results
+for table in tables:
+    print(f"Found table with {table.grid.n_rows} rows and {table.grid.n_cols} columns")
+    table_dict = table.to_dict()
 ```

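The architecture guide describes `interfaces/` as "port interfaces for dependency inversion" but the diff does not show their contents. As a rough illustration of the idea only, the sketch below uses `typing.Protocol`. Every name in it (`PageImage`, `PdfImageExtractorPort`, `extract_page_images`, `FakeExtractor`, `PageCountingUseCase`) is hypothetical and is not taken from pdf2table.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class PageImage:
    """Hypothetical stand-in for a rendered PDF page."""
    page_number: int


class PdfImageExtractorPort(Protocol):
    """Hypothetical port: the use case depends on this, not on PyMuPDF."""

    def extract_page_images(
        self, pdf_path: str, page_number: Optional[int] = None
    ) -> List[PageImage]:
        ...


class FakeExtractor:
    """Test double that satisfies the port without reading any file."""

    def __init__(self, n_pages: int) -> None:
        self.n_pages = n_pages

    def extract_page_images(
        self, pdf_path: str, page_number: Optional[int] = None
    ) -> List[PageImage]:
        if page_number is None:
            return [PageImage(i) for i in range(self.n_pages)]
        return [PageImage(page_number)]


class PageCountingUseCase:
    """Toy use case: it only knows the port, so any extractor can be injected."""

    def __init__(self, pdf_extractor: PdfImageExtractorPort) -> None:
        self.pdf_extractor = pdf_extractor

    def count_pages(self, pdf_path: str, page_number: Optional[int] = None) -> int:
        return len(self.pdf_extractor.extract_page_images(pdf_path, page_number))
```

With this shape, the frameworks layer supplies the concrete extractor while tests inject a fake, which appears to be the role `create_pipeline` plays when it wires components into `TableExtractionUseCase`.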
pdf2table/adaptors/__init__.py

Whitespace-only changes.

pdf2table/adaptors/table_extraction_adaptor.py

Lines changed: 0 additions & 33 deletions
This file was deleted.
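
Callers that relied on the removed adapter read results from `response.tables`, whereas the pipeline now returns the list directly. A minimal compatibility sketch follows, assuming hypothetical names `LegacyResponse` and `extract_tables_legacy`. The removed `TableExtractionResponse` may have carried more fields than the `tables` attribute visible in the README diff.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class LegacyResponse:
    """Hypothetical stand-in for the removed response object; exposes only `.tables`."""
    tables: List[Any] = field(default_factory=list)


def extract_tables_legacy(
    pipeline: Any, pdf_path: str, page_number: Optional[int] = None
) -> LegacyResponse:
    """Call the new list-returning API and wrap the result in the old shape."""
    tables = pipeline.extract_tables(pdf_path=pdf_path, page_number=page_number)
    return LegacyResponse(tables=list(tables))
```

Old call sites written as `for table in response.tables:` could keep working through such a wrapper while they are migrated to iterate over the returned list.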

pdf2table/frameworks/pipeline.py

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+from typing import Optional
+from pdf2table.usecases.table_extraction_use_case import TableExtractionUseCase
+from pdf2table.frameworks.pdf_image_extractor import PyMuPDFImageExtractor
+from pdf2table.frameworks.table_transformer_detector import TableTransformerDetector
+from pdf2table.frameworks.table_structure_recognizer import (
+    TableTransformerStructureRecognizer,
+)
+from pdf2table.frameworks.ocr_service import TrOCRService
+from pdf2table.frameworks.logging_config import get_logger
+
+
+logger = get_logger(__name__)
+
+
+def create_pipeline(
+    device: str = "cpu",
+    detection_threshold: float = 0.9,
+    structure_threshold: float = 0.6,
+    pdf_dpi: int = 300,
+    load_ocr: bool = False,
+    visualize: bool = False,
+    visualization_save_dir: str = "data/table_visualizations",
+) -> TableExtractionUseCase:
+    """
+    Create a fully configured table extraction pipeline.
+
+    Args:
+        device: Device to use for ML models ('cpu' or 'cuda')
+        detection_threshold: Confidence threshold for table detection
+        structure_threshold: Confidence threshold for structure recognition
+        pdf_dpi: DPI for PDF page rendering
+        load_ocr: Whether to load OCR service
+        visualize: Whether to enable visualization
+        visualization_save_dir: Directory to save visualizations
+
+    Returns:
+        TableExtractionUseCase: Configured use case ready for table extraction
+    """
+    logger.info(
+        f"Creating table extraction pipeline - Device: {device}, "
+        f"Detection threshold: {detection_threshold}, "
+        f"Structure threshold: {structure_threshold}, "
+        f"PDF DPI: {pdf_dpi}, OCR: {load_ocr}, Visualize: {visualize}"
+    )
+
+    logger.debug("Initializing PDF image extractor")
+    pdf_extractor = PyMuPDFImageExtractor(dpi=pdf_dpi)
+
+    logger.debug("Initializing table transformer detector")
+    table_detector = TableTransformerDetector(
+        device=device, confidence_threshold=detection_threshold
+    )
+
+    logger.debug("Initializing table structure recognizer")
+    structure_recognizer = TableTransformerStructureRecognizer(
+        device=device, confidence_threshold=structure_threshold
+    )
+
+    ocr_service: Optional[TrOCRService] = None
+    if load_ocr:
+        logger.debug("Initializing OCR service")
+        ocr_service = TrOCRService(device=device)
+    else:
+        logger.debug("OCR service disabled")
+
+    logger.debug("Creating table extraction use case")
+    use_case = TableExtractionUseCase(
+        pdf_extractor=pdf_extractor,
+        table_detector=table_detector,
+        structure_recognizer=structure_recognizer,
+        ocr_service=ocr_service,
+        visualize=visualize,
+        visualization_save_dir=visualization_save_dir,
+    )
+
+    logger.info("Table extraction pipeline created successfully")
+    return use_case

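`create_pipeline` constructs a new detector, structure recognizer, and (optionally) OCR service on every call. If that construction turns out to be expensive, which is an assumption and not something the diff states, callers might build the pipeline once per configuration and reuse it. The sketch below uses `functools.lru_cache`; `make_cached_factory` is a hypothetical helper and it forwards only two of the factory's parameters for brevity.

```python
from functools import lru_cache
from typing import Callable


def make_cached_factory(factory: Callable[..., object]) -> Callable[..., object]:
    """Wrap a pipeline factory so identical configurations share one instance.

    Hypothetical helper, not part of pdf2table. All arguments must be hashable.
    """

    @lru_cache(maxsize=None)
    def cached(device: str = "cpu", load_ocr: bool = False) -> object:
        # Reached only on a cache miss, so the factory runs once per distinct call.
        return factory(device=device, load_ocr=load_ocr)

    return cached
```

Usage might look like `get_pipeline = make_cached_factory(create_pipeline)` followed by `get_pipeline(device="cpu")`. One caveat: `lru_cache` keys on how the arguments are passed, so `get_pipeline("cpu")` and `get_pipeline(device="cpu")` are cached separately.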