README.md
42 additions & 6 deletions
@@ -2,20 +2,35 @@

 A Python library for detecting, extracting, and processing tables from PDF documents.

+
+
 ## Overview

-This project provides a robust solution for extracting tabular data from PDF documents. The library utilizes advanced computer vision models to detect and recognize table structures, making it easy to convert PDF tables into structured data formats.
+This project provides a robust solution for extracting tabular data from PDF documents using state-of-the-art computer vision models.
+
+The extraction pipeline implements a **4-step methodology**:
+1. **PDF Rendering** - Convert pages to high-resolution images with text extraction
+2. **Table Detection** - Identify table regions using transformer models
+3. **Structure Recognition** - Detect cells, rows, columns, and headers
+4. **Grid Construction** - Build structured grids with intelligent text extraction (direct PDF + OCR fallback)
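
As a rough illustration of step 4, the sketch below builds a row-major grid from recognized cells and calls OCR only when a cell has no directly extracted text. It is not the project's implementation: the `Cell` tuple layout, `build_grid`, and the `fake_ocr` stand-in are hypothetical names chosen for this example, and a real fallback would call an OCR model such as TrOCR.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical cell record: (row index, column index, text read directly from the PDF).
# None or an empty string means the PDF text layer had nothing for that cell.
Cell = Tuple[int, int, Optional[str]]


def build_grid(cells: List[Cell], ocr_fallback: Callable[[int, int], str]) -> List[List[str]]:
    """Place each cell's text in a row-major grid, using OCR only when direct text is empty."""
    if not cells:
        return []
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, direct_text in cells:
        text = (direct_text or "").strip()
        if not text:
            # Second stage: fall back to OCR for this cell only.
            text = ocr_fallback(row, col).strip()
        grid[row][col] = text
    return grid


def fake_ocr(row: int, col: int) -> str:
    """Stand-in for a real OCR call; returns a marker so the fallback is visible."""
    return f"ocr({row},{col})"


cells = [(0, 0, "Item"), (0, 1, "Price"), (1, 0, "Widget"), (1, 1, None)]
grid = build_grid(cells, fake_ocr)
print(grid)  # [['Item', 'Price'], ['Widget', 'ocr(1,1)']]
```

Keeping the fallback as an injected callable means the OCR model only needs to be loaded when it is actually required.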

 ## Technologies Used

-- **Table Transformer**: For recognizing table structures in PDF documents.
-- **PyMuPDF**: For reading PDF files.
+- **Table Transformer Models** (Microsoft Research): Detection and structure recognition
+- **PyMuPDF (fitz)**: PDF processing and rendering with direct text extraction
+- **TrOCR** (Microsoft): Optional OCR fallback for scanned documents
+- **Transformers** (Hugging Face): Model inference pipeline
+- **Pydantic**: Entity validation and data modeling

 ## Features

-- PDF processing with page-by-page table detection
-- Table structure recognition using Table Transformer
-- Clean architecture with separation of concerns
+- **Advanced Table Detection**: High-confidence table region identification using transformer models
+- **Structure Recognition**: Automatic detection of cells, merged cells, headers, rows, and columns
+- **Intelligent Text Extraction**: Two-stage approach (direct PDF text + OCR fallback)
+- **Flexible Configuration**: Customizable thresholds, DPI, and processing options
+- **Clean Architecture**: Well-organized codebase following software engineering best practices
+- **Comprehensive Logging**: Detailed logs for debugging and monitoring
+- **Multiple Output Formats**: JSON export with metadata and structured data

 ## Installation

@@ -69,9 +84,20 @@ The `create_pipeline()` method accepts the following parameters:

 - `device` (str): Device for ML models - "cpu" or "cuda" (default: "cpu")
 - `detection_threshold` (float): Confidence threshold for table detection (default: 0.9)
+  - High (0.9+): Fewer false positives, may miss some tables
+  - Medium (0.7-0.9): Balanced detection
+  - Low (<0.7): More tables detected, more false positives
 - `structure_threshold` (float): Confidence threshold for structure recognition (default: 0.6)
+  - High (0.8+): Clean structure, may miss some cells
+  - Medium (0.5-0.8): Balanced recognition (recommended)
+  - Low (<0.5): Detects more cells, more noise
 - `pdf_dpi` (int): DPI for PDF page rendering (default: 300)
+  - 150: Faster processing, lower quality
+  - 300: Balanced (recommended)
+  - 600+: Better quality, slower processing
 - `load_ocr` (bool): Whether to load OCR service (default: False)
+  - False: Direct PDF text extraction only (faster, recommended for native PDFs)
+  - True: Enable TrOCR fallback (essential for scanned documents)
 - `visualize` (bool): Whether to enable visualization (default: False)
 - `visualization_save_dir` (str): Directory to save visualizations (default: "data/table_visualizations")

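
To make the threshold bands concrete, here is a minimal sketch that collects the documented parameters and defaults in one place and maps a threshold to its band. `PipelineSettings`, `detection_band`, and `structure_band` are hypothetical helpers invented for this example and are not part of the library. The sketch also assumes a value exactly on a boundary (for example 0.9) belongs to the higher band, which the list above leaves ambiguous.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container; field names and defaults mirror the parameter list above."""
    device: str = "cpu"
    detection_threshold: float = 0.9
    structure_threshold: float = 0.6
    pdf_dpi: int = 300
    load_ocr: bool = False
    visualize: bool = False
    visualization_save_dir: str = "data/table_visualizations"

    def __post_init__(self) -> None:
        if self.device not in ("cpu", "cuda"):
            raise ValueError(f"device must be 'cpu' or 'cuda', got {self.device!r}")
        for name in ("detection_threshold", "structure_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")
        if self.pdf_dpi <= 0:
            raise ValueError(f"pdf_dpi must be positive, got {self.pdf_dpi}")


def detection_band(threshold: float) -> str:
    """Band for detection_threshold; boundary values count as the higher band (assumption)."""
    if threshold >= 0.9:
        return "high"
    if threshold >= 0.7:
        return "medium"
    return "low"


def structure_band(threshold: float) -> str:
    """Band for structure_threshold; boundary values count as the higher band (assumption)."""
    if threshold >= 0.8:
        return "high"
    if threshold >= 0.5:
        return "medium"
    return "low"


settings = PipelineSettings(load_ocr=True)  # e.g. when processing scanned documents
kwargs = asdict(settings)  # plain dict keyed by the documented parameter names
print(detection_band(settings.detection_threshold), structure_band(settings.structure_threshold))
```

With the documented defaults, detection sits in the high band and structure recognition in the medium (recommended) band.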
@@ -85,6 +111,16 @@ The project includes comprehensive logging capabilities for debugging and monito

 **Documentation**: See `docs/logging_guide.md` for detailed logging documentation.

+## 🏗️ Architecture
+
+The project follows **Clean Architecture** with three main layers:
+
+- **Entities Layer**: Core business objects (PageImage, DetectedTable, TableGrid, GridCell)
+- **Use Cases Layer**: Business logic orchestration (TableExtractionUseCase, TableGridBuilder)
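
For orientation, the entities named above might look roughly like the following. This is a sketch using standard-library dataclasses rather than the project's Pydantic models; only the class names come from the README, and every field and validation rule here is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed bounding-box layout: (x0, y0, x1, y1) in image pixels.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class PageImage:
    page_number: int  # assumed 0-based page index
    width: int
    height: int
    dpi: int = 300


@dataclass(frozen=True)
class DetectedTable:
    page_number: int
    bbox: BBox
    confidence: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence must be within [0, 1], got {self.confidence}")
        x0, y0, x1, y1 = self.bbox
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"bbox must have positive area, got {self.bbox}")


@dataclass(frozen=True)
class GridCell:
    row: int
    col: int
    text: str = ""
    is_header: bool = False


@dataclass
class TableGrid:
    table: DetectedTable
    cells: List[GridCell] = field(default_factory=list)

    def header_texts(self) -> List[str]:
        """Return header cell texts ordered by column."""
        headers = [cell for cell in self.cells if cell.is_header]
        return [cell.text for cell in sorted(headers, key=lambda cell: cell.col)]


table = DetectedTable(page_number=0, bbox=(10.0, 20.0, 200.0, 120.0), confidence=0.95)
table_grid = TableGrid(
    table=table,
    cells=[
        GridCell(0, 1, "Price", is_header=True),
        GridCell(0, 0, "Item", is_header=True),
        GridCell(1, 0, "Widget"),
    ],
)
print(table_grid.header_texts())  # ['Item', 'Price']
```

Keeping entities free of model and I/O code is what lets the use-case layer orchestrate detection and grid building without the entities depending on it.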