Commit 1d39990

docs: update README with detailed features and architecture overview; add sample image
1 parent 12a04bc commit 1d39990

2 files changed: +42 -6 lines changed

README.md

Lines changed: 42 additions & 6 deletions
@@ -2,20 +2,35 @@
 
 A Python library for detecting, extracting, and processing tables from PDF documents.
 
+![Pdf2Table Sample](docs/sample.png)
+
 ## Overview
 
-This project provides a robust solution for extracting tabular data from PDF documents. The library utilizes advanced computer vision models to detect and recognize table structures, making it easy to convert PDF tables into structured data formats.
+This project provides a robust solution for extracting tabular data from PDF documents using state-of-the-art computer vision models.
+
+The extraction pipeline implements a **4-step methodology**:
+1. **PDF Rendering** - Convert pages to high-resolution images with text extraction
+2. **Table Detection** - Identify table regions using transformer models
+3. **Structure Recognition** - Detect cells, rows, columns, and headers
+4. **Grid Construction** - Build structured grids with intelligent text extraction (direct PDF + OCR fallback)
 
 ## Technologies Used
 
-- **Table Transformer**: For recognizing table structures in PDF documents.
-- **PyMuPDF**: For reading PDF files.
+- **Table Transformer Models** (Microsoft Research): Detection and structure recognition
+- **PyMuPDF (fitz)**: PDF processing and rendering with direct text extraction
+- **TrOCR** (Microsoft): Optional OCR fallback for scanned documents
+- **Transformers** (Hugging Face): Model inference pipeline
+- **Pydantic**: Entity validation and data modeling
 
 ## Features
 
-- PDF processing with page-by-page table detection
-- Table structure recognition using Table Transformer
-- Clean architecture with separation of concerns
+- **Advanced Table Detection**: High-confidence table region identification using transformer models
+- **Structure Recognition**: Automatic detection of cells, merged cells, headers, rows, and columns
+- **Intelligent Text Extraction**: Two-stage approach (direct PDF text + OCR fallback)
+- **Flexible Configuration**: Customizable thresholds, DPI, and processing options
+- **Clean Architecture**: Well-organized codebase following software engineering best practices
+- **Comprehensive Logging**: Detailed logs for debugging and monitoring
+- **Multiple Output Formats**: JSON export with metadata and structured data
 
 ## Installation

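Reviewer note on the Overview hunk above: step 4 (grid construction with direct PDF text plus OCR fallback) is the least obvious of the four steps, so a minimal sketch may help readers. Everything here is an assumption made for illustration: the `Word` type, `build_grid`, and the `ocr` callback are hypothetical names and do not come from the library. The sketch intersects row and column boxes into cells, assigns text-layer words to cells by their center point, and calls the OCR callback only for cells that received no text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# (x0, y0, x1, y1) in page pixels; hypothetical convention for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    """A word and its bounding box, as a PDF text layer might provide it."""
    text: str
    box: Box


def intersect(row: Box, col: Box) -> Box:
    """A cell is the overlap of one row band and one column band."""
    return (max(row[0], col[0]), max(row[1], col[1]),
            min(row[2], col[2]), min(row[3], col[3]))


def center_inside(word: Word, cell: Box) -> bool:
    """True when the word's center point falls inside the cell."""
    cx = (word.box[0] + word.box[2]) / 2
    cy = (word.box[1] + word.box[3]) / 2
    return cell[0] <= cx < cell[2] and cell[1] <= cy < cell[3]


def build_grid(rows: List[Box], cols: List[Box], words: List[Word],
               ocr: Optional[Callable[[Box], str]] = None) -> List[List[str]]:
    """Build a row-major text grid; fall back to `ocr` for empty cells."""
    grid: List[List[str]] = []
    for row in sorted(rows, key=lambda b: b[1]):        # top to bottom
        line: List[str] = []
        for col in sorted(cols, key=lambda b: b[0]):    # left to right
            cell = intersect(row, col)
            hits = [w for w in words if center_inside(w, cell)]
            hits.sort(key=lambda w: (w.box[1], w.box[0]))
            text = " ".join(w.text for w in hits)
            if not text and ocr is not None:
                text = ocr(cell)                        # stage two: OCR fallback
            line.append(text)
        grid.append(line)
    return grid
```

Passing `ocr=None` corresponds to the direct-text-only mode; supplying a callback corresponds to enabling the fallback, so cells that the text layer cannot fill are the only ones that pay the OCR cost.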
@@ -69,9 +84,20 @@ The `create_pipeline()` method accepts the following parameters:
 
 - `device` (str): Device for ML models - "cpu" or "cuda" (default: "cpu")
 - `detection_threshold` (float): Confidence threshold for table detection (default: 0.9)
+  - High (0.9+): Fewer false positives, may miss some tables
+  - Medium (0.7-0.9): Balanced detection
+  - Low (<0.7): More tables detected, more false positives
 - `structure_threshold` (float): Confidence threshold for structure recognition (default: 0.6)
+  - High (0.8+): Clean structure, may miss some cells
+  - Medium (0.5-0.8): Balanced recognition (recommended)
+  - Low (<0.5): Detects more cells, more noise
 - `pdf_dpi` (int): DPI for PDF page rendering (default: 300)
+  - 150: Faster processing, lower quality
+  - 300: Balanced (recommended)
+  - 600+: Better quality, slower processing
 - `load_ocr` (bool): Whether to load OCR service (default: False)
+  - False: Direct PDF text extraction only (faster, recommended for native PDFs)
+  - True: Enable TrOCR fallback (essential for scanned documents)
 - `visualize` (bool): Whether to enable visualization (default: False)
 - `visualization_save_dir` (str): Directory to save visualizations (default: "data/table_visualizations")

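Reviewer note on the parameter hunk above: the trade-offs described for the thresholds and `pdf_dpi` can be made concrete with a small sketch. This is not the library's code. `PipelineSettings`, `keep_confident`, and `rendered_size` are hypothetical names, and only the parameter names and defaults are taken from the README text. The one outside fact relied on is that PDF user space defaults to 72 points per inch, so rendering at a given DPI scales page dimensions by `dpi / 72`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container mirroring the documented parameters and defaults."""
    device: str = "cpu"
    detection_threshold: float = 0.9
    structure_threshold: float = 0.6
    pdf_dpi: int = 300
    load_ocr: bool = False

    def __post_init__(self) -> None:
        # Confidence thresholds are probabilities, so they must lie in [0, 1].
        for name in ("detection_threshold", "structure_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.pdf_dpi <= 0:
            raise ValueError(f"pdf_dpi must be positive, got {self.pdf_dpi}")


def keep_confident(detections: List[Tuple[str, float]],
                   threshold: float) -> List[Tuple[str, float]]:
    """Keep (label, score) pairs whose score meets the threshold."""
    return [d for d in detections if d[1] >= threshold]


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel size of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    scale = dpi / 72.0
    return round(width_pt * scale), round(height_pt * scale)
```

With three candidate detections scored 0.95, 0.8, and 0.6, a threshold of 0.9 keeps one and a threshold of 0.7 keeps two, which is the "fewer false positives versus missed tables" trade-off in miniature. Doubling the DPI roughly quadruples the pixel count per page, which is why higher DPI values are slower.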
@@ -85,6 +111,16 @@ The project includes comprehensive logging capabilities for debugging and monito
 
 **Documentation**: See `docs/logging_guide.md` for detailed logging documentation.
 
+## 🏗️ Architecture
+
+The project follows **Clean Architecture** with three main layers:
+
+- **Entities Layer**: Core business objects (PageImage, DetectedTable, TableGrid, GridCell)
+- **Use Cases Layer**: Business logic orchestration (TableExtractionUseCase, TableGridBuilder)
+- **Frameworks Layer**: External tools (PyMuPDF, Table Transformer, TrOCR)
+
+For detailed technical documentation, see `docs/technical_report/technical_report.md`.
+
 ## 🎯 Use Cases
 
 ### Document Processing Pipelines

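Reviewer note on the Architecture section added in the last README hunk: the three-layer split could be illustrated with a short sketch. The class names `PageImage`, `DetectedTable`, and `TableExtractionUseCase` appear in the README, but every field, method, and signature below is an assumption, as are the `TableDetector` protocol and `FakeDetector`. The point shown is the dependency direction: the use case depends only on an interface, and a frameworks-layer adapter (here a stand-in with canned results) implements it.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); assumed convention


# --- Entities layer: plain data, no framework imports (fields are assumed) ---
@dataclass(frozen=True)
class PageImage:
    page_number: int
    width: int
    height: int


@dataclass(frozen=True)
class DetectedTable:
    page_number: int
    box: Box
    confidence: float


# --- Port the use case depends on (hypothetical interface) ---
class TableDetector(Protocol):
    def detect(self, page: PageImage) -> List[DetectedTable]: ...


# --- Use cases layer: orchestration only, no model or PDF code ---
class TableExtractionUseCase:
    def __init__(self, detector: TableDetector, threshold: float = 0.9) -> None:
        self._detector = detector
        self._threshold = threshold

    def run(self, pages: List[PageImage]) -> List[DetectedTable]:
        """Detect tables on every page and keep the confident ones."""
        tables: List[DetectedTable] = []
        for page in pages:
            tables.extend(t for t in self._detector.detect(page)
                          if t.confidence >= self._threshold)
        return tables


# --- Frameworks layer stand-in: a real adapter would wrap a detection model ---
class FakeDetector:
    def detect(self, page: PageImage) -> List[DetectedTable]:
        half = page.height / 2
        return [
            DetectedTable(page.page_number, (0, 0, page.width, half), 0.97),
            DetectedTable(page.page_number, (0, half, page.width, page.height), 0.42),
        ]
```

Because the use case only sees the `TableDetector` interface, the orchestration logic can be unit-tested with `FakeDetector` and no model weights, which is the practical benefit the layering is meant to provide.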
docs/sample.png

427 KB