Commit 919c203

docs: enhance high-level usage examples in README
1 parent 90c13aa commit 919c203

1 file changed: +32 -10 lines changed

README.md

Lines changed: 32 additions & 10 deletions
@@ -1,34 +1,29 @@
 # Pdf2Table
 
-A RAG (Retrieval-Augmented Generation) application for detecting, extracting, and indexing tables from PDF documents and finally inferring on them.
+A Python library for detecting, extracting, and processing tables from PDF documents.
 
 ## Overview
 
-This project aims to provide a robust solution for extracting tabular data from PDF documents and indexing it for efficient retrieval. The application utilizes various technologies, including FastAPI for the web framework, Elasticsearch for indexing and searching, and LangChain for text chunking and processing.
+This project provides a robust solution for extracting tabular data from PDF documents. The library utilizes advanced computer vision models to detect and recognize table structures, making it easy to convert PDF tables into structured data formats.
 
 ## Technologies Used
 
-- **FastAPI**: For building the web application.
-- **Elasticsearch**: For storing and retrieving indexed data.
-- **LangChain**: For text chunking and processing.
 - **Table Transformer**: For recognizing table structures in PDF documents.
 - **PyMuPDF**: For reading PDF files.
 
 ## Features
 
 - PDF processing with page-by-page table detection
 - Table structure recognition using Table Transformer
-- Text chunking with LangChain's character splitter
-- Elasticsearch indexing for structured retrieval
 - Clean architecture with separation of concerns
 
 ## Project Structure
 
 - `pdf2table/`: Main package
-- `adaptors/`: Interface with external systems (Elasticsearch, PDF reader, Table Transformer)
+- `adaptors/`: Interface with external systems(PDF reader, Table Transformer)
 - `entities/`: Domain models
 - `usecases/`: Application logic
-- `frameworks/`: UI and infrastructure (FastAPI)
+- `frameworks/`: UI and infrastructure
 - `tests/`: Unit tests
 - `adaptors/`: Tests for adaptors
 - `samples/`: Sample PDFs for testing
@@ -63,6 +58,33 @@ for table in response.tables:
     print(table_data)
 ```
 
+### High-Level Usage
+
+For simpler integration, use the high-level `TableExtractionService`:
+
+```python
+from pdf2table.frameworks.table_extraction_factory import TableExtractionService
+
+# Initialize the service
+service = TableExtractionService(device="cpu")
+
+# Extract tables from a single page
+page_result = service.extract_tables_from_page("document.pdf", page_number=0)
+print(f"Found {len(page_result['tables'])} tables on page 0")
+
+# Extract tables from entire PDF
+all_results = service.extract_tables_from_pdf("document.pdf")
+for page_idx, page_result in enumerate(all_results):
+    if page_result.get('success', True):
+        tables = page_result.get('tables', [])
+        print(f"Page {page_idx}: Found {len(tables)} tables")
+
+        # Process each table
+        for table_idx, table in enumerate(tables):
+            print(f" Table {table_idx + 1}: {table['rows']} rows x {table['columns']} columns")
+    else:
+        print(f"Page {page_idx}: Error - {page_result.get('error', 'Unknown error')}")
+```
 
 ## 🎯 Use Cases
 
@@ -73,10 +95,10 @@ for table in response.tables:
 - **Government Documents**: Process regulatory filings and public documents
 
 ### Integration Scenarios
-- **RAG Systems**: Index extracted tables for question-answering systems
 - **Data Analytics**: Feed extracted data into analytical workflows
 - **Document Management**: Enhance document search with structured table data
 - **Compliance**: Automated extraction for regulatory compliance reporting
+- **Business Intelligence**: Convert PDF reports into structured datasets for analysis
 
 ## 🤝 Contributing
 
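
Review note: the High-Level Usage example added in this commit reads each page result as a dict with `success`, `tables`, and `error` keys, and each table as a dict with `rows` and `columns`. Assuming that shape holds, the per-page loop could be condensed into a small summary step. The sketch below is only illustrative: `summarize_results` is a hypothetical helper, not part of pdf2table, and it runs on stand-in data instead of calling `TableExtractionService`.

```python
from typing import Any


def summarize_results(all_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: tally tables and failed pages from per-page results.

    Assumes each page result is a dict shaped like the README example:
    optional 'success' (defaults to True), 'tables' (list), and 'error' (str).
    """
    summary: dict[str, Any] = {
        "pages": len(all_results),
        "tables": 0,
        "failed_pages": [],
    }
    for page_idx, page_result in enumerate(all_results):
        if page_result.get("success", True):
            # Count every table reported for a successful page.
            summary["tables"] += len(page_result.get("tables", []))
        else:
            # Record the page index together with its error message.
            summary["failed_pages"].append(
                (page_idx, page_result.get("error", "Unknown error"))
            )
    return summary


# Stand-in data shaped like the README example (not real library output).
sample_results = [
    {"success": True, "tables": [{"rows": 3, "columns": 2}]},
    {"success": False, "error": "corrupt page"},
    {"tables": []},
]
print(summarize_results(sample_results))
```

In real use, the list returned by `service.extract_tables_from_pdf("document.pdf")` would take the place of `sample_results`, provided the return shape matches what the README example implies.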