README.md (32 additions, 10 deletions)
@@ -1,34 +1,29 @@
 # Pdf2Table
 
-A RAG (Retrieval-Augmented Generation) application for detecting, extracting, and indexing tables from PDF documents and finally inferring on them.
+A Python library for detecting, extracting, and processing tables from PDF documents.
 
 ## Overview
 
-This project aims to provide a robust solution for extracting tabular data from PDF documents and indexing it for efficient retrieval. The application utilizes various technologies, including FastAPI for the web framework, Elasticsearch for indexing and searching, and LangChain for text chunking and processing.
+This project provides a robust solution for extracting tabular data from PDF documents. The library utilizes advanced computer vision models to detect and recognize table structures, making it easy to convert PDF tables into structured data formats.
 
 ## Technologies Used
 
-- **FastAPI**: For building the web application.
-- **Elasticsearch**: For storing and retrieving indexed data.
-- **LangChain**: For text chunking and processing.
 - **Table Transformer**: For recognizing table structures in PDF documents.
 - **PyMuPDF**: For reading PDF files.
 
 ## Features
 
 - PDF processing with page-by-page table detection
 - Table structure recognition using Table Transformer
-- Text chunking with LangChain's character splitter
-- Elasticsearch indexing for structured retrieval
 - Clean architecture with separation of concerns
 
 ## Project Structure
 
 - `pdf2table/`: Main package
-- `adaptors/`: Interface with external systems (Elasticsearch, PDF reader, Table Transformer)
+- `adaptors/`: Interface with external systems(PDF reader, Table Transformer)
 - `entities/`: Domain models
 - `usecases/`: Application logic
-- `frameworks/`: UI and infrastructure (FastAPI)
+- `frameworks/`: UI and infrastructure
 - `tests/`: Unit tests
 - `adaptors/`: Tests for adaptors
 - `samples/`: Sample PDFs for testing
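
The layering named in the Project Structure list (`entities/`, `usecases/`, `adaptors/`) could look roughly like the sketch below. This is an illustration only, not the library's code: `Table`, `TableDetector`, `ExtractTablesUseCase`, and `PipeDelimitedDetector` are hypothetical names, and the pipe-delimited text detector merely stands in for the real PDF reader and Table Transformer adaptors.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


# entities/: plain domain model with no external dependencies (hypothetical).
@dataclass
class Table:
    page_number: int
    rows: list[list[str]] = field(default_factory=list)


# usecases/: application logic depends only on this abstract port (hypothetical).
class TableDetector(Protocol):
    def detect(self, page_number: int, page_text: str) -> list[Table]: ...


class ExtractTablesUseCase:
    def __init__(self, detector: TableDetector) -> None:
        self._detector = detector

    def execute(self, pages: list[str]) -> list[Table]:
        # Walk the document page by page and collect every detected table.
        tables: list[Table] = []
        for number, text in enumerate(pages, start=1):
            tables.extend(self._detector.detect(number, text))
        return tables


# adaptors/: stand-in for a real detector; it treats lines that start
# with "|" as table rows instead of running a vision model.
class PipeDelimitedDetector:
    def detect(self, page_number: int, page_text: str) -> list[Table]:
        rows = [
            [cell.strip() for cell in line.strip().strip("|").split("|")]
            for line in page_text.splitlines()
            if line.strip().startswith("|")
        ]
        return [Table(page_number, rows)] if rows else []
```

Because the use case only knows the `TableDetector` port, removing or swapping an adaptor (as this change does with Elasticsearch) would not require touching the application logic in such a layout.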
@@ -63,6 +58,33 @@ for table in response.tables:
 print(table_data)
 ```
 
+### High-Level Usage
+
+For simpler integration, use the high-level `TableExtractionService`:
+
+```python
+from pdf2table.frameworks.table_extraction_factory import TableExtractionService