This repository was archived by the owner on Dec 2, 2025. It is now read-only.

Commit 966bf3f

refactor(llm): make extraction pipeline generic for any document type
- Move receipt-specific schemas to schemas/receipt_schema.py
- Add dynamic schema support via output_cls parameter
- Remove hardcoded Receipt model from main pipeline
- Add optional transform_fn for custom data transformations
- Remove ground truth comparison code for simplicity
- Update dependencies (remove rapidfuzz, add pandas/pillow)
- Add invoice extraction example (example_invoice.py)
- Add comprehensive README with usage examples
1 parent 8f74aa2 commit 966bf3f

5 files changed: +325 -190 lines changed
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Generic Document Extraction Pipeline

A flexible, schema-driven pipeline for extracting structured data from any type of document or image using LlamaParse and OpenAI.

## Features

- **Dynamic Schema Support**: Use any Pydantic model to define your extraction schema
- **Optional Preprocessing**: Scale and optimize images before extraction
- **Flexible Transformations**: Apply custom transformation functions to extracted data
- **Extensible**: Easy to adapt for receipts, invoices, forms, IDs, or any document type

## Quick Start

### 1. Define Your Schema

Create a Pydantic model for your document type:

```python
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    vendor_name: str = Field(description="Vendor name")
    total_amount: float = Field(description="Total amount")
```
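
Before wiring a schema into the pipeline, it can help to check how the model behaves on its own. The following is a minimal sketch using only Pydantic; the sample values (`INV-001`, `Acme Corp`) are made up for illustration and do not come from the pipeline.

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    vendor_name: str = Field(description="Vendor name")
    total_amount: float = Field(description="Total amount")


# Hypothetical sample values; a numeric string is coerced to float
# by Pydantic's default (non-strict) validation.
invoice = Invoice(
    invoice_number="INV-001",
    vendor_name="Acme Corp",
    total_amount="1250.50",
)

# All three fields are required, so omitting one raises ValidationError.
try:
    Invoice(invoice_number="INV-002", vendor_name="Acme Corp")
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
```

Fields that may legitimately be absent from a document are better declared as optional with a default, otherwise validation fails for that document.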

### 2. Run Extraction

```python
from extract_receipts_pipeline import main

result_df = main(
    image_paths=["invoice1.pdf", "invoice2.pdf"],
    output_cls=Invoice,
    prompt="Extract invoice data from: {context_str}",
    id_column="invoice_id",
)
```

## Usage Examples

### Basic Extraction (No Ground Truth)

```python
from schemas.receipt_schema import Receipt
from extract_receipts_pipeline import main

result = main(
    image_paths=["receipt1.jpg"],
    output_cls=Receipt,
    prompt="Extract receipt data: {context_str}",
)
```

### With Preprocessing

```python
from pathlib import Path

from extract_receipts_pipeline import main
from schemas.receipt_schema import Receipt

result = main(
    image_paths=["low_res.jpg"],
    output_cls=Receipt,
    prompt="Extract data: {context_str}",
    preprocess=True,
    output_dir=Path("processed_images"),
    scale_factor=3,
)
```
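
The pipeline's preprocessing code is not part of this README, so the following is only a rough sketch of what scaling an image by `scale_factor` could look like with Pillow. The `upscale` helper and the choice of Lanczos resampling are assumptions for illustration, not the pipeline's actual implementation.

```python
from PIL import Image


def upscale(image: Image.Image, scale_factor: int = 3) -> Image.Image:
    """Return a copy of `image` enlarged by `scale_factor` in both dimensions.

    Hypothetical helper: the real pipeline may resize differently.
    """
    new_size = (image.width * scale_factor, image.height * scale_factor)
    return image.resize(new_size, Image.LANCZOS)


# Stand-in for a low-resolution scan: a blank 40x30 image.
sample = Image.new("RGB", (40, 30), "white")
scaled = upscale(sample, scale_factor=3)
```

Upscaling adds no information to the image, so it only helps when downstream text recognition is sensitive to small glyph sizes.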

### With Custom Transformations

```python
import pandas as pd

from extract_receipts_pipeline import main

def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    # Column names match the fields of the Invoice model from the Quick Start
    df = df.copy()
    df["vendor_name"] = df["vendor_name"].str.upper()
    df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
    return df

result = main(
    image_paths=["invoice.pdf"],
    output_cls=Invoice,  # defined in the Quick Start
    prompt="Extract: {context_str}",
    transform_fn=transform_data,
)
```
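
A transformation function can be tried on a hand-built DataFrame before it is passed to the pipeline. This sketch assumes the DataFrame columns are named after the `Invoice` fields (`vendor_name`, `total_amount`); the sample rows are invented for illustration.

```python
import pandas as pd


def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left unchanged.
    df = df.copy()
    df["vendor_name"] = df["vendor_name"].str.upper()
    # errors="coerce" turns unparseable values into NaN instead of raising.
    df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
    return df


# Hypothetical extraction output, including one unparseable amount.
sample = pd.DataFrame(
    {
        "vendor_name": ["acme corp", "globex"],
        "total_amount": ["1250.50", "n/a"],
    }
)
cleaned = transform_data(sample)
```

Rows whose amount could not be parsed end up as `NaN`, which makes them easy to filter or flag afterwards with `cleaned["total_amount"].isna()`.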

## Parameters

### Required
- `image_paths`: List of document/image paths
- `output_cls`: Pydantic model class for extraction
- `prompt`: Extraction prompt template (must include `{context_str}`)
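
Since the prompt must include `{context_str}`, a template can be checked before any documents are processed. This is a minimal sketch using only the standard library; `has_context_placeholder` is a hypothetical helper, and it assumes the template uses `str.format`-style replacement fields.

```python
from string import Formatter


def has_context_placeholder(prompt: str) -> bool:
    """Return True if `prompt` contains a `{context_str}` replacement field."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for segments that contain no replacement field.
    return any(
        field_name == "context_str"
        for _, field_name, _, _ in Formatter().parse(prompt)
    )


valid = has_context_placeholder("Extract invoice data from: {context_str}")
invalid = has_context_placeholder("Extract invoice data from the document")
```

An escaped placeholder such as `{{context_str}}` is treated as literal text and is therefore reported as missing.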

### Optional
- `id_column`: Document ID column name (default: "document_id")
- `fields`: Fields to extract (default: all model fields)
- `preprocess`: Enable image preprocessing (default: False)
- `output_dir`: Directory for preprocessed images
- `scale_factor`: Image scaling factor (default: 3)
- `transform_fn`: Custom transformation function
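
To see which names the "all model fields" default for `fields` would cover, the field names can be read from the Pydantic model itself. How the pipeline resolves this default internally is not shown here, so `default_fields` below is a hypothetical helper that only illustrates the idea.

```python
from typing import List, Type

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    vendor_name: str = Field(description="Vendor name")
    total_amount: float = Field(description="Total amount")


def default_fields(model_cls: Type[BaseModel]) -> List[str]:
    """List a model's field names in declaration order."""
    # Pydantic v2 exposes `model_fields`; Pydantic v1 exposes `__fields__`.
    field_map = getattr(model_cls, "model_fields", None) or model_cls.__fields__
    return list(field_map)


all_fields = default_fields(Invoice)
```

Passing an explicit subset, for example `fields=["invoice_number", "total_amount"]`, would then restrict the output to those columns.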

## File Structure

```
llm/smart_data_extraction_llamaindex/
├── extract_receipts_pipeline.py # Main pipeline
├── schemas/
│ ├── __init__.py
│ └── receipt_schema.py # Receipt schema example
├── example_invoice.py # Invoice extraction example
└── README.md # This file
```

## Custom Schema Examples

See:
- `schemas/receipt_schema.py` - Receipt extraction
- `example_invoice.py` - Invoice extraction example

Create your own schemas in the `schemas/` directory!
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
"""Example: Using the generic extraction pipeline with a custom invoice schema."""

from typing import Optional

import pandas as pd

# Import the generic extraction pipeline
from extract_receipts_pipeline import main
from pydantic import BaseModel, Field


# Define custom schema for invoices
class Invoice(BaseModel):
    """Invoice extraction schema."""

    invoice_number: str = Field(description="Invoice number or ID")
    vendor_name: str = Field(description="Vendor or supplier name")
    invoice_date: Optional[str] = Field(default=None, description="Invoice date")
    total_amount: float = Field(description="Total invoice amount")
    tax_amount: Optional[float] = Field(default=None, description="Tax amount if present")


# Optional: Define transformation function
def transform_invoice_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform invoice data."""
    df = df.copy()
    df["vendor_name"] = df["vendor_name"].str.upper()
    df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
    df["tax_amount"] = pd.to_numeric(df["tax_amount"], errors="coerce")
    return df


# Define extraction prompt
INVOICE_PROMPT = """
You are extracting structured data from an invoice document.
Use the provided text to populate the Invoice model accurately.
If a field is not present in the document, return null.

{context_str}
"""


if __name__ == "__main__":
    # Example usage - replace with your actual invoice paths
    invoice_paths = [
        "path/to/invoice1.pdf",
        "path/to/invoice2.pdf",
    ]

    # Run extraction
    result_df = main(
        image_paths=invoice_paths,
        output_cls=Invoice,
        prompt=INVOICE_PROMPT,
        id_column="invoice_id",
        fields=["invoice_number", "vendor_name", "invoice_date", "total_amount", "tax_amount"],
        transform_fn=transform_invoice_data,
    )

    print("\nExtracted Invoices:")
    print(result_df)

    # Save results
    result_df.to_csv("extracted_invoices.csv", index=False)
    print("\nResults saved to extracted_invoices.csv")