Enterprise-Grade Receipt OCR & Extraction Powered by PaddleOCR-VL + LangChain
Powered by langchain-paddleocr - The Official LangChain Integration for PaddleOCR
- PaddleOCR-VL-1.5 Exclusive: Auto-detect seal types for authenticity verification
- ERNIE 4.5 Powered: Structured JSON output with smart categorization
- Production Ready: Tool-based architecture with Agent support
- Python 3.10 or higher
- uv (recommended) or pip
Visit PaddleOCR Official Website and register:
- Click "API" button on the model service page
- Get and copy your PaddleOCR-VL-1.5 and PP-OCRv5 credentials:
  - TOKEN (Access Token) - for API authentication
  - API_URL - Service endpoint address
💡 Tip: PaddleOCR supports free parsing of tens of thousands of document pages per day!
⚠️ Important: The ERNIE LLM token is the same as the PaddleOCR token - no extra setup needed!
With a single PaddleOCR token, you can use both:
- ✅ PaddleOCR-VL-1.5 (OCR Recognition + Seal Detection)
- ✅ ERNIE 4.5 (Intelligent Information Extraction)
# Install uv (if not installed)
pip install uv
# Clone the repository
git clone https://github.com/AIwork4me/smart-receipt-assistant.git
cd smart-receipt-assistant
# Install dependencies
uv sync
# Configure environment
cp .env.example .env
# Edit .env and fill in your PaddleOCR TOKEN and API_URL
# Launch Web UI
uv run python app.py

Edit .env:
# PaddleOCR Token (from www.paddleocr.com)
PADDLEOCR_ACCESS_TOKEN=your_token_here
PADDLEOCR_API_URL=your_api_url_here
# ERNIE uses the same token as PaddleOCR - no extra config needed!

uv run python app.py

Open http://localhost:7860 in your browser.
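The two variables from `.env` could be read and validated at startup roughly as follows. This is a minimal sketch, not the project's actual `src/config.py`: `PaddleOCRSettings` and `load_settings` are hypothetical names, and the code assumes the variables are already present in the process environment (it does not parse the `.env` file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Placeholder values copied from the .env snippet; treated as "not configured".
PLACEHOLDERS = {"your_token_here", "your_api_url_here"}


@dataclass(frozen=True)
class PaddleOCRSettings:
    """Hypothetical container for the two PaddleOCR credentials."""
    access_token: str
    api_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> PaddleOCRSettings:
    """Read PADDLEOCR_ACCESS_TOKEN and PADDLEOCR_API_URL and fail early if unset.

    `env` defaults to os.environ; passing a plain dict keeps the function testable.
    """
    env = os.environ if env is None else env
    values = {
        "PADDLEOCR_ACCESS_TOKEN": env.get("PADDLEOCR_ACCESS_TOKEN", "").strip(),
        "PADDLEOCR_API_URL": env.get("PADDLEOCR_API_URL", "").strip(),
    }
    # A variable counts as missing when it is empty or still a placeholder.
    missing = [name for name, value in values.items()
               if not value or value in PLACEHOLDERS]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return PaddleOCRSettings(
        access_token=values["PADDLEOCR_ACCESS_TOKEN"],
        api_url=values["PADDLEOCR_API_URL"],
    )
```

Failing early with the names of the missing variables tends to be easier to debug than an authentication error from the remote service later on.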
# Recognize single receipt
uv run python -m src.main recognize invoice.jpg
# Save results to JSON
uv run python -m src.main recognize invoice.jpg --output result.json
# Batch processing
uv run python -m src.main batch ./invoices/ --output ./results/

from src.agents import create_receipt_agent
# Create agent with your API key
agent = create_receipt_agent(api_key="your_token")
# Process receipt
result = agent.process("invoice.jpg")
print(result["output"])

from src.tools import ReceiptOCRTool, ReceiptExtractionTool
# Use as LangChain tools
ocr_tool = ReceiptOCRTool(api_key="your_token")
result = ocr_tool.invoke({"file_path": "invoice.jpg"})
print(result["text"])

| Type | OCR | Extraction | Seal Detection |
|---|---|---|---|
| VAT Special Invoice (增值税专用发票) | ✅ | ✅ | ✅ |
| VAT General Invoice (增值税普通发票) | ✅ | ✅ | ✅ |
| Train Ticket (火车票) | ✅ | ✅ | - |
| Taxi Receipt (出租车票) | ✅ | ✅ | - |
| Fixed Amount Invoice (定额发票) | ✅ | ✅ | ✅ |
| Other Receipts | ✅ | ✅ | ✅ |
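
The support matrix above can be expressed as a small lookup, for example to decide whether seal detection is worth running for a given receipt. This is an illustrative sketch only: the dictionary keys, `Capabilities`, and `should_check_seal` are hypothetical names, not identifiers from the project.

```python
from typing import NamedTuple


class Capabilities(NamedTuple):
    """Which processing steps apply to a receipt type (mirrors the table)."""
    ocr: bool
    extraction: bool
    seal_detection: bool


# Keys are hypothetical identifiers; the values follow the table row by row.
SUPPORTED_RECEIPTS = {
    "vat_special_invoice": Capabilities(True, True, True),
    "vat_general_invoice": Capabilities(True, True, True),
    "train_ticket": Capabilities(True, True, False),
    "taxi_receipt": Capabilities(True, True, False),
    "fixed_amount_invoice": Capabilities(True, True, True),
    "other": Capabilities(True, True, True),
}


def should_check_seal(receipt_type: str) -> bool:
    """Return True when seal detection applies; unknown types fall back to 'other'."""
    caps = SUPPORTED_RECEIPTS.get(receipt_type, SUPPORTED_RECEIPTS["other"])
    return caps.seal_detection


if __name__ == "__main__":
    for name in ("vat_special_invoice", "train_ticket"):
        print(name, should_check_seal(name))
```
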
| Seal Type | Description | Verification Value |
|---|---|---|
| 发票专用章 (Invoice Seal) | Official invoice seal | ⭐⭐⭐ High - Strong authenticity indicator |
| 财务专用章 (Finance Seal) | Finance department seal | ⭐⭐ Medium - Supporting evidence |
| 公章 (Company Seal) | Official company seal | ⭐⭐ Medium - Supporting evidence |
| 发票监制章 (Tax Authority Seal) | Pre-printed tax seal | ⭐ Low - Present on all invoices |
⚠️ Note: Seal recognition is an auxiliary verification method. For official verification, use the National Tax Verification Platform.
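
To show how the star ratings above might be used downstream, here is a hedged sketch that picks the strongest authenticity indicator among the seals detected on one receipt. It assumes detection yields a list of seal-type strings matching the table; that input shape, `SEAL_STARS`, and `strongest_seal` are assumptions for illustration, not the project's API. As the note says, the result is supporting evidence only.

```python
from typing import Iterable, Optional, Tuple

# Star ratings copied from the seal table (3 = high, 2 = medium, 1 = low).
SEAL_STARS = {
    "发票专用章": 3,  # Invoice Seal
    "财务专用章": 2,  # Finance Seal
    "公章": 2,        # Company Seal
    "发票监制章": 1,  # Tax Authority Seal (pre-printed on all invoices)
}


def strongest_seal(detected: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (seal_type, stars) for the highest-rated known seal, or None.

    Unknown seal types are ignored; ties keep the first seal encountered.
    """
    known = [(seal, SEAL_STARS[seal]) for seal in detected if seal in SEAL_STARS]
    if not known:
        return None
    return max(known, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(strongest_seal(["发票监制章", "发票专用章"]))
```
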
smart-receipt-assistant/
├── app.py # Gradio entry point
├── pyproject.toml # Project config (uv)
├── src/
│ ├── main.py # CLI entry
│ ├── config.py # Configuration
│ ├── langchain_compat.py # LangChain compatibility layer
│ ├── chains/ # LangChain Chains
│ │ ├── ocr_chain.py # OCR chain (langchain-paddleocr)
│ │ ├── extraction_chain.py # Information extraction
│ │ └── classification_chain.py
│ ├── tools/ # LangChain Tools
│ │ ├── ocr_tool.py # ReceiptOCRTool
│ │ ├── extraction_tool.py # ReceiptExtractionTool
│ │ └── classification_tool.py
│ ├── agents/ # LangChain Agents
│ │ └── receipt_agent.py # ReceiptAgentExecutor
│ ├── models/ # Pydantic models
│ └── utils/ # Utilities
├── examples/ # Example code & samples
├── tests/ # Test suite
└── docs/ # Documentation
uv run pytest tests/ -v

This project uses:
- Type hints (Python 3.10+)
- Pydantic v2 for data validation
- LangChain 1.0+ API
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- PaddleOCR - PaddlePaddle OCR Framework
- langchain-paddleocr - Official LangChain Integration
- LangChain - LLM Application Framework
- Baidu AIStudio - AI Development Platform
Made with ❤️ by AIwork4me

