Commit 7521bd4

knowledge: support ocr handling (#643)
1 parent 2c3e558 commit 7521bd4

46 files changed: +4059 / -529 lines

examples/knowledge/OCR/README.md

Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
# PDF OCR Knowledge Demo

This example demonstrates how to use trpc-agent-go's Knowledge module with OCR capabilities to process PDF documents and perform vector storage and retrieval using TCVector.

## Features

- ✅ PDF document reading (supports text and images)
- ✅ OCR text extraction (Tesseract)
- ✅ Automatic document chunking
- ✅ Vector storage with TCVector
- ✅ Semantic search and retrieval
- ✅ Interactive query interface

## Quick Start

```bash
# 1. Set required environment variables
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export TCVECTOR_URL="http://your-tcvector-host:port"
export TCVECTOR_USERNAME="your-username"
export TCVECTOR_PASSWORD="your-password"

# 2. Change into the example directory so that ./data resolves correctly
cd examples/knowledge/OCR

# 3. Prepare PDF files
mkdir -p ./data
cp /path/to/your/*.pdf ./data/

# 4. Install Tesseract OCR
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr libtesseract-dev

# macOS:
brew install tesseract

# 5. Run the example
go run main.go
```

## Prerequisites

### 1. OpenAI API Configuration (Required)

This example uses the OpenAI embedder for text vectorization. Configure the following environment variables:

```bash
# OpenAI API Key (Required)
export OPENAI_API_KEY="your-openai-api-key"

# OpenAI API Base URL (Required)
# For official OpenAI API:
export OPENAI_BASE_URL="https://api.openai.com/v1"

# For compatible third-party services (e.g., Azure OpenAI, local deployment),
# use this line instead of the one above:
# export OPENAI_BASE_URL="https://your-custom-endpoint/v1"
```

**Note**:
- Both `OPENAI_API_KEY` and `OPENAI_BASE_URL` are required environment variables
- The default embedding model is `text-embedding-3-small`
- Ensure your API endpoint supports this model

### 2. TCVector Configuration

You need a running TCVector instance with the following information:
- TCVector URL
- Username
- Password

Configure via environment variables or command-line parameters:
```bash
export TCVECTOR_URL="http://your-tcvector-host:port"
export TCVECTOR_USERNAME="your-username"
export TCVECTOR_PASSWORD="your-password"
```

### 3. Tesseract OCR Engine

Install Tesseract OCR:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr libtesseract-dev

# Install Chinese language pack (optional)
sudo apt-get install tesseract-ocr-chi-sim

# macOS
brew install tesseract

# Verify installation
tesseract --version
```

## Installation

```bash
cd examples/knowledge/OCR
go mod tidy
```

## Usage

### Prepare Data

Place PDF files in the `./data` directory (or specify another directory with `--data`):

```bash
mkdir -p ./data
cp /path/to/your/*.pdf ./data/
```

### Configure Environment Variables

Before running the program, ensure you have set the required environment variables:

```bash
# OpenAI API Configuration (Required)
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# TCVector Configuration (Optional, can also be specified via command-line parameters)
export TCVECTOR_URL="http://your-tcvector-host:port"
export TCVECTOR_USERNAME="your-username"
export TCVECTOR_PASSWORD="your-password"
```

### Basic Usage

```bash
# Ensure OPENAI_API_KEY and OPENAI_BASE_URL are set
go run main.go \
  --data=./data \
  --tcvector-url="$TCVECTOR_URL" \
  --tcvector-user="$TCVECTOR_USERNAME" \
  --tcvector-pass="$TCVECTOR_PASSWORD"
```

### Recreate Vector Store

To clear existing data and reload:
```bash
go run main.go \
  --data=./data \
  --tcvector-url="$TCVECTOR_URL" \
  --tcvector-user="$TCVECTOR_USERNAME" \
  --tcvector-pass="$TCVECTOR_PASSWORD" \
  --recreate
```

### Using the Convenience Script

The project provides a `run_example.sh` script to simplify execution:

```bash
# Edit the script to set environment variables
vim run_example.sh

# Run the example
./run_example.sh
```

## Command-Line Parameters

| Parameter | Description | Default | Environment Variable | Required |
|-----------|-------------|---------|---------------------|----------|
| `--data` | PDF files directory | `./data` | - | ❌ |
| `--tcvector-url` | TCVector service URL | - | `TCVECTOR_URL` | ✅ (or via environment variable) |
| `--tcvector-user` | TCVector username | - | `TCVECTOR_USERNAME` | ✅ (or via environment variable) |
| `--tcvector-pass` | TCVector password | - | `TCVECTOR_PASSWORD` | ✅ (or via environment variable) |
| `--recreate` | Recreate vector store | `true` | - | ❌ |

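The table pairs each TCVector flag with an environment variable. One common way to get that behavior is to use the environment value as the flag's default, so an explicit flag wins. The sketch below shows the pattern with the standard `flag` package; `parseConfig` and the `config` struct are hypothetical names, and the example's actual `main.go` may wire this differently.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config mirrors the parameters listed in the table (hypothetical type).
type config struct {
	dataDir, url, user, pass string
	recreate                 bool
}

// parseConfig parses args and uses getenv to supply defaults for the
// TCVector flags, so a flag given on the command line overrides the environment.
func parseConfig(args []string, getenv func(string) string) (config, error) {
	var c config
	fs := flag.NewFlagSet("ocr-demo", flag.ContinueOnError)
	fs.StringVar(&c.dataDir, "data", "./data", "PDF files directory")
	fs.StringVar(&c.url, "tcvector-url", getenv("TCVECTOR_URL"), "TCVector service URL")
	fs.StringVar(&c.user, "tcvector-user", getenv("TCVECTOR_USERNAME"), "TCVector username")
	fs.StringVar(&c.pass, "tcvector-pass", getenv("TCVECTOR_PASSWORD"), "TCVector password")
	fs.BoolVar(&c.recreate, "recreate", true, "Recreate vector store")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.url == "" || c.user == "" || c.pass == "" {
		return c, fmt.Errorf("TCVector URL, username, and password must be set via flags or environment")
	}
	return c, nil
}

func main() {
	c, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("data=%s url=%s recreate=%v\n", c.dataDir, c.url, c.recreate)
}
```

Passing `getenv` as a parameter keeps the function easy to test without touching the real process environment.
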
### Environment Variables (Required)

| Environment Variable | Description | Example Value | Required |
|---------------------|-------------|---------------|----------|
| `OPENAI_API_KEY` | OpenAI API key | `sk-...` | ✅ |
| `OPENAI_BASE_URL` | OpenAI API base URL | `https://api.openai.com/v1` | ✅ |
| `TCVECTOR_URL` | TCVector service URL | `http://localhost:8080` | ✅ (or via parameter) |
| `TCVECTOR_USERNAME` | TCVector username | `admin` | ✅ (or via parameter) |
| `TCVECTOR_PASSWORD` | TCVector password | `password` | ✅ (or via parameter) |

## Interactive Commands

After running, the program enters interactive query mode with the following commands:

- **Direct input**: Perform semantic search in PDF content
- **/stats**: Show knowledge base statistics
- **/exit**: Exit the program

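The command handling above can be sketched as a small read loop. This is an illustrative outline using only the standard library; the `action` function is a hypothetical name, and the search and statistics steps are placeholders rather than the example's real implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// action classifies one line of user input into "exit", "stats",
// "search", or "skip" (for blank input). Hypothetical helper.
func action(line string) string {
	switch strings.TrimSpace(line) {
	case "":
		return "skip"
	case "/exit":
		return "exit"
	case "/stats":
		return "stats"
	default:
		return "search"
	}
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	fmt.Print("Query: ")
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch action(line) {
		case "exit":
			fmt.Println("Goodbye!")
			return
		case "stats":
			fmt.Println("(knowledge base statistics would be printed here)")
		case "search":
			fmt.Printf("(semantic search for %q would run here)\n", line)
		}
		fmt.Print("Query: ")
	}
}
```
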
## Example Session

```
📄 PDF OCR Knowledge Demo
==============================================================
Data Directory: ./data
Vector Store: TCVector
Collection: pdf-ocr-1
==============================================================

🔧 Setting up knowledge base...
Creating Tesseract OCR engine...
Creating OpenAI embedder...
Creating TCVector store...
Creating directory source for PDFs in /path/to/data...
Creating knowledge base...

📚 Loading PDFs into knowledge base...
Progress: 100% | Time: 15.2s | Docs: 25
✅ Knowledge base loaded successfully in 15.2s

🔍 PDF Search Interface
==============================================================
💡 Commands:
/exit - Exit the program
/stats - Show knowledge base statistics

🎯 Try searching for content in your PDF:
- Enter any keywords or questions
- Search results will show matching text chunks

🔍 Query: What is machine learning?

🔎 Searching for: "What is machine learning?"
⏱️ Search completed in 234ms
📊 Found 5 results:
-------------------------------------------------------------

📄 Result #1 (Score: 0.8542)
Source: research_paper.pdf
Metadata: type=pdf, ocr_enabled=true, chunk_index=3
Content: Machine learning is a subset of artificial intelligence
that enables computers to learn from data without being explicitly
programmed. It involves algorithms that can identify patterns...

📄 Result #2 (Score: 0.7891)
Source: research_paper.pdf
Metadata: type=pdf, ocr_enabled=true, chunk_index=7
Content: Deep learning, a branch of machine learning, uses neural
networks with multiple layers to process complex data...

🔍 Query: /stats

📊 Knowledge Base Statistics
-------------------------------------------------------------
Total Documents: 25
OCR-Processed: 25
Total Characters: 45623
Avg Chars/Doc: 1825
Vector Store: TCVector
Collection: pdf-ocr-1

🔍 Query: /exit
👋 Goodbye!
```

## Workflow

1. **PDF Loading**: Read specified PDF files
2. **OCR Processing**:
   - Extract text layer content from PDF
   - Perform OCR on embedded images
   - Merge text and OCR results
   - Mark OCR content with `[OCR Image - Page X, Image Y]` tags
3. **Document Chunking**: Split long documents into appropriately sized chunks
4. **Vectorization**: Convert text to vectors using OpenAI Embedding API
5. **Storage**: Store document vectors in TCVector
6. **Retrieval**: Perform semantic search based on queries and return relevant results

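To make the merge step in the OCR stage concrete, here is a rough sketch of how text-layer content and tagged OCR results might be combined. The `ocrResult` type and `mergePage` function are hypothetical, and placing each tag on its own line before the recognized text is an assumption; only the tag format itself comes from the workflow above.

```go
package main

import (
	"fmt"
	"strings"
)

// ocrResult is a hypothetical record for text recognized in one embedded image.
type ocrResult struct {
	Page, Image int
	Text        string
}

// mergePage appends each non-empty OCR result to the page's text layer,
// prefixing it with an "[OCR Image - Page X, Image Y]" tag.
func mergePage(textLayer string, results []ocrResult) string {
	var b strings.Builder
	b.WriteString(strings.TrimSpace(textLayer))
	for _, r := range results {
		text := strings.TrimSpace(r.Text)
		if text == "" {
			continue // skip images where OCR recognized nothing
		}
		if b.Len() > 0 {
			b.WriteString("\n\n")
		}
		fmt.Fprintf(&b, "[OCR Image - Page %d, Image %d]\n%s", r.Page, r.Image, text)
	}
	return b.String()
}

func main() {
	merged := mergePage("Figure 1 shows the pipeline.", []ocrResult{
		{Page: 1, Image: 1, Text: "PDF -> OCR -> Chunks"},
	})
	fmt.Println(merged)
}
```

Keeping the tag in the merged text means a retrieved chunk still shows which page and image its OCR content came from.
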
## Technical Architecture

```
┌─────────────┐
│  PDF File   │
└──────┬──────┘
       │
       ▼
┌─────────────────────┐
│  PDF Reader         │
│  - Text Extraction  │
│  - Image Extraction │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  OCR Engine         │
│  - Tesseract        │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Chunking           │
│  - Fixed Size       │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Embedder           │
│  - OpenAI API       │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  TCVector Store     │
│  - Vector Storage   │
│  - Similarity Search│
└─────────────────────┘
```

## Performance Optimization Tips

1. **Document Chunking**:
   - Default chunk size: 1024 tokens
   - Adjust based on your use case

2. **Batch Processing**:
   - Use batch loading for multiple PDFs
   - Set appropriate concurrency levels

3. **Caching Strategy**:
   - Use `--recreate=false` to reuse the existing vector store instead of reloading
   - TCVector automatically caches vectors

4. **OCR Quality**:
   - Ensure PDF images have high resolution
   - Install appropriate language packs for Tesseract
   - Adjust confidence threshold if needed

### Use Other Vector Stores

Replace TCVector with other supported vector stores:
- InMemory (development/testing)
- PGVector (PostgreSQL)
- Elasticsearch

## Related Documentation

- [Knowledge Module Documentation](../../../knowledge/README.md)
- [OCR Module Documentation](../../../knowledge/ocr/README.md)
- [PDF Reader Documentation](../../../knowledge/document/reader/pdf/README.md)
- [TCVector Integration Documentation](../../../knowledge/vectorstore/tcvector/README.md)