Skip to content

Commit 2ae6852

Browse files
committed
trying enhanced text extraction
1 parent a0bfbe5 commit 2ae6852

9 files changed

+838
-36
lines changed

PDF_EXTRACTION_SUMMARY.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# PDF Extraction Solution Summary
2+
3+
## The Problem
4+
The original PDF text extraction was basic and lost all document structure, making it poor for RAG applications.
5+
6+
## The Solution
7+
We implemented **PyMuPDF4LLM** as the primary extraction method because research shows it's:
8+
9+
-**Fastest**: 15 seconds vs 2m30s for alternatives
10+
-**Most Accurate**: Purpose-built for LLM/RAG applications
11+
-**Best Structure Detection**: Automatically handles headers, tables, lists, formatting
12+
-**Completely Local**: No external service calls
13+
-**Lightweight**: Single dependency, no complex setup
14+
15+
## Why Not Multiple Methods?
16+
17+
Initially, we considered offering multiple extraction methods (PyMuPDF4LLM + Marker), but research revealed:
18+
19+
- **PyMuPDF4LLM consistently outperforms alternatives** in speed and accuracy
20+
- **Marker is slower and more complex** without providing better results for our use case
21+
- **One excellent tool is better than multiple mediocre options**
22+
23+
## What We Removed
24+
25+
- ❌ Complex regex patterns (libraries handle this automatically)
26+
- ❌ Custom font-size analysis (PyMuPDF4LLM does this better)
27+
- ❌ Manual heading detection (redundant)
28+
- ❌ Marker dependency (slower, less accurate)
29+
- ❌ Custom post-processing (PyMuPDF4LLM output is already clean)
30+
31+
## Final Architecture
32+
33+
```python
34+
def extract_full_text(self) -> str:
35+
# Just use PyMuPDF4LLM - it handles everything!
36+
return self.extract_high_quality_markdown()
37+
```
38+
39+
**That's it!** No regex patterns, no custom logic, no multiple methods. Just the best tool for the job.
40+
41+
## Key Insight
42+
43+
**Use specialized tools for specialized tasks.** PyMuPDF4LLM was specifically designed for converting PDFs to markdown for LLM applications. It does this one thing exceptionally well, making custom solutions unnecessary.

README.md

Lines changed: 138 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -1,51 +1,155 @@
1-
# Ragnarok - PDF Chat with Local LLM
1+
# Ragnarok - Enhanced PDF Processing
22

3-
A Streamlit-based application for chatting with PDF documents using local Large Language Models via Ollama. Features intelligent citation highlighting and multi-document chat sessions.
3+
A powerful PDF processing system with high-quality text extraction and structure preservation, optimized for LLM/RAG applications.
4+
5+
## Key Features
6+
7+
- **High-Quality Text Extraction**: Uses PyMuPDF4LLM for superior structure preservation
8+
- **Automatic Structure Detection**: Headers, tables, lists, and formatting automatically detected
9+
- **LLM/RAG Optimized**: Specifically designed for AI applications
10+
- **Local Processing**: All processing happens locally, no external service calls
11+
- **Citation Highlighting**: Smart PDF highlighting for AI-generated citations
12+
13+
## PDF Extraction Capabilities
14+
15+
The system uses **PyMuPDF4LLM** as the primary extraction method because it:
16+
17+
-**Automatically detects document structure** (headers, tables, lists)
18+
-**Preserves formatting** (bold, italic, etc.)
19+
-**Optimized for LLM applications**
20+
-**Fast and reliable** (15s vs 2m30s compared to alternatives)
21+
-**Local processing only**
22+
23+
### What You Get
24+
25+
- **Structured Markdown Output**: Headers marked with `#`, tables preserved, lists formatted
26+
- **Section Extraction**: Automatic document section detection
27+
- **Table of Contents**: Generated from document structure
28+
- **Citation Highlighting**: AI responses can highlight source text in PDFs
29+
30+
## Installation
31+
32+
1. **Install Dependencies**:
33+
```bash
34+
pip install -r requirements.txt
35+
```
36+
37+
2. **Or use Conda**:
38+
```bash
39+
conda env create -f environment.yml
40+
conda activate ragnarok
41+
```
442

543
## Quick Start
644

7-
### Prerequisites
8-
- **Python 3.8+**
9-
- **Ollama**: Install from [https://ollama.ai](https://ollama.ai)
45+
### Basic Usage
46+
47+
```python
48+
from ragnarok.enhanced_pdf_processor import EnhancedPDFProcessor
49+
50+
# Load PDF
51+
with open('document.pdf', 'rb') as f:
52+
pdf_bytes = f.read()
53+
54+
# Create processor
55+
processor = EnhancedPDFProcessor(pdf_bytes)
56+
57+
# Extract structured text
58+
structured_text = processor.extract_full_text()
59+
print(structured_text) # Markdown with headers, tables, lists
60+
61+
# Get document sections
62+
sections = processor.extract_sections()
63+
for section_name, content in sections.items():
64+
print(f"## {section_name}")
65+
print(content[:200] + "...")
66+
```
67+
68+
### Test the Extraction
1069

11-
### Local Setup
70+
Run the demo script to see the extraction in action:
1271

13-
1. **Install dependencies**
14-
```bash
15-
pip install -r requirements.txt
16-
```
72+
```bash
73+
python simplified_extraction_demo.py
74+
```
75+
76+
This will:
77+
- Find PDF files in the current directory
78+
- Extract text with full structure preservation
79+
- Show document sections and headers
80+
- Display extraction statistics
81+
82+
## Dependencies
83+
84+
### Core Libraries
85+
- **PyMuPDF4LLM** (>=0.0.5) - High-quality PDF to markdown conversion
86+
- **PyMuPDF** (>=1.23.0) - PDF processing and highlighting
87+
- **Streamlit** - Web interface
88+
- **Loguru** - Logging
89+
90+
### Why PyMuPDF4LLM?
1791

18-
2. **Start Ollama and pull a model**
19-
```bash
20-
ollama serve
21-
ollama pull olmo2:7b # or olmo2:13b for better performance
22-
```
92+
PyMuPDF4LLM was chosen as the primary extraction method because:
93+
94+
1. **Purpose-Built for LLM/RAG**: Specifically designed for AI applications
95+
2. **Superior Structure Detection**: Automatically handles headers, tables, lists
96+
3. **Performance**: Much faster than alternatives (15s vs 2m30s)
97+
4. **Reliability**: Consistent, high-quality output
98+
5. **Local Processing**: No external API calls required
99+
100+
## Example Output
101+
102+
**Before** (basic extraction):
103+
```
104+
Introduction This document describes the new system. Features The system has many features. Performance Tests show good performance.
105+
```
23106

24-
3. **Run the application**
25-
```bash
26-
streamlit run app.py
27-
```
107+
**After** (PyMuPDF4LLM):
108+
```markdown
109+
# Introduction
28110

29-
4. **Open browser** to `http://localhost:8501`
111+
This document describes the new system.
30112

31-
### Docker Setup
113+
## Features
114+
115+
The system has many features:
116+
- Feature 1
117+
- Feature 2
118+
- Feature 3
119+
120+
## Performance
121+
122+
Tests show good performance:
123+
124+
| Metric | Value |
125+
|--------|-------|
126+
| Speed | Fast |
127+
| Memory | Low |
128+
```
129+
130+
## Architecture
131+
132+
The system is built around a single, reliable extraction method:
133+
134+
```
135+
PDF Input → PyMuPDF4LLM → Structured Markdown → Sections/TOC
136+
137+
(fallback if needed)
138+
139+
Basic Text Extraction
140+
```
32141

33-
1. **Start Ollama with Docker-compatible configuration**
34-
```bash
35-
OLLAMA_HOST=0.0.0.0:11434 ollama serve
36-
```
142+
## Contributing
37143

38-
2. **Run with Docker Compose**
39-
```bash
40-
docker-compose up -d --build
41-
```
144+
1. Fork the repository
145+
2. Create a feature branch
146+
3. Make your changes
147+
4. Test with various PDF types
148+
5. Submit a pull request
42149

43-
## Usage
150+
## License
44151

45-
1. **Upload a PDF** using the file uploader
46-
2. **Ask questions** about the document
47-
3. **View citations** highlighted directly in the PDF viewer
48-
4. **Manage multiple chats** via the sidebar
152+
MIT License - see LICENSE file for details.
49153

50154
## Testing
51155

environment.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,4 +18,5 @@ dependencies:
1818
- PyMuPDF
1919
- loguru
2020
- tiktoken
21+
- pymupdf4llm
2122
- -e .

0 commit comments

Comments
 (0)