Skip to content

Commit 5abd886

Browse files
authored
Merge pull request #1 from clstaudt/extraction-update
Extraction-update
2 parents a0bfbe5 + 983eb6b commit 5abd886

15 files changed

+1483
-296
lines changed

DEVELOPMENT_INSTRUCTIONS.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,5 @@
44
- Make sure there are concise and up to date docstrings that document usage.
55
- Debug information belongs into the command line logs, not in the app UI/UX.
66
- Always develop a generic solution, do not use content from specific examples in the code
7-
- Never include content from example documents in the source code. Never leak content from provided examples into test code!
7+
- Never include content from example documents in the source code. Never leak content from provided examples into test code!
8+
- If you create new .py files for testing or debugging, place them in the experiments folder. Delete them after they are no longer useful. If they yield meaningful unit tests, integrate them into the test suite.

PDF_EXTRACTION_SUMMARY.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# PDF Extraction Solution Summary
2+
3+
## The Problem
4+
The original PDF text extraction was basic and lost all document structure, making it poor for RAG applications.
5+
6+
## The Solution
7+
We implemented **PyMuPDF4LLM** as the primary extraction method because research shows it's:
8+
9+
-**Fastest**: 15 seconds vs 2m30s for alternatives
10+
-**Most Accurate**: Purpose-built for LLM/RAG applications
11+
-**Best Structure Detection**: Automatically handles headers, tables, lists, formatting
12+
-**Completely Local**: No external service calls
13+
-**Lightweight**: Single dependency, no complex setup
14+
15+
## Why Not Multiple Methods?
16+
17+
Initially, we considered offering multiple extraction methods (PyMuPDF4LLM + Marker), but research revealed:
18+
19+
- **PyMuPDF4LLM consistently outperforms alternatives** in speed and accuracy
20+
- **Marker is slower and more complex** without providing better results for our use case
21+
- **One excellent tool is better than multiple mediocre options**
22+
23+
## What We Removed
24+
25+
- ❌ Complex regex patterns (libraries handle this automatically)
26+
- ❌ Custom font-size analysis (PyMuPDF4LLM does this better)
27+
- ❌ Manual heading detection (redundant)
28+
- ❌ Marker dependency (slower, less accurate)
29+
- ❌ Custom post-processing (PyMuPDF4LLM output is already clean)
30+
31+
## Final Architecture
32+
33+
```python
34+
def extract_full_text(self) -> str:
35+
# Just use PyMuPDF4LLM - it handles everything!
36+
return self.extract_high_quality_markdown()
37+
```
38+
39+
**That's it!** No regex patterns, no custom logic, no multiple methods. Just the best tool for the job.
40+
41+
## Key Insight
42+
43+
**Use specialized tools for specialized tasks.** PyMuPDF4LLM was specifically designed for converting PDFs to markdown for LLM applications. It does this one thing exceptionally well, making custom solutions unnecessary.

README.md

Lines changed: 138 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -1,51 +1,155 @@
1-
# Ragnarok - PDF Chat with Local LLM
1+
# Ragnarok - Enhanced PDF Processing
22

3-
A Streamlit-based application for chatting with PDF documents using local Large Language Models via Ollama. Features intelligent citation highlighting and multi-document chat sessions.
3+
A powerful PDF processing system with high-quality text extraction and structure preservation, optimized for LLM/RAG applications.
4+
5+
## Key Features
6+
7+
- **High-Quality Text Extraction**: Uses PyMuPDF4LLM for superior structure preservation
8+
- **Automatic Structure Detection**: Headers, tables, lists, and formatting automatically detected
9+
- **LLM/RAG Optimized**: Specifically designed for AI applications
10+
- **Local Processing**: All processing happens locally, no external service calls
11+
- **Citation Highlighting**: Smart PDF highlighting for AI-generated citations
12+
13+
## PDF Extraction Capabilities
14+
15+
The system uses **PyMuPDF4LLM** as the primary extraction method because it:
16+
17+
-**Automatically detects document structure** (headers, tables, lists)
18+
-**Preserves formatting** (bold, italic, etc.)
19+
-**Optimized for LLM applications**
20+
-**Fast and reliable** (15s vs 2m30s compared to alternatives)
21+
-**Local processing only**
22+
23+
### What You Get
24+
25+
- **Structured Markdown Output**: Headers marked with `#`, tables preserved, lists formatted
26+
- **Section Extraction**: Automatic document section detection
27+
- **Table of Contents**: Generated from document structure
28+
- **Citation Highlighting**: AI responses can highlight source text in PDFs
29+
30+
## Installation
31+
32+
1. **Install Dependencies**:
33+
```bash
34+
pip install -r requirements.txt
35+
```
36+
37+
2. **Or use Conda**:
38+
```bash
39+
conda env create -f environment.yml
40+
conda activate ragnarok
41+
```
442

543
## Quick Start
644

7-
### Prerequisites
8-
- **Python 3.8+**
9-
- **Ollama**: Install from [https://ollama.ai](https://ollama.ai)
45+
### Basic Usage
46+
47+
```python
48+
from ragnarok.enhanced_pdf_processor import EnhancedPDFProcessor
49+
50+
# Load PDF
51+
with open('document.pdf', 'rb') as f:
52+
pdf_bytes = f.read()
53+
54+
# Create processor
55+
processor = EnhancedPDFProcessor(pdf_bytes)
56+
57+
# Extract structured text
58+
structured_text = processor.extract_full_text()
59+
print(structured_text) # Markdown with headers, tables, lists
60+
61+
# Get document sections
62+
sections = processor.extract_sections()
63+
for section_name, content in sections.items():
64+
print(f"## {section_name}")
65+
print(content[:200] + "...")
66+
```
67+
68+
### Test the Extraction
1069

11-
### Local Setup
70+
Run the demo script to see the extraction in action:
1271

13-
1. **Install dependencies**
14-
```bash
15-
pip install -r requirements.txt
16-
```
72+
```bash
73+
python simplified_extraction_demo.py
74+
```
75+
76+
This will:
77+
- Find PDF files in the current directory
78+
- Extract text with full structure preservation
79+
- Show document sections and headers
80+
- Display extraction statistics
81+
82+
## Dependencies
83+
84+
### Core Libraries
85+
- **PyMuPDF4LLM** (>=0.0.5) - High-quality PDF to markdown conversion
86+
- **PyMuPDF** (>=1.23.0) - PDF processing and highlighting
87+
- **Streamlit** - Web interface
88+
- **Loguru** - Logging
89+
90+
### Why PyMuPDF4LLM?
1791

18-
2. **Start Ollama and pull a model**
19-
```bash
20-
ollama serve
21-
ollama pull olmo2:7b # or olmo2:13b for better performance
22-
```
92+
PyMuPDF4LLM was chosen as the primary extraction method because:
93+
94+
1. **Purpose-Built for LLM/RAG**: Specifically designed for AI applications
95+
2. **Superior Structure Detection**: Automatically handles headers, tables, lists
96+
3. **Performance**: Much faster than alternatives (15s vs 2m30s)
97+
4. **Reliability**: Consistent, high-quality output
98+
5. **Local Processing**: No external API calls required
99+
100+
## Example Output
101+
102+
**Before** (basic extraction):
103+
```
104+
Introduction This document describes the new system. Features The system has many features. Performance Tests show good performance.
105+
```
23106

24-
3. **Run the application**
25-
```bash
26-
streamlit run app.py
27-
```
107+
**After** (PyMuPDF4LLM):
108+
```markdown
109+
# Introduction
28110

29-
4. **Open browser** to `http://localhost:8501`
111+
This document describes the new system.
30112

31-
### Docker Setup
113+
## Features
114+
115+
The system has many features:
116+
- Feature 1
117+
- Feature 2
118+
- Feature 3
119+
120+
## Performance
121+
122+
Tests show good performance:
123+
124+
| Metric | Value |
125+
|--------|-------|
126+
| Speed | Fast |
127+
| Memory | Low |
128+
```
129+
130+
## Architecture
131+
132+
The system is built around a single, reliable extraction method:
133+
134+
```
135+
PDF Input → PyMuPDF4LLM → Structured Markdown → Sections/TOC
136+
137+
(fallback if needed)
138+
139+
Basic Text Extraction
140+
```
32141

33-
1. **Start Ollama with Docker-compatible configuration**
34-
```bash
35-
OLLAMA_HOST=0.0.0.0:11434 ollama serve
36-
```
142+
## Contributing
37143

38-
2. **Run with Docker Compose**
39-
```bash
40-
docker-compose up -d --build
41-
```
144+
1. Fork the repository
145+
2. Create a feature branch
146+
3. Make your changes
147+
4. Test with various PDF types
148+
5. Submit a pull request
42149

43-
## Usage
150+
## License
44151

45-
1. **Upload a PDF** using the file uploader
46-
2. **Ask questions** about the document
47-
3. **View citations** highlighted directly in the PDF viewer
48-
4. **Manage multiple chats** via the sidebar
152+
MIT License - see LICENSE file for details.
49153

50154
## Testing
51155

0 commit comments

Comments
 (0)