|
1 |
| -# Ragnarok - PDF Chat with Local LLM |
| 1 | +# Ragnarok - Enhanced PDF Processing |
2 | 2 |
|
3 |
| -A Streamlit-based application for chatting with PDF documents using local Large Language Models via Ollama. Features intelligent citation highlighting and multi-document chat sessions. |
| 3 | +A powerful PDF processing system with high-quality text extraction and structure preservation, optimized for LLM/RAG applications. |
| 4 | + |
| 5 | +## Key Features |
| 6 | + |
| 7 | +- **High-Quality Text Extraction**: Uses PyMuPDF4LLM for superior structure preservation |
| 8 | +- **Automatic Structure Detection**: Headers, tables, lists, and formatting automatically detected |
| 9 | +- **LLM/RAG Optimized**: Specifically designed for AI applications |
| 10 | +- **Local Processing**: All processing happens locally, no external service calls |
| 11 | +- **Citation Highlighting**: Smart PDF highlighting for AI-generated citations |
| 12 | + |
| 13 | +## PDF Extraction Capabilities |
| 14 | + |
| 15 | +The system uses **PyMuPDF4LLM** as the primary extraction method because it: |
| 16 | + |
| 17 | +- ✅ **Automatically detects document structure** (headers, tables, lists) |
| 18 | +- ✅ **Preserves formatting** (bold, italic, etc.) |
| 19 | +- ✅ **Optimized for LLM applications** |
| 20 | +- ✅ **Fast and reliable** (15s vs 2m30s for alternatives) |
| 21 | +- ✅ **Local processing only** |
| 22 | + |
| 23 | +### What You Get |
| 24 | + |
| 25 | +- **Structured Markdown Output**: Headers marked with `#`, tables preserved, lists formatted |
| 26 | +- **Section Extraction**: Automatic document section detection |
| 27 | +- **Table of Contents**: Generated from document structure |
| 28 | +- **Citation Highlighting**: AI responses can highlight source text in PDFs |
| 29 | + |
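A minimal sketch of how section extraction and a table of contents could be derived from Markdown output of the kind listed above. The names `HEADER_RE` and `split_sections` are hypothetical, chosen for illustration; they are not part of Ragnarok or PyMuPDF4LLM, and the project's own `extract_sections()` may work differently.

```python
import re

# ATX-style Markdown header: 1-6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_sections(markdown):
    """Return (toc, sections) for a Markdown string.

    toc is a list of (level, title) tuples in document order.
    sections maps each title to the text that follows it, up to the next header.
    Text before the first header is ignored, and a repeated title overwrites
    the earlier entry (both are simplifications of this sketch).
    """
    toc = []
    sections = {}
    title = None
    body = []
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the section that was open before this header.
            if title is not None:
                sections[title] = "\n".join(body).strip()
            title = match.group(2)
            toc.append((len(match.group(1)), title))
            body = []
        elif title is not None:
            body.append(line)
    # Close the final section.
    if title is not None:
        sections[title] = "\n".join(body).strip()
    return toc, sections


sample = (
    "# Introduction\n\n"
    "This document describes the new system.\n\n"
    "## Features\n\n"
    "- Feature 1\n"
    "- Feature 2\n"
)
toc, sections = split_sections(sample)
for level, title in toc:
    print("  " * (level - 1) + title)
```

This sketch does not skip `#` lines inside fenced code blocks, so real documents would need extra handling for that case.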
| 30 | +## Installation |
| 31 | + |
| 32 | +1. **Install Dependencies**: |
| 33 | +```bash |
| 34 | +pip install -r requirements.txt |
| 35 | +``` |
| 36 | + |
| 37 | +2. **Or use Conda**: |
| 38 | +```bash |
| 39 | +conda env create -f environment.yml |
| 40 | +conda activate ragnarok |
| 41 | +``` |
4 | 42 |
|
5 | 43 | ## Quick Start
|
6 | 44 |
|
7 |
| -### Prerequisites |
8 |
| -- **Python 3.8+** |
9 |
| -- **Ollama**: Install from [https://ollama.ai](https://ollama.ai) |
| 45 | +### Basic Usage |
| 46 | + |
| 47 | +```python |
| 48 | +from ragnarok.enhanced_pdf_processor import EnhancedPDFProcessor |
| 49 | + |
| 50 | +# Load PDF |
| 51 | +with open('document.pdf', 'rb') as f: |
| 52 | +    pdf_bytes = f.read() |
| 53 | + |
| 54 | +# Create processor |
| 55 | +processor = EnhancedPDFProcessor(pdf_bytes) |
| 56 | + |
| 57 | +# Extract structured text |
| 58 | +structured_text = processor.extract_full_text() |
| 59 | +print(structured_text) # Markdown with headers, tables, lists |
| 60 | + |
| 61 | +# Get document sections |
| 62 | +sections = processor.extract_sections() |
| 63 | +for section_name, content in sections.items(): |
| 64 | +    print(f"## {section_name}") |
| 65 | +    print(content[:200] + "...") |
| 66 | +``` |
| 67 | + |
| 68 | +### Test the Extraction |
10 | 69 |
|
11 |
| -### Local Setup |
| 70 | +Run the demo script to see the extraction in action: |
12 | 71 |
|
13 |
| -1. **Install dependencies** |
14 |
| - ```bash |
15 |
| - pip install -r requirements.txt |
16 |
| - ``` |
| 72 | +```bash |
| 73 | +python simplified_extraction_demo.py |
| 74 | +``` |
| 75 | + |
| 76 | +This will: |
| 77 | +- Find PDF files in the current directory |
| 78 | +- Extract text with full structure preservation |
| 79 | +- Show document sections and headers |
| 80 | +- Display extraction statistics |
| 81 | + |
| 82 | +## Dependencies |
| 83 | + |
| 84 | +### Core Libraries |
| 85 | +- **PyMuPDF4LLM** (>=0.0.5) - High-quality PDF to markdown conversion |
| 86 | +- **PyMuPDF** (>=1.23.0) - PDF processing and highlighting |
| 87 | +- **Streamlit** - Web interface |
| 88 | +- **Loguru** - Logging |
| 89 | + |
| 90 | +### Why PyMuPDF4LLM? |
17 | 91 |
|
18 |
| -2. **Start Ollama and pull a model** |
19 |
| - ```bash |
20 |
| - ollama serve |
21 |
| - ollama pull olmo2:7b # or olmo2:13b for better performance |
22 |
| - ``` |
| 92 | +PyMuPDF4LLM was chosen as the primary extraction method because: |
| 93 | + |
| 94 | +1. **Purpose-Built for LLM/RAG**: Specifically designed for AI applications |
| 95 | +2. **Superior Structure Detection**: Automatically handles headers, tables, lists |
| 96 | +3. **Performance**: Much faster than alternatives (15s vs 2m30s) |
| 97 | +4. **Reliability**: Consistent, high-quality output |
| 98 | +5. **Local Processing**: No external API calls required |
| 99 | + |
| 100 | +## Example Output |
| 101 | + |
| 102 | +**Before** (basic extraction): |
| 103 | +``` |
| 104 | +Introduction This document describes the new system. Features The system has many features. Performance Tests show good performance. |
| 105 | +``` |
23 | 106 |
|
24 |
| -3. **Run the application** |
25 |
| - ```bash |
26 |
| - streamlit run app.py |
27 |
| - ``` |
| 107 | +**After** (PyMuPDF4LLM): |
| 108 | +```markdown |
| 109 | +# Introduction |
28 | 110 |
|
29 |
| -4. **Open browser** to `http://localhost:8501` |
| 111 | +This document describes the new system. |
30 | 112 |
|
31 |
| -### Docker Setup |
| 113 | +## Features |
| 114 | + |
| 115 | +The system has many features: |
| 116 | +- Feature 1 |
| 117 | +- Feature 2 |
| 118 | +- Feature 3 |
| 119 | + |
| 120 | +## Performance |
| 121 | + |
| 122 | +Tests show good performance: |
| 123 | + |
| 124 | +| Metric | Value | |
| 125 | +|--------|-------| |
| 126 | +| Speed | Fast | |
| 127 | +| Memory | Low | |
| 128 | +``` |
| 129 | + |
| 130 | +## Architecture |
| 131 | + |
| 132 | +The system is built around a single primary extraction method, with basic text extraction as a fallback: |
| 133 | + |
| 134 | +``` |
| 135 | +PDF Input → PyMuPDF4LLM → Structured Markdown → Sections/TOC |
| 136 | +                 ↓ |
| 137 | +       (fallback if needed) |
| 138 | +                 ↓ |
| 139 | +       Basic Text Extraction |
| 140 | +``` |
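The fallback path in the diagram above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the project's implementation: `extract_with_fallback` is a hypothetical name, and the two extractors are passed in as plain callables so the pattern can be shown without depending on any PDF library.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)


def extract_with_fallback(
    pdf_bytes: bytes,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Try the primary extractor; use the fallback if it fails or returns nothing.

    Returns (text, method), where method is "primary" or "fallback".
    """
    try:
        text = primary(pdf_bytes)
        if text.strip():
            return text, "primary"
        logger.warning("Primary extraction returned no text; using fallback")
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Primary extraction failed (%s); using fallback", exc)
    return fallback(pdf_bytes), "fallback"


# Stub extractors standing in for the structured and basic extraction steps.
def failing_extractor(data: bytes) -> str:
    raise RuntimeError("simulated extraction failure")


def basic_extractor(data: bytes) -> str:
    return "plain text"


text, method = extract_with_fallback(b"", failing_extractor, basic_extractor)
print(method, "->", text)
```

Returning the method name alongside the text lets callers tell whether the output is structured Markdown or plain text before trying to split it into sections.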
32 | 141 |
|
33 |
| -1. **Start Ollama with Docker-compatible configuration** |
34 |
| - ```bash |
35 |
| - OLLAMA_HOST=0.0.0.0:11434 ollama serve |
36 |
| - ``` |
| 142 | +## Contributing |
37 | 143 |
|
38 |
| -2. **Run with Docker Compose** |
39 |
| - ```bash |
40 |
| - docker-compose up -d --build |
41 |
| - ``` |
| 144 | +1. Fork the repository |
| 145 | +2. Create a feature branch |
| 146 | +3. Make your changes |
| 147 | +4. Test with various PDF types |
| 148 | +5. Submit a pull request |
42 | 149 |
|
43 |
| -## Usage |
| 150 | +## License |
44 | 151 |
|
45 |
| -1. **Upload a PDF** using the file uploader |
46 |
| -2. **Ask questions** about the document |
47 |
| -3. **View citations** highlighted directly in the PDF viewer |
48 |
| -4. **Manage multiple chats** via the sidebar |
| 152 | +MIT License - see the LICENSE file for details. |
49 | 153 |
|
50 | 154 | ## Testing
|
51 | 155 |
|
|