Commit 453325e
committed
+ basic evidence highlighting works
1 parent f69a924 commit 453325e

11 files changed: +2134 -1 lines changed

PDF_HIGHLIGHTING_README.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
# 🎯 PyMuPDF PDF Highlighting Integration

This enhancement adds intelligent PDF highlighting capabilities to your Ollama chatbot, implementing the techniques described in the PyMuPDF integration document you shared.

## 🌟 Key Features

### 1. **Smart AI-Driven Highlighting**
- Automatically highlights text that the AI references in its responses
- **Shows evidence directly below AI messages** - no more hunting through pages!
- Extracts quoted text from AI responses using regex patterns
- Displays contextual snippets with page thumbnails

### 2. **Contextual Evidence Display**
- **Immediate visual proof**: Evidence appears right below each AI response
- **Text snippets**: Shows relevant excerpts with highlighted terms in bold
- **Page thumbnails**: Small page images showing exactly where the evidence comes from
- **Multi-page support**: Shows evidence from multiple pages if referenced

### 3. **Multiple Display Options**
- **Inline Evidence**: Contextual snippets below each message (primary UX)
- **Interactive PDF Viewer**: Full document with embedded highlights in expander
- **Page-by-Page Images**: Alternative view for detailed examination
- **Chat History**: Previous AI responses also show their evidence when enabled

### 4. **Performance Optimized**
- Efficient text search using PyMuPDF's quad-based highlighting
- Minimal memory footprint with proper resource cleanup
- Fast snippet generation for immediate display
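
The quad-based search and resource cleanup listed under "Performance Optimized" could look roughly like the sketch below. It is not the code in `enhanced_pdf_processor.py`: the function name `highlight_terms` and its return shape are assumptions made for illustration. It relies only on PyMuPDF's `Page.search_for(..., quads=True)` and `Page.add_highlight_annot()`.

```python
def highlight_terms(pdf_bytes: bytes, terms: list[str]) -> tuple[bytes, dict[int, int]]:
    """Highlight every occurrence of each term.

    Returns the annotated PDF as bytes plus a {page_number: hit_count} map.
    Hypothetical helper, shown for illustration only.
    """
    # Imported here so the helper can be defined where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    hits: dict[int, int] = {}
    # The context manager closes the document even if a search fails.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            for term in terms:
                # quads=True returns quadrilaterals instead of rectangles.
                quads = page.search_for(term, quads=True)
                if quads:
                    page.add_highlight_annot(quads)
                    page_no = page.number + 1  # page.number is 0-based
                    hits[page_no] = hits.get(page_no, 0) + len(quads)
        return doc.tobytes(), hits
```

Quads follow the outline of the matched text more closely than plain rectangles, which helps when text is rotated or skewed. The `with` block is one way to get the resource cleanup the list refers to.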

## 📦 Installation

1. **Install PyMuPDF** (if not already installed):
```bash
pip install PyMuPDF
```

2. **Update your requirements.txt**:
```
streamlit
ollama
streamlit-pdf-viewer
PyPDF2
pdfplumber
PyMuPDF  # <- Add this line
```

## 🚀 Usage Options

### Option 1: Use the Enhanced App (Recommended)

Replace your current `app.py` with `app_enhanced.py`:

```bash
# Backup your current app
cp app.py app_original.py

# Use the enhanced version
cp app_enhanced.py app.py

# Run with highlighting features
streamlit run app.py
```

### Option 2: Try the Demo First

Test the highlighting capabilities with the demo:

```bash
streamlit run demo_highlighting.py
```

## 🎛️ Configuration Options

### Highlighting Behavior

The enhanced app includes a **Smart Highlighting** toggle in the sidebar:
- **Enabled**: Shows evidence snippets below AI messages + highlights in full document
- **Disabled**: Standard PDF viewer without highlights

### 🎯 **New!** Contextual Evidence Display

When enabled, the app now shows:

1. **Below each AI message**:
   - Text snippet with highlighted terms in **bold**
   - Page thumbnail showing exact location
   - Page number reference

2. **In document expander**:
   - Full PDF with highlights (for detailed review)
   - Page-by-page view option
   - Highlight summary

3. **In chat history**:
   - Previous AI responses also show their evidence
   - Consistent highlighting across conversation

## 🔧 Technical Implementation

### New Architecture

```
enhanced_pdf_processor.py
├── EnhancedPDFProcessor                                # Main class for PDF processing
│   ├── get_highlighted_snippets()                      # NEW: Extract contextual snippets
│   ├── display_highlighted_snippets_below_message()    # NEW: Inline evidence display
│   ├── extract_text_with_positions()                   # Gets text with coordinates
│   ├── search_and_highlight_text()                     # Creates highlighted PDFs
│   ├── create_ai_response_highlights()                 # Extracts quotes from AI
│   └── display_highlighted_pdf_in_streamlit()          # Full document viewer
├── highlight_ai_referenced_text()                      # Helper function for full document
└── process_pdf_with_highlighting()                     # Processor creation
```

## 🎯 How It Works

### 1. **Contextual Evidence** (New Primary UX)
```python
# For each AI message that quotes text:
snippets = processor.get_highlighted_snippets(highlight_terms)
# Shows: text context + page thumbnail + location
processor.display_highlighted_snippets_below_message(ai_response, original_text)
```

### 2. **AI Response Analysis**
```python
# Find quoted text in AI responses
highlight_terms = processor.create_ai_response_highlights(ai_response, document_text)
# Uses regex to find: "quoted text", 'quoted text', `quoted text`
```
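
As a rough illustration of the regex step, the sketch below pulls quoted phrases out of a response and keeps only those that actually occur in the document. The patterns and the name `extract_quoted_terms` are assumptions for this example and may differ from what `create_ai_response_highlights()` does internally.

```python
import re

# Hypothetical patterns: double quotes, curly double quotes, backticks, and
# single quotes. The single-quote pattern ignores apostrophes inside words
# (such as "Master's") so they neither open nor close a quote.
QUOTE_PATTERNS = [
    r'"([^"]+)"',
    r"“([^”]+)”",
    r"`([^`]+)`",
    r"(?<!\w)'(.+?)'(?!\w)",
]


def extract_quoted_terms(ai_response: str, document_text: str, min_len: int = 3) -> list[str]:
    """Return quoted phrases from the response that also appear in the document."""
    doc_lower = document_text.lower()
    terms: list[str] = []
    for pattern in QUOTE_PATTERNS:
        for match in re.finditer(pattern, ai_response):
            term = match.group(1).strip()
            # Drop very short quotes, quotes absent from the PDF text, and duplicates.
            if len(term) >= min_len and term.lower() in doc_lower and term not in terms:
                terms.append(term)
    return terms
```

Checking each candidate against the document text is a design choice in this sketch: it avoids highlighting phrases the model put in quotes but that never appear in the PDF.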

### 3. **Visual Evidence Creation**
```python
# Create snippet with context and visual
snippet = {
    "term": highlighted_term,
    "page": page_number,
    "context": surrounding_text_with_bold_term,
    "page_image": thumbnail_of_page
}
```
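
One way the `context` field could be produced is sketched below: find the term in a page's text, keep a fixed window of characters on either side, and wrap the match in Markdown bold. The helper name `build_snippet` and the `window` parameter are hypothetical, and the thumbnail is left as `None` because rendering it requires PyMuPDF.

```python
import re
from typing import Optional


def build_snippet(page_text: str, term: str, page_number: int,
                  window: int = 60) -> Optional[dict]:
    """Return a snippet dict for the first case-insensitive match, or None."""
    match = re.search(re.escape(term), page_text, flags=re.IGNORECASE)
    if match is None:
        return None
    start, end = match.span()
    left = max(0, start - window)
    right = min(len(page_text), end + window)
    # Bold the matched text exactly as it appears on the page.
    context = (
        page_text[left:start]
        + "**" + page_text[start:end] + "**"
        + page_text[end:right]
    )
    return {
        "term": term,
        "page": page_number,
        "context": " ".join(context.split()),  # collapse line breaks and runs of spaces
        "page_image": None,  # thumbnail omitted in this sketch
    }
```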

## **User Experience Improvements**

### Before (Problems):
- ❌ Evidence hidden in separate document section
- ❌ Users had to manually find highlighted pages
- ❌ Evidence separated from AI claims
- ❌ Required scrolling and hunting

### After (Solutions):
- **Evidence appears immediately below each AI response**
- **Automatic page identification** - no hunting required
- **Contextual text snippets** with exact quotes highlighted
- **Page thumbnails** show visual proof
- **Consistent across chat history** - all messages show evidence
- **Optional full document view** for detailed analysis

## 🎮 **Example User Flow**

1. **User asks**: "What was his master's degree in?"

2. **AI responds**: "Christian Staudt holds a Master's degree in Computer Science. This is indicated by the mention 'Diplom (→Master's degree)' in his document."

3. **Evidence appears immediately**:
```
🎯 Evidence from Document:

Page 2:
> 2005-2012 Karlsruhe Institute of Technology (KIT), computer science studies
> – subjects: algorithm engineering, software engineering, compiler construction,
> parallel programming, advanced object-oriented programming, physics, sociology
> – **Diplom (→Master's degree)**

[Page 2 thumbnail showing highlighted text]
```

4. **User sees proof instantly** - no scrolling or searching required!
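
The evidence block in step 3 can be assembled from snippet dicts with plain string formatting. The sketch below assumes each snippet has `page` and `context` keys, as in the snippet structure shown earlier; `format_evidence` is a hypothetical helper, not a function from the enhanced app, and it leaves out the thumbnail.

```python
def format_evidence(snippets: list[dict]) -> str:
    """Render snippets as a Markdown evidence block, grouped by page."""
    if not snippets:
        return ""
    # Group contexts by page so each page heading appears once.
    by_page: dict[int, list[str]] = {}
    for snippet in snippets:
        by_page.setdefault(snippet["page"], []).append(snippet["context"])

    lines = ["🎯 Evidence from Document:", ""]
    for page in sorted(by_page):
        lines.append(f"Page {page}:")
        for context in by_page[page]:
            # Prefix every line so multi-line contexts stay inside the blockquote.
            lines.extend(f"> {line}" for line in context.splitlines())
        lines.append("")
    return "\n".join(lines).rstrip()
```

In a Streamlit app the returned string could be passed to `st.markdown()` directly below the AI message.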

## 💡 Integration Tips

1. **The evidence is now immediate** - users see proof right after AI claims
2. **Page thumbnails provide visual confirmation** of exact location
3. **Full document remains available** in expander for detailed review
4. **Chat history shows evidence** for all previous AI responses
5. **Toggle highlighting** in sidebar to switch between modes

---

This enhancement transforms your PDF chatbot into an **intelligent document analysis system with immediate visual proof** for every AI claim! 🎉
Binary file not shown.