Commit 5e44363

Luboslav Yordanov and claude authored and committed
feat: Enhanced adaptive speaker scraper with merged strategy
- Fixed model ID bug (strip openai/ prefix)
- Made max_tokens configurable for image extraction
- Enhanced screenshot scrolling to capture full pages
- Merged SmartScraperGraph + ScreenshotScraperGraph results
- Added hallucination filter for fake speakers
- Improved prompt to work with OpenAI content policies
- Added lazy-load scrolling support (timeout-based)
- Created FastAPI backend with web UI
- Added Excel export with metadata

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 739b05a commit 5e44363

60 files changed: +4964 -52 lines changed

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# 🎯 Adaptive Speaker Scraper

Intelligent scraper that automatically detects website type and chooses the optimal scraping strategy.

## 🧠 How It Works

The scraper analyzes each website and classifies it as one of three types:

### 1. **Pure HTML**
- ✅ All speaker data is in the HTML text
- 💰 **Strategy**: `SmartScraperGraph` (cheapest, fastest)
- 📊 **Detection**: Completeness score ≥ 80%

### 2. **Mixed Content**
- ✅ Some data in HTML, some in images
- 💰 **Strategy**: `OmniScraperGraph` (selective image processing)
- 📊 **Detection**: Completeness score from 30% up to (but not including) 80%, plus a significant number of images
- 🎯 Only processes relevant images (not all)

### 3. **Pure Images**
- ✅ All data embedded in images/widgets
- 💰 **Strategy**: `ScreenshotScraperGraph` (full page screenshot)
- 📊 **Detection**: Completeness score < 30% or no speakers found
- 🎯 Sends 2 screenshots instead of 40+ individual images

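The three detection rules above can be sketched as a small classifier. This is a minimal sketch, not the scraper's actual `detect_website_type`: the names `classify_website` and `completeness_score`, the `min_images` cutoff standing in for "significant images", and the field-fill definition of completeness are all assumptions made for illustration. Only the 80% and 30% thresholds and the `WebsiteType` member names come from this README.

```python
from enum import Enum
from typing import Dict, List, Sequence


class WebsiteType(Enum):
    PURE_HTML = "pure_html"
    MIXED_CONTENT = "mixed_content"
    PURE_IMAGES = "pure_images"


def completeness_score(
    speakers: List[Dict[str, str]],
    fields: Sequence[str] = ("full_name", "company", "position"),
) -> float:
    """Assumed metric: share of non-empty fields across all extracted speakers."""
    if not speakers:
        return 0.0
    filled = sum(1 for s in speakers for f in fields if s.get(f))
    return filled / (len(speakers) * len(fields))


def classify_website(
    completeness: float,
    num_speakers: int,
    num_images: int,
    min_images: int = 10,  # hypothetical cutoff for "significant images"
) -> WebsiteType:
    """Apply the README thresholds: >= 0.8 HTML, [0.3, 0.8) mixed, < 0.3 images."""
    if num_speakers == 0 or completeness < 0.3:
        return WebsiteType.PURE_IMAGES
    if completeness >= 0.8:
        return WebsiteType.PURE_HTML
    if num_images >= min_images:
        return WebsiteType.MIXED_CONTENT
    # A 30-80% score with few images is not covered by the rules above;
    # keeping the cheap HTML result here is an assumption of this sketch.
    return WebsiteType.PURE_HTML
```

For example, `classify_website(0.45, num_speakers=12, num_images=24)` returns `WebsiteType.MIXED_CONTENT`, while any page with no speakers found is routed to `WebsiteType.PURE_IMAGES` regardless of its score.
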
## 🚀 Usage

### Basic Example

```python
from adaptive_speaker_scraper import scrape_with_optimal_strategy
from pydantic import BaseModel, Field
from typing import List


class Speaker(BaseModel):
    full_name: str = Field(default="")
    company: str = Field(default="")
    position: str = Field(default="")


class SpeakerScrapeResult(BaseModel):
    speakers: List[Speaker] = Field(default_factory=list)


config = {
    "llm": {
        "api_key": "your-openai-key",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
}

result = scrape_with_optimal_strategy(
    url="https://example.com/speakers",
    prompt="Extract all speakers with their names, companies, and positions",
    config=config,
    schema=SpeakerScrapeResult,
)

print(f"Strategy used: {result['strategy_used']}")
print(f"Speakers found: {len(result['data']['speakers'])}")
```

### Run Demo

```bash
python examples/adaptive_speaker_scraper.py
```

## 🎛️ Decision Flow

```
Start
  ↓
Run SmartScraperGraph (fast, cheap)
  ↓
Analyze results:
  - Completeness score
  - Number of speakers
  - Number of images
  ↓
┌─────────────────────┐
│ Completeness ≥ 80%? │ → YES → ✅ Use SmartScraperGraph result
└─────────────────────┘
  ↓ NO
┌────────────────────────────────┐
│ 30-80% complete + many images? │ → YES → 🔄 Re-run with OmniScraperGraph
└────────────────────────────────┘
  ↓ NO
┌───────────────────────┐
│ Very low data (<30%)? │ → YES → 📸 Use ScreenshotScraperGraph
└───────────────────────┘
```

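The flow above amounts to "run the cheap scraper first, escalate only if its result is thin". A minimal sketch of that control flow follows, with the three scrapers passed in as plain callables so the example runs without any LLM calls. The function name `scrape_adaptively`, the `min_images` cutoff, the assumed shape of each scraper's return value, and the fallback for the case the diagram leaves open are all assumptions, not the actual `scrape_with_optimal_strategy` implementation.

```python
from typing import Any, Callable, Dict

# Assumed shape: each scraper takes a URL and returns
# {"analysis": {...}, "data": {...}}.
Scraper = Callable[[str], Dict[str, Any]]


def scrape_adaptively(
    url: str,
    smart: Scraper,
    omni: Scraper,
    screenshot: Scraper,
    min_images: int = 10,  # hypothetical cutoff for "many images"
) -> Dict[str, Any]:
    """Run the cheap scraper first and escalate only when its result is thin."""
    first = smart(url)
    analysis = first["analysis"]
    score = analysis["completeness_score"]

    if score >= 0.8:
        strategy, result = "SmartScraperGraph", first
    elif score >= 0.3 and analysis["num_images_detected"] >= min_images:
        strategy, result = "OmniScraperGraph", omni(url)
    elif score < 0.3:
        strategy, result = "ScreenshotScraperGraph", screenshot(url)
    else:
        # Middling score but few images: the diagram leaves this case open,
        # so this sketch keeps the first result (an assumption).
        strategy, result = "SmartScraperGraph", first

    return {
        "url": url,
        "strategy_used": strategy,
        "analysis": analysis,  # analysis of the initial cheap pass
        "data": result["data"],
    }
```

Injecting the scrapers keeps the decision logic testable in isolation: stub functions returning fixed `completeness_score` values are enough to exercise every branch.
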
## 💰 Cost Comparison

### Example: 40 speakers on a page

| Website Type | Strategy | API Calls | Cost (approx., USD) |
|--------------|----------|-----------|---------------------|
| Pure HTML | SmartScraperGraph | 1-2 text calls | $0.01 |
| Mixed Content | OmniScraperGraph | 1 text + 20 images | $0.30 |
| Pure Images | ScreenshotScraperGraph | 1 text + 2 screenshots | $0.05 |

**Without adaptive detection**: Always running OmniScraperGraph on all images would cost **$0.50+** per page

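To see how these per-page figures add up over a batch, the arithmetic can be sketched as below. The per-page costs are the approximate values from the table (for a page with about 40 speakers) and the always-Omni figure is the "$0.50+" lower bound, so the output is a rough estimate only. The function name `batch_cost_cents` and the example page mix are hypothetical. Amounts are kept in integer cents to avoid floating-point rounding.

```python
from typing import Dict, Tuple

# Approximate per-page costs from the table above, in US cents.
COST_CENTS = {"pure_html": 1, "mixed_content": 30, "pure_images": 5}
# Lower bound for always running OmniScraperGraph on all images ("$0.50+").
ALWAYS_OMNI_CENTS = 50


def batch_cost_cents(page_counts: Dict[str, int]) -> Tuple[int, int]:
    """Return (adaptive_cost, always_omni_cost) in cents for a batch of pages."""
    adaptive = sum(COST_CENTS[kind] * n for kind, n in page_counts.items())
    baseline = ALWAYS_OMNI_CENTS * sum(page_counts.values())
    return adaptive, baseline


# Hypothetical batch: 10 HTML pages, 5 mixed pages, 5 image-only pages.
adaptive, baseline = batch_cost_cents(
    {"pure_html": 10, "mixed_content": 5, "pure_images": 5}
)
print(f"adaptive: ${adaptive / 100:.2f}, always Omni: ${baseline / 100:.2f}")
```

For that batch the adaptive total is 10·1 + 5·30 + 5·5 = 185 cents against at least 20·50 = 1000 cents for the always-Omni baseline.
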
## 🔧 Customization

### Adjust Detection Thresholds

```python
# In detect_website_type function:

# More conservative (prefer cheaper strategies):
# more pages are accepted as PURE_HTML
if completeness >= 0.7:  # Lower from 0.8
    website_type = WebsiteType.PURE_HTML

# More aggressive image processing:
# pages scoring below 0.5 now fall through to the next branch
elif completeness >= 0.5:  # Higher from 0.3
    website_type = WebsiteType.MIXED_CONTENT
```

### Control Image Processing

```python
# In scrape_with_optimal_strategy:
omni_config["max_images"] = min(
    analysis.get("num_images_detected", 10),
    20,  # Limit to 20 images maximum
)
```

## 📊 Output Format

```json
{
  "url": "https://example.com/speakers",
  "website_type": "mixed_content",
  "strategy_used": "OmniScraperGraph",
  "analysis": {
    "completeness_score": 0.45,
    "num_speakers_found": 12,
    "num_images_detected": 24
  },
  "data": {
    "event": { ... },
    "speakers": [ ... ]
  }
}
```

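A result in this shape can be condensed into a one-line report, which is handy when processing many URLs. The helper below is a minimal sketch: `summarize_result` is a hypothetical name, and it reads only the keys shown above, using `.get` so that a missing `analysis` or `speakers` entry does not raise.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line summary from a result dict in the format shown above."""
    analysis = result.get("analysis", {})
    speakers = result.get("data", {}).get("speakers", [])
    score = analysis.get("completeness_score", 0.0)
    return (
        f"{result['url']}: {result['strategy_used']} "
        f"({score:.0%} complete, {len(speakers)} speakers)"
    )


example = {
    "url": "https://example.com/speakers",
    "website_type": "mixed_content",
    "strategy_used": "OmniScraperGraph",
    "analysis": {"completeness_score": 0.45, "num_images_detected": 24},
    "data": {"speakers": [{"full_name": "Ada"}, {"full_name": "Grace"}]},
}
print(summarize_result(example))
```

Note that `completeness_score` comes from the analysis step, so it describes what the initial pass found, while the speaker count reflects the final `data`.
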
## 🎯 Best Practices

1. **Start with gpt-4o-mini** for initial detection (cheap)
2. **Upgrade to gpt-4o** if PURE_IMAGES is detected (better vision)
3. **Cache results** to avoid re-analyzing the same URLs
4. **Batch process** multiple URLs to optimize API usage

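Caching (practice 3) can be as simple as one JSON file per URL. The sketch below wraps any scrape function in a disk cache. The name `cached_scrape`, the `.scrape_cache` directory, and the SHA-256 file naming are assumptions for illustration, and it presumes the result is JSON-serializable, as the output format suggests.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict


def cached_scrape(
    url: str,
    scrape: Callable[[str], Dict[str, Any]],
    cache_dir: Path = Path(".scrape_cache"),  # hypothetical location
) -> Dict[str, Any]:
    """Return the cached result for url, calling scrape(url) only on a miss."""
    # Hash the URL so it maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = scrape(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

In practice `scrape` would be a small wrapper such as `lambda u: scrape_with_optimal_strategy(url=u, prompt=..., config=config, schema=...)`. The cache never expires in this sketch, so stale entries have to be deleted by hand when a speaker page changes.
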
## 🐛 Troubleshooting

### "Not enough speakers extracted"
- The page might be PURE_IMAGES but detected as MIXED_CONTENT
- Solution: Raise the lower completeness cutoff (0.3 by default) so the page falls into PURE_IMAGES and is handled by ScreenshotScraperGraph

### "Too expensive"
- Reduce `max_images` in OmniScraperGraph
- Or force ScreenshotScraperGraph for image-heavy pages

### "Missing some speakers"
- Increase `max_images` for MIXED_CONTENT sites
- Or use scroll/wait options in config for lazy-loaded content

## 📚 Related Examples

- `examples/frontend/batch_speaker_app.py` - Streamlit UI with manual strategy selection
- `examples/smart_scraper_graph/` - Text-only extraction examples
- `examples/omni_scraper_graph/` - Image+text extraction examples

---

**Key Advantage**: Automatically balances cost vs accuracy without manual intervention! 🎉
