| 1 | +# 🎯 Adaptive Speaker Scraper |
| 2 | + |
| 3 | +Intelligent scraper that automatically detects website type and chooses the optimal scraping strategy. |
| 4 | + |
| 5 | +## 🧠 How It Works |
| 6 | + |
| 7 | +The scraper analyzes each website and classifies it as one of three types: |
| 8 | + |
| 9 | +### 1. **Pure HTML** |
| 10 | +- ✅ All speaker data in HTML text |
| 11 | +- 💰 **Strategy**: `SmartScraperGraph` (cheapest, fastest) |
| 12 | +- 📊 **Detection**: Completeness score ≥ 80% |
| 13 | + |
| 14 | +### 2. **Mixed Content** |
| 15 | +- ✅ Some data in HTML, some in images |
| 16 | +- 💰 **Strategy**: `OmniScraperGraph` (selective image processing) |
| 17 | +- 📊 **Detection**: Completeness score of at least 30% and below 80%, plus a significant number of images |
| 18 | +- 🎯 Processes only the relevant images, not every image on the page |
| 19 | + |
| 20 | +### 3. **Pure Images** |
| 21 | +- ✅ All data embedded in images/widgets |
| 22 | +- 💰 **Strategy**: `ScreenshotScraperGraph` (full page screenshot) |
| 23 | +- 📊 **Detection**: Completeness score < 30% or no speakers found |
| 24 | +- 🎯 Sends 2 screenshots instead of 40+ individual images |
| 25 | + |
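The three detection rules above can be sketched as a single function. This is a minimal illustration, not the scraper's actual `detect_website_type` code: the function name `classify_website`, the `min_images` cutoff for "significant images", and the fallback for mid-range pages with few images are all assumptions.

```python
from enum import Enum


class WebsiteType(Enum):
    PURE_HTML = "pure_html"
    MIXED_CONTENT = "mixed_content"
    PURE_IMAGES = "pure_images"


def classify_website(completeness, num_speakers, num_images, min_images=5):
    """Map analysis numbers to a website type (hypothetical helper).

    completeness is a fraction in [0, 1]; min_images is an assumed cutoff
    for what counts as "significant images".
    """
    # No speakers at all, or very little data: everything is likely in images.
    if num_speakers == 0 or completeness < 0.3:
        return WebsiteType.PURE_IMAGES
    # Text extraction already recovered most fields.
    if completeness >= 0.8:
        return WebsiteType.PURE_HTML
    # Mid-range completeness with many images: some data lives in images.
    if num_images >= min_images:
        return WebsiteType.MIXED_CONTENT
    # Assumption: mid-range completeness but few images keeps the text result.
    return WebsiteType.PURE_HTML


print(classify_website(0.45, 12, 24))
```

Checking the low-data case first keeps the "no speakers found" rule from being shadowed by the other thresholds.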
| 26 | +## 🚀 Usage |
| 27 | + |
| 28 | +### Basic Example |
| 29 | + |
| 30 | +```python |
| 31 | +from adaptive_speaker_scraper import scrape_with_optimal_strategy |
| 32 | +from pydantic import BaseModel, Field |
| 33 | +from typing import List |
| 34 | + |
| 35 | +class Speaker(BaseModel): |
| 36 | +    full_name: str = Field(default="") |
| 37 | +    company: str = Field(default="") |
| 38 | +    position: str = Field(default="") |
| 39 | + |
| 40 | +class SpeakerScrapeResult(BaseModel): |
| 41 | +    speakers: List[Speaker] = Field(default_factory=list) |
| 42 | + |
| 43 | +config = { |
| 44 | +    "llm": { |
| 45 | +        "api_key": "your-openai-key", |
| 46 | +        "model": "openai/gpt-4o-mini", |
| 47 | +    }, |
| 48 | +    "verbose": True, |
| 49 | +} |
| 50 | + |
| 51 | +result = scrape_with_optimal_strategy( |
| 52 | +    url="https://example.com/speakers", |
| 53 | +    prompt="Extract all speakers with their names, companies, and positions", |
| 54 | +    config=config, |
| 55 | +    schema=SpeakerScrapeResult, |
| 56 | +) |
| 57 | + |
| 58 | +print(f"Strategy used: {result['strategy_used']}") |
| 59 | +print(f"Speakers found: {len(result['data']['speakers'])}") |
| 60 | +``` |
| 61 | + |
| 62 | +### Run Demo |
| 63 | + |
| 64 | +```bash |
| 65 | +python examples/adaptive_speaker_scraper.py |
| 66 | +``` |
| 67 | + |
| 68 | +## 🎛️ Decision Flow |
| 69 | + |
| 70 | +``` |
| 71 | +Start |
| 72 | + ↓ |
| 73 | +Run SmartScraperGraph (fast, cheap) |
| 74 | + ↓ |
| 75 | +Analyze results: |
| 76 | + - Completeness score |
| 77 | + - Number of speakers |
| 78 | + - Number of images |
| 79 | + ↓ |
| 80 | +┌─────────────────────┐ |
| 81 | +│ Completeness ≥ 80%? │ → YES → ✅ Use SmartScraperGraph result |
| 82 | +└─────────────────────┘ |
| 83 | + ↓ NO |
| 84 | +┌─────────────────────────────────┐ |
| 85 | +│ 30-80% complete + many images? │ → YES → 🔄 Re-run with OmniScraperGraph |
| 86 | +└─────────────────────────────────┘ |
| 87 | + ↓ NO |
| 88 | +┌──────────────────────────────┐ |
| 89 | +│ Very low data (<30%)? │ → YES → 📸 Use ScreenshotScraperGraph |
| 90 | +└──────────────────────────────┘ |
| 91 | +``` |
| 92 | + |
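The flow above depends on a completeness score, which this README does not define. One plausible reading, shown here purely as an assumption, is the fraction of non-empty fields across all extracted speakers. The function name `completeness_score` is hypothetical; the field names come from the `Speaker` model in the basic example.

```python
def completeness_score(speakers, fields=("full_name", "company", "position")):
    """Return the fraction of non-empty fields across all speakers.

    Assumed definition for illustration; the real scraper may weight
    fields differently. speakers is a list of dicts.
    """
    if not speakers:
        return 0.0
    filled = sum(
        1
        for speaker in speakers
        for field in fields
        # Treat missing keys, None, and whitespace-only strings as empty.
        if str(speaker.get(field) or "").strip()
    )
    return filled / (len(speakers) * len(fields))


sample = [
    {"full_name": "Ada Lovelace", "company": "Analytical Engines", "position": "Engineer"},
    {"full_name": "Grace Hopper", "company": "", "position": None},
]
print(f"{completeness_score(sample):.2f}")
```

Under this definition, a page where names are in the HTML but companies and positions sit inside images would score about 0.33 and land in the mixed-content range.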
| 93 | +## 💰 Cost Comparison |
| 94 | + |
| 95 | +### Example: 40 speakers on a page |
| 96 | + |
| 97 | +| Website Type | Strategy | API Calls | Cost (approx) | |
| 98 | +|-------------|----------|-----------|---------------| |
| 99 | +| Pure HTML | SmartScraperGraph | 1-2 text calls | $0.01 | |
| 100 | +| Mixed Content | OmniScraperGraph | 1 text + 20 images | $0.30 | |
| 101 | +| Pure Images | ScreenshotScraperGraph | 1 text + 2 screenshots | $0.05 | |
| 102 | + |
| 103 | +**Without adaptive detection**: Always running OmniScraperGraph on every image would cost **$0.50+** for the same page. |
| 104 | + |
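To put the table in relative terms, the short calculation below compares each approximate cost with the $0.50 baseline. It only restates the figures above; because the baseline is quoted as "$0.50+", the percentages are lower bounds.

```python
BASELINE = 0.50  # approximate cost of always using OmniScraperGraph on all images

# Approximate per-page costs taken from the table above.
costs = {"Pure HTML": 0.01, "Mixed Content": 0.30, "Pure Images": 0.05}


def savings_percent(cost, baseline=BASELINE):
    """Return the saving relative to the baseline, as a whole percent."""
    return round((baseline - cost) / baseline * 100)


for site_type, cost in costs.items():
    print(f"{site_type}: at least {savings_percent(cost)}% cheaper")
```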
| 105 | +## 🔧 Customization |
| 106 | + |
| 107 | +### Adjust Detection Thresholds |
| 108 | + |
| 109 | +```python |
| 110 | +# In detect_website_type function: |
| 111 | + |
| 112 | +# Lower the PURE_HTML cutoff so more pages keep the cheap text-only result |
| 113 | +if completeness >= 0.7:  # Lowered from 0.8 |
| 114 | +    website_type = WebsiteType.PURE_HTML |
| 115 | + |
| 116 | +# Raise the MIXED_CONTENT floor so more low-completeness pages go to screenshots |
| 117 | +elif completeness >= 0.5:  # Raised from 0.3 |
| 118 | +    website_type = WebsiteType.MIXED_CONTENT |
| 119 | +``` |
| 120 | + |
| 121 | +### Control Image Processing |
| 122 | + |
| 123 | +```python |
| 124 | +# In scrape_with_optimal_strategy: |
| 125 | +omni_config["max_images"] = min( |
| 126 | +    analysis.get("num_images_detected", 10), |
| 127 | +    20,  # Limit to 20 images maximum |
| 128 | +) |
| 129 | +``` |
| 130 | + |
| 131 | +## 📊 Output Format |
| 132 | + |
| 133 | +```json |
| 134 | +{ |
| 135 | +  "url": "https://example.com/speakers", |
| 136 | +  "website_type": "mixed_content", |
| 137 | +  "strategy_used": "OmniScraperGraph", |
| 138 | +  "analysis": { |
| 139 | +    "completeness_score": 0.45, |
| 140 | +    "num_speakers_found": 12, |
| 141 | +    "num_images_detected": 24 |
| 142 | +  }, |
| 143 | +  "data": { |
| 144 | +    "event": { ... }, |
| 145 | +    "speakers": [ ... ] |
| 146 | +  } |
| 147 | +} |
| 148 | +``` |
| 149 | + |
| 150 | +## 🎯 Best Practices |
| 151 | + |
| 152 | +1. **Start with gpt-4o-mini** for initial detection (cheap) |
| 153 | +2. **Upgrade to gpt-4o** if PURE_IMAGES detected (better vision) |
| 154 | +3. **Cache results** to avoid re-analyzing same URLs |
| 155 | +4. **Batch process** multiple URLs to optimize API usage |
| 156 | + |
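The caching advice above could be implemented in many ways. The sketch below is one simple option, assuming results are JSON-serializable dicts like the output format shown earlier. The helper name `cached_scrape`, the `.scrape_cache` directory, and the `scrape_fn` callable are all hypothetical and not part of the scraper.

```python
import hashlib
import json
from pathlib import Path


def cached_scrape(url, scrape_fn, cache_dir=".scrape_cache"):
    """Return a cached result for url, calling scrape_fn(url) only on a miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Hash the URL so it can be used safely as a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    entry = cache_path / f"{key}.json"

    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))

    result = scrape_fn(url)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

In practice `scrape_fn` could be a small lambda that wraps `scrape_with_optimal_strategy` with a fixed prompt, config, and schema. This sketch never expires entries, so delete the cache directory when a page is known to have changed.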
| 157 | +## 🐛 Troubleshooting |
| 158 | + |
| 159 | +### "Not enough speakers extracted" |
| 160 | +- The page may really be PURE_IMAGES but was detected as MIXED_CONTENT |
| 161 | +- Solution: Raise the lower completeness threshold (currently 0.3) so the page falls through to `ScreenshotScraperGraph` |
| 162 | + |
| 163 | +### "Too expensive" |
| 164 | +- Reduce `max_images` in OmniScraperGraph |
| 165 | +- Or force ScreenshotScraperGraph for image-heavy pages |
| 166 | + |
| 167 | +### "Missing some speakers" |
| 168 | +- Increase `max_images` for MIXED_CONTENT sites |
| 169 | +- Or use scroll/wait options in config for lazy-loaded content |
| 170 | + |
| 171 | +## 📚 Related Examples |
| 172 | + |
| 173 | +- `examples/frontend/batch_speaker_app.py` - Streamlit UI with manual strategy selection |
| 174 | +- `examples/smart_scraper_graph/` - Text-only extraction examples |
| 175 | +- `examples/omni_scraper_graph/` - Image+text extraction examples |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +**Key Advantage**: Automatically balances cost vs accuracy without manual intervention! 🎉 |