AI-Powered Web Content Extraction - Turn any website into structured data using Large Language Models
Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? LLM WebExtract combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.
Instead of writing complex parsing rules for every website, this tool:
- Scrapes webpages using Playwright (handles modern JavaScript sites)
- Feeds content to AI (local via Ollama, or cloud via OpenAI/Anthropic)
- Returns structured data: topics, entities, summaries, key facts, and more
Think of it as having an AI assistant that reads web pages and summarizes them for you.
- Multi-Provider Support: Works with Ollama (local), OpenAI, and Anthropic
- Modern Web Scraping: Handles JavaScript-heavy sites with Playwright
- Pre-built Profiles: Ready configurations for news, research, e-commerce
- Robust Error Handling: Specific exceptions for different failure types
- Batch Processing: Extract from multiple URLs concurrently
- Flexible Configuration: Environment variables, custom prompts, schemas
- Smart Caching: Avoid re-processing the same URLs
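The caching feature listed above can be pictured as a lookup keyed by URL. The following is only a minimal sketch of that idea, not the library's actual cache: `CachedExtractor`, its `ttl_seconds` parameter, and the `extract_fn` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedExtractor:
    """Hypothetical wrapper: memoizes extraction results per URL with a TTL."""

    def __init__(self, extract_fn: Callable[[str], Any], ttl_seconds: float = 3600.0):
        self._extract_fn = extract_fn  # e.g. the extract method of an extractor
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Any]] = {}

    def extract(self, url: str) -> Any:
        now = time.monotonic()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh hit: skip scraping and the LLM call
        result = self._extract_fn(url)
        self._store[url] = (now, result)
        return result


# Stand-in for a real extraction call, so the sketch runs without a network.
calls = []


def fake_extract(url: str) -> dict:
    calls.append(url)
    return {"url": url, "summary": "stub"}


cached = CachedExtractor(fake_extract, ttl_seconds=60)
first = cached.extract("https://example.com")
second = cached.extract("https://example.com")  # served from the cache
```

An in-memory dictionary is the simplest choice here; a persistent store would be needed for results to survive between runs.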
# Basic installation
pip install llm-webextract
playwright install chromium
# With cloud providers
pip install "llm-webextract[openai]" # For GPT models
pip install "llm-webextract[anthropic]" # For Claude models
pip install "llm-webextract[all]" # Everything (quotes keep shells such as zsh from expanding the brackets)
# Command line (requires local Ollama)
llm-webextract extract "https://news.ycombinator.com"
# Test your setup
llm-webextract test
# Python - Local Ollama
import webextract
result = webextract.quick_extract("https://techcrunch.com")
print(f"Summary: {result.get_summary()}")
print(f"Topics: {result.get_topics()}")
# Or use the dedicated Ollama function
result = webextract.extract_with_ollama("https://techcrunch.com", model="llama3.2")
from webextract import WebExtractor, ConfigBuilder, extract_with_ollama
# Using ConfigBuilder
extractor = WebExtractor(
    ConfigBuilder()
    .with_ollama("llama3.2")  # or any model you have
    .build()
)
result = extractor.extract("https://example.com")
# Quick one-liner
result = extract_with_ollama("https://example.com", model="llama3.2")
from webextract import WebExtractor, ConfigBuilder, extract_with_openai
# Quick one-liner
result = extract_with_openai("https://example.com", api_key="sk-...", model="gpt-4o-mini")
# Using ConfigBuilder
extractor = WebExtractor(
    ConfigBuilder()
    .with_openai(api_key="sk-...", model="gpt-4o-mini")
    .build()
)
from webextract import WebExtractor, ConfigBuilder, extract_with_anthropic
# Quick one-liner
result = extract_with_anthropic("https://example.com", api_key="sk-ant-...", model="claude-3-5-sonnet-20241022")
# Using ConfigBuilder
extractor = WebExtractor(
    ConfigBuilder()
    .with_anthropic(api_key="sk-ant-...", model="claude-3-5-sonnet-20241022")
    .build()
)
from webextract import ConfigProfiles, WebExtractor
# Optimized for different content types
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())
Set defaults to avoid repeating configuration:
export WEBEXTRACT_LLM_PROVIDER="openai"
export WEBEXTRACT_MODEL="gpt-4o-mini"
export WEBEXTRACT_API_KEY="sk-your-key"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
The AI analyzes content and returns structured data:
{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": {
    "people": ["Sam Altman", "Satya Nadella"],
    "organizations": ["OpenAI", "Microsoft", "Google"],
    "locations": ["San Francisco", "Silicon Valley"]
  },
  "sentiment": "positive",
  "key_facts": [
    "New model shows 40% improvement in reasoning",
    "Beta testing starts next month",
    "Open source version planned for 2024"
  ],
  "category": "technology",
  "important_dates": ["2024-03-15", "Q2 2024"],
  "statistics": ["40% improvement", "$10B investment"],
  "confidence": 0.89
}
schema = {
    "product_name": "Extract the main product name",
    "price": "Extract the current price",
    "rating": "Extract average rating (number only)",
    "reviews_count": "Extract total number of reviews",
    "key_features": "List main product features"
}
result = extractor.extract_with_custom_schema(
    "https://amazon.com/product/...",
    schema
)
urls = [
    "https://techcrunch.com/article1",
    "https://venturebeat.com/article2",
    "https://theverge.com/article3"
]
results = extractor.extract_batch(urls, max_workers=3)
for result in results:
    if result and result.is_successful:
        print(f"{result.url}: {result.get_summary()}")
from webextract import (
    WebExtractor,
    ExtractionError,
    ScrapingError,
    LLMError,
    AuthenticationError
)
try:
    result = extractor.extract("https://problematic-site.com")
except AuthenticationError:
    print("Invalid API key")
except ScrapingError as e:
    print(f"Failed to scrape website: {e}")
except LLMError as e:
    print(f"AI processing failed: {e}")
except ExtractionError as e:
    print(f"General extraction error: {e}")
config = (ConfigBuilder()
    .with_openai("sk-...", "gpt-4")
    .with_custom_prompt("""
Focus on extracting:
1. Financial metrics and numbers
2. Company performance indicators
3. Market trends and predictions
4. Executive quotes and statements
""")
    .build())
graph LR
A[URL] --> B[Playwright Scraper]
B --> C[Content Cleaning]
C --> D[LLM Processing]
D --> E[Structured Data]
B --> F[JavaScript Handling]
C --> G[Ad/Nav Removal]
D --> H[JSON Validation]
E --> I[Confidence Scoring]
- Modern Web Scraping: Playwright handles JavaScript, SPAs, and modern websites
- Intelligent Content Processing: Removes ads, navigation, focuses on main content
- AI Analysis: Your chosen LLM extracts structured information
- Quality Assurance: Validates output format and calculates confidence scores
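The quality assurance step above (validating the output format and scoring confidence) could look roughly like the sketch below. This is an illustration under stated assumptions, not the library's implementation: the field list is copied from the example output in this README, and `parse_llm_output` and `naive_confidence` are hypothetical helpers whose scoring rule (share of expected fields that are filled) is invented here.

```python
import json
from typing import Any, Dict, Optional

# Field names taken from the example output shown in this README.
EXPECTED_FIELDS = ("summary", "topics", "entities", "sentiment", "key_facts", "category")


def parse_llm_output(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None if the reply is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def naive_confidence(data: Dict[str, Any]) -> float:
    """Hypothetical score: fraction of expected fields that are present and non-empty."""
    filled = sum(1 for field in EXPECTED_FIELDS if data.get(field))
    return round(filled / len(EXPECTED_FIELDS), 2)


raw_reply = '{"summary": "Short text", "topics": ["ai"], "sentiment": "positive"}'
parsed = parse_llm_output(raw_reply)
score = naive_confidence(parsed) if parsed is not None else 0.0
```

Returning `None` for unparseable replies lets the caller decide whether to retry the LLM call or report a failure.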
- Python 3.8+
- One of:
  - Ollama running locally (free, private)
  - OpenAI API key (paid, powerful)
  - Anthropic API key (paid, great reasoning)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.2
# Start the service
ollama serve
- News Monitoring: Extract key information from news articles
- Research: Process academic papers and technical documents
- E-commerce: Monitor product prices, reviews, specifications
- Market Research: Analyze competitor websites and industry trends
- Content Curation: Summarize and categorize web content
- AI Training: Generate structured datasets from web content
# Test connection and model availability
llm-webextract test
# Test with a specific URL
llm-webextract extract "https://example.com" --format pretty
# Check available providers
python -c "
from webextract.core.llm_factory import get_available_providers
import json
print(json.dumps(get_available_providers(), indent=2))
"
We welcome contributions! Here's how to get started:
- Read our Development Guide for commit conventions and processes
- Report bugs by opening an issue with detailed reproduction steps
- Suggest features through GitHub discussions
- Submit PRs following our coding standards
# Fork and clone
git clone https://github.com/HimashaHerath/webextract.git
cd webextract
# Install in development mode
pip install -e ".[dev]"
# Run tests and quality checks
python -m pytest
python -m black --check .
python -m flake8 --config .flake8
"Model not available"
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Pull the model if missing
ollama pull llama3.2
"Connection refused"
- Ensure Ollama is running: ollama serve
- Check firewall settings
- Verify the base URL in configuration
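The same reachability check can be done from Python before starting an extraction. The sketch below uses only the standard library and queries the `/api/tags` endpoint that the `curl` command above hits; `list_ollama_models` is a hypothetical helper, not part of this package, and the default base URL is the one shown in this README.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 3.0) -> Optional[List[str]]:
    """Return model names from /api/tags, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # connection refused, timeout, or a non-JSON reply
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; try `ollama serve` and check the base URL")
else:
    print("Available models:", models)
```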
"Rate limit exceeded"
- Add delays between requests
- Use batch processing with lower concurrency
- Check your API plan limits
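Adding delays between requests is often done with exponential backoff. The following is a generic sketch of that technique, not something this package is known to provide: `RateLimited` is a hypothetical stand-in for whatever exception your provider's client raises, and `with_backoff` is an illustrative helper.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimited, doubling the wait each time and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise ValueError("max_attempts must be at least 1")


# Demo with a call that fails twice, then succeeds; delays are recorded, not slept.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


delays = []
outcome = with_backoff(flaky, sleep=delays.append)
```

In real use, `call` would wrap a single extraction, for example `lambda: extractor.extract(url)`, with the `except` clause adjusted to the error type actually raised.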
"Content too short"
- Site might be blocking scrapers
- Try different user agents
- Check if site requires JavaScript (we handle this)
MIT License - feel free to use this in your projects!
Built with these amazing tools:
- Ollama - Local LLM inference
- Playwright - Modern web scraping
- Beautiful Soup - HTML parsing
- Pydantic - Data validation
- Typer - CLI framework
- Email: himasha626@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Got questions? Open an issue - I'm happy to help! Find this useful? Give it a star - it really helps!