LLM Scraper Python is a Python library that allows you to extract structured data from any webpage using LLMs. This is a Python port of the popular TypeScript LLM Scraper.
> **Important**
> This is a Python implementation of the original TypeScript LLM Scraper library, providing the same powerful functionality with Python-native APIs and both sync/async support.

> **Tip**
> Under the hood, it uses structured output generation to convert pages to structured data. You can find more about this approach here.
- Dual API Support: Both synchronous and asynchronous operations
- OpenAI Integration: Built-in support for OpenAI GPT models with structured outputs
- Extensible: Protocol-based design allows custom LLM providers
- Schema Flexibility: Supports both Pydantic models and JSON Schema
- Type Safety: Full type safety with Python type hints
- Playwright Integration: Built on the robust Playwright framework
- Multiple Formats: 6 content processing modes including image support
- Code Generation: Generate reusable JavaScript extraction code
- Error Handling: Comprehensive validation and error reporting
Supported Content Formats:
- `html` - Pre-processed HTML (cleaned, scripts/styles removed)
- `raw_html` - Raw HTML (no processing)
- `markdown` - HTML converted to markdown
- `text` - Extracted readable text (using Readability.js)
- `image` - Page screenshot for multi-modal models
- `custom` - User-defined extraction function
Make sure to give the original project a star! ⭐️
- Install the package and dependencies:

```bash
pip install llm_scraper_py
```

- Install Playwright browsers:

```bash
playwright install
```

Initialize an LLM and create a scraper instance:

```python
import os
from llm_scraper_py import LLMScraper, OpenAIModel
# Initialize OpenAI model
llm = OpenAIModel(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")  # or pass directly
)
# Create scraper instance
scraper = LLMScraper(llm)
```

Extract top stories from Hacker News using async/await:

```python
import asyncio
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from typing import List
from llm_scraper_py import LLMScraper, OpenAIModel
# Define the data structure using Pydantic
class Story(BaseModel):
    title: str
    points: int
    by: str
    comments_url: str = Field(alias="commentsURL")

class HackerNewsData(BaseModel):
    top: List[Story] = Field(
        max_length=5,
        description="Top 5 stories on Hacker News"
    )

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        await page.goto("https://news.ycombinator.com")
        result = await scraper.arun(page, HackerNewsData, {"format": "html"})

        # Display results
        print("Top Stories:")
        for story in result["data"]["top"]:
            print(f"- {story['title']} ({story['points']} points by {story['by']})")

        await browser.close()

asyncio.run(main())
```

For simpler use cases, use the synchronous API:

```python
from playwright.sync_api import sync_playwright
from pydantic import BaseModel
from typing import Optional
from llm_scraper_py import LLMScraper, OpenAIModel

class ArticleData(BaseModel):
    title: str
    content: str
    author: Optional[str] = None
def main():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o-mini")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        page.goto("https://example-blog.com/article")
        result = scraper.run(page, ArticleData, {"format": "text"})

        print(f"Title: {result['data']['title']}")
        print(f"Author: {result['data']['author']}")

        browser.close()

main()
```

You can use JSON Schema instead of Pydantic models:

```python
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "boolean"},
        "features": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5
        }
    },
    "required": ["title", "price"]
}
# Works with both sync and async
result = await scraper.arun(page, schema, {"format": "html"})
# or
result = scraper.run(page, schema, {"format": "html"})
```

Pydantic provides better type safety and validation:

```python
from pydantic import BaseModel, Field
from typing import List, Optional
class Product(BaseModel):
    title: str
    price: float = Field(gt=0, description="Price must be positive")
    availability: bool
    features: List[str] = Field(max_length=5)
    description: Optional[str] = None

result = await scraper.arun(page, Product)
```

The scraper supports different content processing formats:

```python
# HTML (default) - cleaned HTML with scripts/styles removed
result = await scraper.arun(page, schema, {"format": "html"})
# Raw HTML - unprocessed HTML content
result = await scraper.arun(page, schema, {"format": "raw_html"})
# Markdown - HTML converted to markdown format
result = await scraper.arun(page, schema, {"format": "markdown"})
# Text - extracted readable text using Readability.js
result = await scraper.arun(page, schema, {"format": "text"})
# Image - page screenshot for multi-modal models
result = await scraper.arun(page, schema, {"format": "image"})
# Custom - user-defined extraction function
def extract_custom_data(page):
    return page.locator(".main-content").inner_text()

result = await scraper.arun(page, schema, {
    "format": "custom",
    "formatFunction": extract_custom_data
})
```

Generate reusable JavaScript code for data extraction:

```python
from pydantic import BaseModel
class ProductInfo(BaseModel):
    name: str
    price: float
    rating: float
# Generate extraction code (async)
result = await scraper.agenerate(page, ProductInfo)
generated_code = result["code"]
# Or synchronous
result = scraper.generate(page, ProductInfo)
generated_code = result["code"]
# Execute the generated code on any similar page
extracted_data = await page.evaluate(generated_code)
# Validate and use the data
product = ProductInfo.model_validate(extracted_data)
print(f"Product: {product.name}, Price: ${product.price}")The generated code is a self-contained JavaScript function that can be reused across similar pages without additional LLM calls.
Customize the LLM behavior with detailed options:

```python
from llm_scraper_py import ScraperLLMOptions
options = ScraperLLMOptions(
    format="html",
    prompt="Extract the data carefully and accurately",
    temperature=0.1,  # Lower = more deterministic
    maxTokens=2000,   # Response length limit
    topP=0.9,         # Nucleus sampling
    mode="json"       # Response format hint
)

result = await scraper.arun(page, schema, options)
```

For code generation, use specialized options:

```python
from llm_scraper_py import ScraperGenerateOptions
gen_options = ScraperGenerateOptions(
    format="html",
    prompt="Generate efficient extraction code",
    temperature=0.2
)

result = await scraper.agenerate(page, schema, gen_options)
```

The library provides comprehensive error handling:

```python
from llm_scraper_py import LLMScraper, OpenAIModel
from pydantic import ValidationError
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
try:
    llm = OpenAIModel(model="gpt-4o")
    scraper = LLMScraper(llm)
    result = await scraper.arun(page, schema, {"format": "html"})
except ValidationError as e:
    print(f"Schema validation failed: {e}")
except PlaywrightTimeoutError:
    print("Page load timeout")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
Implement custom LLM providers using the LanguageModel protocol:

```python
from llm_scraper_py import LanguageModel, LLMScraper
from typing import Dict, Any, Optional, AsyncGenerator
from pydantic import BaseModel
class CustomLLMProvider:
    """Example custom LLM provider implementation"""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

    # Sync methods
    def generate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Implement your JSON generation logic
        # Return structured data matching the schema
        pass

    def generate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Implement your text generation logic
        pass

    # Async methods
    async def agenerate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Async version of generate_json
        pass

    async def agenerate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Async version of generate_text
        pass

    # Streaming methods (optional)
    def stream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")

    async def astream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")

# Use your custom provider
custom_llm = CustomLLMProvider(api_key="your-key", base_url="https://api.example.com")
scraper = LLMScraper(custom_llm)
```

Async Methods (Recommended):
- `arun(page, schema, options=None)` - Extract structured data asynchronously
- `agenerate(page, schema, options=None)` - Generate extraction code asynchronously
- `astream(page, schema, options=None)` - Stream partial results (not implemented)
Sync Methods:
- `run(page, schema, options=None)` - Extract structured data synchronously
- `generate(page, schema, options=None)` - Generate extraction code synchronously
- `stream(page, schema, options=None)` - Stream partial results (not implemented)
All extraction methods return a dictionary with:

```python
{
    "data": {...},        # Extracted data matching your schema
    "url": "https://..."  # Source page URL
}
```

Generation methods return:

```python
{
"code": "...", # Generated JavaScript code
"url": "https://..." # Source page URL
}
```
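As a quick sketch of consuming these shapes (assuming the plain-dict returns documented above, and reusing the `Product` model from earlier):

```python
# Extract, then re-validate the returned dict into the Pydantic model
result = scraper.run(page, Product, {"format": "html"})
product = Product.model_validate(result["data"])
print(f"Scraped {product.title} from {result['url']}")
```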
Installation:

```bash
pip install llm_scraper_py
```

Core Dependencies:

- `playwright` - Web automation and browser control
- `pydantic` - Data validation and serialization
- `openai` - OpenAI API client (for built-in OpenAI support)
- `jsonschema` - JSON Schema validation
Use async methods for:
- Better performance with multiple concurrent scraping tasks
- Integration with async web frameworks (FastAPI, aiohttp)
- Non-blocking operations in async applications
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_multiple_pages(urls):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = []
        for url in urls:
            page = await browser.new_page()
            await page.goto(url)
            task = scraper.arun(page, schema)
            tasks.append(task)
        results = await asyncio.gather(*tasks)
        await browser.close()
        return results
```

Use sync methods for:
- Simple scripts and one-off tasks
- Integration with sync codebases
- Learning and prototyping
```python
from playwright.sync_api import sync_playwright

def scrape_single_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        result = scraper.run(page, schema)
        browser.close()
        return result
```

Performance Tips:

- Reuse browser instances when scraping multiple pages
- Use async methods for concurrent operations
- Choose appropriate content formats - `text` is fastest, `image` is slowest
- Set reasonable token limits to control costs and response times
- Use code generation for repeated scraping of similar pages (see the sketch below)
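Combining the first and last tips, a minimal sketch that reuses one browser and one piece of generated extraction code across a batch of similar pages (the URLs are placeholders, and `ProductInfo` is the model from earlier):

```python
from playwright.sync_api import sync_playwright

# Hypothetical batch of structurally similar product pages
urls = ["https://shop.example.com/item/1", "https://shop.example.com/item/2"]

with sync_playwright() as p:
    browser = p.chromium.launch()  # one browser for the whole batch
    page = browser.new_page()

    # One LLM call: generate extraction code from the first page
    page.goto(urls[0])
    generated_code = scraper.generate(page, ProductInfo)["code"]

    # Reuse the browser and the generated code for every page
    results = []
    for url in urls:
        page.goto(url)
        results.append(page.evaluate(generated_code))

    browser.close()
```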
We welcome contributions! This project is a Python port of the original TypeScript LLM Scraper by mishushakov.
Ways to contribute:
- Report bugs and request features via GitHub issues
- Submit pull requests for improvements
- Add support for new LLM providers
- Improve documentation and examples
- Write tests for edge cases
This project follows the same license as the original LLM Scraper project.