A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.
- JavaScript-enabled web scraping with Playwright and anti-detection measures
- Document processing with fallback support for various formats
- Concurrent batch processing with configurable limits
- Intelligent caching with SQLite backend
- Rate limiting per domain
- Comprehensive error handling with retry logic
- Easy installation via
uvxor local development setup - Health monitoring and metrics collection
{
"mcpServers": {
"freecrawl": {
"command": "uvx",
"args": ["freecrawl-mcp"],
}
}
}The easiest way to use FreeCrawl is with uvx, which automatically manages dependencies:
# Install browsers on first run
uvx freecrawl-mcp --install-browsers
# Test functionality
uvx freecrawl-mcp --testFor local development or customization:
-
Clone from GitHub:
git clone https://github.com/dylan-gluck/freecrawl-mcp.git cd freecrawl-mcp -
Set up environment:
# Sync dependencies uv sync # Install browser dependencies uv run freecrawl-mcp --install-browsers # Run tests uv run freecrawl-mcp --test
-
Run the server:
uv run freecrawl-mcp
Configure FreeCrawl using environment variables:
# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio
# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true
# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3
# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912 # 512MB
# Rate limiting
export FREECRAWL_RATE_LIMIT=60 # requests per minute
# Logging
export FREECRAWL_LOG_LEVEL=INFO# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3
# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1
# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=trueFreeCrawl provides the following MCP tools:
Scrape content from a single URL with advanced options.
Parameters:
url(string): URL to scrapeformats(array): Output formats -["markdown", "html", "text", "screenshot", "structured"]javascript(boolean): Enable JavaScript execution (default: true)wait_for(string, optional): CSS selector or time (ms) to waitanti_bot(boolean): Enable anti-detection measures (default: true)headers(object, optional): Custom HTTP headerscookies(object, optional): Custom cookiescache(boolean): Use cached results if available (default: true)timeout(number): Total timeout in milliseconds (default: 30000)
Example:
{
"name": "freecrawl_scrape",
"arguments": {
"url": "https://example.com",
"formats": ["markdown", "screenshot"],
"javascript": true,
"wait_for": "2000"
}
}Scrape multiple URLs concurrently.
Parameters:
urls(array): List of URLs to scrape (max 100)concurrency(number): Maximum concurrent requests (default: 5)formats(array): Output formats (default:["markdown"])common_options(object, optional): Options applied to all URLscontinue_on_error(boolean): Continue if individual URLs fail (default: true)
Example:
{
"name": "freecrawl_batch_scrape",
"arguments": {
"urls": [
"https://example.com/page1",
"https://example.com/page2"
],
"concurrency": 3,
"formats": ["markdown", "text"]
}
}Extract structured data using schema-driven approach.
Parameters:
url(string): URL to extract data fromschema(object): JSON Schema or Pydantic model definitionprompt(string, optional): Custom extraction instructionsvalidation(boolean): Validate against schema (default: true)multiple(boolean): Extract multiple matching items (default: false)
Example:
{
"name": "freecrawl_extract",
"arguments": {
"url": "https://example.com/product",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}
}Process documents (PDF, DOCX, etc.) with OCR support.
Parameters:
file_path(string, optional): Path to document fileurl(string, optional): URL to download document fromstrategy(string): Processing strategy -"fast","hi_res","ocr_only"(default: "hi_res")formats(array): Output formats -["markdown", "structured", "text"]languages(array, optional): OCR languages (e.g.,["eng", "fra"])extract_images(boolean): Extract embedded images (default: false)extract_tables(boolean): Extract and structure tables (default: true)
Example:
{
"name": "freecrawl_process_document",
"arguments": {
"url": "https://example.com/document.pdf",
"strategy": "hi_res",
"formats": ["markdown", "structured"]
}
}Get server health status and metrics.
Example:
{
"name": "freecrawl_health_check",
"arguments": {}
}Add FreeCrawl to your MCP configuration:
Using uvx (Recommended):
{
"mcpServers": {
"freecrawl": {
"command": "uvx",
"args": ["freecrawl-mcp"]
}
}
}Using local development setup:
{
"mcpServers": {
"freecrawl": {
"command": "uv",
"args": ["run", "freecrawl-mcp"],
"cwd": "/path/to/freecrawl-mcp"
}
}
}Please scrape the content from https://example.com and extract the main article text in markdown format.
Claude Code will automatically use the freecrawl_scrape tool to fetch and process the content.
- Memory: ~100MB base + ~50MB per browser instance
- CPU: Moderate usage during active scraping
- Storage: Cache grows based on configured limits
- Single requests: 2-5 seconds typical response time
- Batch processing: 10-50 concurrent requests depending on configuration
- Cache hit ratio: 30%+ for repeated content
- Enable caching for frequently accessed content
- Adjust concurrency based on target site rate limits
- Use appropriate formats - markdown is faster than screenshots
- Configure rate limiting to avoid being blocked
- Rotating user agents
- Realistic browser fingerprints
- Request timing randomization
- JavaScript execution in sandboxed environment
- URL format validation
- Private IP blocking
- Domain blocklist support
- Request size limits
- Memory usage monitoring
- Browser pool size limits
- Request timeout enforcement
- Rate limiting per domain
| Issue | Possible Cause | Solution |
|---|---|---|
| High memory usage | Too many browser instances | Reduce FREECRAWL_MAX_BROWSERS |
| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |
| Bot detection | Missing anti-detection | Ensure FREECRAWL_ANTI_DETECT=true |
| Cache misses | TTL too short | Increase FREECRAWL_CACHE_TTL |
| Import errors | Missing dependencies | Run uvx freecrawl-mcp --test |
With uvx:
export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --testLocal development:
export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test- Browser pool status
- Memory and CPU usage
- Cache hit rates
- Request success rates
- Response times
FreeCrawl provides structured logging with configurable levels:
- ERROR: Critical failures
- WARNING: Recoverable issues
- INFO: General operations
- DEBUG: Detailed troubleshooting
With uvx:
# Basic functionality test
uvx freecrawl-mcp --testLocal development:
# Basic functionality test
uv run freecrawl-mcp --test- Core server:
FreeCrawlServerclass - Browser management:
BrowserPoolfor resource pooling - Content extraction:
ContentExtractorwith multiple strategies - Caching:
CacheManagerwith SQLite backend - Rate limiting:
RateLimiterwith token bucket algorithm
This project is licensed under the MIT License - see the technical specification for details.
- Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
- Create a feature branch
- Set up local development:
uv sync - Run tests:
uv run freecrawl-mcp --test - Submit a pull request
For detailed technical information, see ai_docs/FREECRAWL_TECHNICAL_SPEC.md.
FreeCrawl MCP Server - Self-hosted web scraping for the modern web π