- Installation
- Setup Your API Key
- Your First Scrape (Demo)
- Scrape Any URL (No Config Needed)
- Create a Config for a New Website
- Site Config Explained
- Understanding the Excel Output
- Command Reference
- Advanced Options
- Global Configuration
- Troubleshooting
Quick install (one-liner):

```bash
curl -fsSL https://raw.githubusercontent.com/dariusX88/scrapeClaw/main/install.sh | bash
```

Or install from source:

```bash
git clone https://github.com/dariusX88/scrapeClaw.git
cd scrapeClaw
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

After installing, activate the virtual environment whenever you use ScrapeClaw:

```bash
cd scrapeClaw
source .venv/bin/activate
```

Or run commands directly without activating:

```bash
.venv/bin/scrapeclaw <command>
```

ScrapeClaw uses Claude AI to intelligently extract data from web pages. You need an Anthropic API key.
- Get your key from console.anthropic.com
- Copy the example env file and add your key:

  ```bash
  cp .env.example .env
  ```

- Open `.env` in any text editor and replace the placeholder:

  ```
  ANTHROPIC_API_KEY=sk-ant-api03-your-actual-key-here
  ```
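If you want to sanity-check the key from Python, a tiny sketch (a hypothetical `check_api_key` helper; ScrapeClaw itself reads `ANTHROPIC_API_KEY` from the environment for you):

```python
import os

def check_api_key(env=os.environ):
    """Return True if the env mapping holds a plausibly formatted key."""
    key = env.get("ANTHROPIC_API_KEY", "")
    return key.startswith("sk-ant-")

print(check_api_key({"ANTHROPIC_API_KEY": "sk-ant-api03-xyz"}))  # True
print(check_api_key({}))                                         # False
```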
ScrapeClaw ships with a demo config that scrapes books.toscrape.com, a safe practice site.

```bash
scrapeclaw scrape example --max-pages 3
```

This will:

- Crawl up to 3 pages of the book catalog
- Send each page to Claude AI to extract: title, price, rating, availability, and URL
- Generate a styled Excel report in `./output/`
You should see output like:

```
╭─ ScrapeClaw — Enterprise Web Scraper ─╮
│ AI-powered extraction with beautiful  │
│ Excel output                          │
╰───────────────────────────────────────╯
Target: https://books.toscrape.com/catalogue/page-1.html
Max pages: 3 | Max entries: 50
Crawling... 5/3 pages (BFS may find extras)
Extracting with Claude... 5/5 pages
Extraction complete: 20 records from 5 pages, 12,345 tokens used
Done! Excel report: output/example_20260227_140000.xlsx
```
If you just want to see what gets crawled without using Claude (free, no API cost):

```bash
scrapeclaw scrape example --max-pages 3 --dry-run
```

For quick one-off scrapes, use `scrape-url` — no config file required:

```bash
# Extract specific fields from a page
scrapeclaw scrape-url https://example.com/products -f "title,price,description"

# Scrape without Claude (just grabs title/description metadata)
scrapeclaw scrape-url https://example.com --no-claude
```

The `-f` / `--fields` flag tells Claude what to extract. Use comma-separated field names that describe the data you want. Claude is smart enough to understand what you mean:

```bash
# E-commerce product page
scrapeclaw scrape-url https://shop.example.com/item -f "name,price,rating,reviews_count"

# Job listing
scrapeclaw scrape-url https://jobs.example.com/posting -f "job_title,company,salary,location,requirements"

# Real estate listing
scrapeclaw scrape-url https://realty.example.com/house -f "address,price,bedrooms,bathrooms,square_feet"
```

For websites you want to scrape regularly or need to crawl multiple pages, create a site config:

```bash
scrapeclaw init-config mysite https://example.com/listings
```

This creates `config/sites/mysite.yaml`.
Open `config/sites/mysite.yaml` and customize:

```yaml
name: mysite
base_url: https://example.com
start_url: https://example.com/listings
allowed_domains:
  - example.com
max_pages: 30
max_entries: 200
rate_limit:
  delay_sec: 2.0
extraction:
  fields:
    title: "Product or listing title"
    price: "Price as a number, no currency symbol"
    description: "Short description, max 200 chars"
    category: "Product category"
    url: "Direct link to the item"
```

The key part is `extraction.fields` — each entry is:

- Key: the column name that appears in your Excel output
- Value: a plain-English description telling Claude what to look for

Then run the scrape and check your results:

```bash
scrapeclaw scrape mysite
scrapeclaw list-results
```

Here's what each field in a site config does:
```yaml
# Identification
name: mysite                    # Name used in CLI commands
base_url: https://example.com   # Base URL for resolving relative links

# Crawling
start_url: https://example.com/page-1   # Where the crawler starts
allowed_domains:                # Only follow links to these domains
  - example.com
max_pages: 30                   # Stop crawling after this many pages
max_entries: 200                # Stop extracting after this many items
js_required: false              # Set true for JavaScript-heavy sites (needs Playwright)

# Politeness
rate_limit:
  delay_sec: 2.0                # Wait N seconds between requests (be respectful)

# Optional: CSS selectors (for non-Claude extraction)
css_selectors:
  title: h3 a
  price: .price_color

# Optional: extra headers
custom_headers:
  Accept-Language: en-US,en;q=0.9

# Claude AI extraction
extraction:
  fields:
    title: "The item's title"   # What to extract and how
    price: "Price as number, no currency"
    url: "Direct link to the item"
  max_tokens: 2048              # Max tokens for Claude's response (increase for complex pages)
```

ScrapeClaw uses breadth-first search (BFS):
- Starts at `start_url`
- Fetches the page, extracts all links
- Filters links to `allowed_domains` only
- Adds new links to the queue (skips already-visited URLs)
- Repeats until `max_pages` is reached or there are no more links to follow
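The crawl loop above can be sketched in a few lines of Python. This is an illustrative simplification rather than ScrapeClaw's actual implementation; `crawl_bfs`, `get_links`, and the toy link graph are all hypothetical:

```python
from collections import deque
from urllib.parse import urlparse

def crawl_bfs(start_url, allowed_domains, max_pages, get_links):
    """Breadth-first crawl: visit pages level by level, filtering
    links to allowed domains and skipping already-seen URLs."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)              # "fetch" the page
        for link in get_links(url):      # extract its links
            if urlparse(link).netloc not in allowed_domains:
                continue                 # outside allowed_domains
            if link not in seen:         # skip already-visited URLs
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real HTTP fetches
graph = {
    "https://example.com/page-1": ["https://example.com/page-2",
                                   "https://other.com/x"],
    "https://example.com/page-2": ["https://example.com/page-3"],
    "https://example.com/page-3": [],
}
pages = crawl_bfs("https://example.com/page-1", {"example.com"}, 30,
                  lambda u: graph.get(u, []))
print(pages)  # all three example.com pages, in BFS order
```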
For each crawled page, ScrapeClaw sends the page text to Claude with your field descriptions. Claude is smart about page types:
- Listing pages (product grids, search results): Claude returns multiple items per page
- Detail pages (single product, single article): Claude returns one item
You don't need to tell it which type — it figures it out automatically.
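One consequence is that downstream code sees two shapes: a list of records from a listing page, or a single record from a detail page. A minimal sketch of the flattening step (a hypothetical `normalize` helper, not ScrapeClaw's internal code):

```python
def normalize(extracted):
    """Extraction may yield a list (listing page) or a single dict
    (detail page); flatten both cases to a list of records."""
    return extracted if isinstance(extracted, list) else [extracted]

listing = [{"title": "Book A", "price": 12.5},
           {"title": "Book B", "price": 9.99}]
detail = {"title": "Book C", "price": 20.0}

records = normalize(listing) + normalize(detail)
print(len(records))  # 3
```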
Each scrape generates an Excel file with 3 worksheets:

**Summary**

- Run date and configuration used
- Total records extracted
- Success rate (% of pages that yielded data)
- Total Claude API tokens consumed
- Key run parameters

**Data**

- One row per extracted item
- Styled headers with autofilter (click headers to sort/filter)
- Alternating row colors for readability
- Clickable URL hyperlinks
- Error rows highlighted in red
- Frozen header row (stays visible while scrolling)

**Stats**

- Category distribution bar chart (if items have categories)
- Field completeness stats showing which fields were found most often

Output files are saved to `./output/` by default.
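To illustrate what the field-completeness stats measure, here is a small standalone sketch (my own `field_completeness` helper, not ScrapeClaw's code) computing, per column, the fraction of rows with a non-empty value:

```python
def field_completeness(records):
    """Fraction of records in which each field has a non-empty value.
    (Treats None, empty string, and 0 alike as missing; fine for a sketch.)"""
    fields = sorted({key for rec in records for key in rec})
    total = len(records)
    return {f: sum(1 for rec in records if rec.get(f)) / total
            for f in fields}

rows = [
    {"title": "A", "price": 12.5, "category": "Fiction"},
    {"title": "B", "price": None, "category": "Travel"},
    {"title": "C", "price": 9.99, "category": ""},
]
for field, share in field_completeness(rows).items():
    print(f"{field}: {share:.0%}")
# category: 67%, price: 67%, title: 100%
```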
Crawl and extract data from a configured site.

```bash
scrapeclaw scrape example                       # Use all defaults from config
scrapeclaw scrape example --max-pages 5         # Override max pages
scrapeclaw scrape example --max-entries 100     # Override max entries
scrapeclaw scrape example --dry-run             # Crawl only, no Claude, no Excel
scrapeclaw scrape example --no-claude           # Crawl + Excel, but skip Claude AI
scrapeclaw scrape example --json                # Also save data as JSON
scrapeclaw scrape example --playwright          # Use browser rendering (JS sites)
scrapeclaw scrape example -o report.xlsx        # Custom output path
scrapeclaw scrape example -u https://other.com/page   # Override start URL
```

| Flag | Short | Description |
|---|---|---|
| `--url` | `-u` | Override the start URL from the config |
| `--max-pages` | `-p` | Max pages to crawl |
| `--max-entries` | `-n` | Max items to extract |
| `--playwright` | | Use Playwright for JS-heavy sites |
| `--no-claude` | | Skip AI extraction, just grab basic metadata |
| `--output` | `-o` | Custom output file path |
| `--dry-run` | | Crawl only — no extraction, no output file |
| `--json` | | Also save results as a `.json` file |
Scrape a single URL without any config file.

```bash
scrapeclaw scrape-url https://example.com -f "title,price,description"
scrapeclaw scrape-url https://example.com --no-claude
scrapeclaw scrape-url https://example.com -f "name,email" -o contacts.xlsx
```

| Flag | Short | Description |
|---|---|---|
| `--fields` | `-f` | Comma-separated fields to extract |
| `--no-claude` | | Skip AI, just grab page metadata |
| `--output` | `-o` | Custom output file path |
Generate a new site config template.

```bash
scrapeclaw init-config shopify https://mystore.com/collections/all
```

Show all available site configurations.

```bash
scrapeclaw list-configs
```

Show recent output files with size and date.

```bash
scrapeclaw list-results         # Show last 20 files
scrapeclaw list-results -n 50   # Show last 50 files
```

Some websites load content with JavaScript (React, Angular, SPAs). The default HTTP fetcher won't see this content. Use Playwright:

```bash
# Install Playwright support
pip install playwright
playwright install chromium

# Scrape with browser rendering
scrapeclaw scrape mysite --playwright
```

Or set `js_required: true` in your site config to always use Playwright for that site.
Edit `config/config.yaml`:

```yaml
scraper:
  proxy:
    enabled: true
    urls:
      - http://proxy1.example.com:8080
      - http://proxy2.example.com:8080
```

Be respectful to websites. Adjust the delay between requests:

```yaml
# In your site config
rate_limit:
  delay_sec: 3.0   # Wait 3 seconds between requests
```

Or in the global config to set a default for all sites:

```yaml
# config/config.yaml
scraper:
  rate_limit:
    delay_sec: 2.0
    max_retries: 3
```

Get results as JSON alongside Excel:

```bash
scrapeclaw scrape example --json
```

This creates both `output/example_20260227.xlsx` and `output/example_20260227.json`.
Edit `config/config.yaml`:

```yaml
claude:
  model: claude-haiku-4-5-20251001    # Faster and cheaper
  # model: claude-sonnet-4-20250514   # Default — best balance
```

The file `config/config.yaml` controls defaults for all scrapes:
```yaml
scraper:
  rate_limit:
    delay_sec: 1.5        # Seconds between requests
    max_retries: 3        # Retry failed requests this many times
    retry_wait_min: 1.0   # Min wait between retries
    retry_wait_max: 5.0   # Max wait between retries
  proxy:
    enabled: false
    urls: []

claude:
  model: claude-sonnet-4-20250514
  max_tokens: 4096

output:
  dir: ./output
  excel_theme:
    header_color: "1A3A5C"        # Dark blue headers
    header_font_color: "FFFFFF"   # White header text
    alt_row_color: "EBF2FA"       # Light blue alternating rows
    accent_color: "2980B9"        # Blue accents
    border_color: "B0C4D8"        # Light border

logging:
  level: INFO   # DEBUG for verbose output
```

These env vars (set in `.env` or your shell) override config file values:
| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Required. Your Claude API key |
| `SCRAPECLAW_OUTPUT_DIR` | Override output directory |
| `SCRAPECLAW_LOG_LEVEL` | Override log level (DEBUG, INFO, WARNING, ERROR) |
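The override rule is simple: if the environment variable is set, it wins; otherwise the config-file value applies. A minimal sketch, with a hypothetical `setting` helper and config keys assumed from the file above:

```python
import os

config = {"output": {"dir": "./output"}, "logging": {"level": "INFO"}}

def setting(env_var, config_value):
    """Environment variable takes precedence over config.yaml when set."""
    return os.environ.get(env_var, config_value)

os.environ.pop("SCRAPECLAW_OUTPUT_DIR", None)     # not set in this demo
os.environ["SCRAPECLAW_LOG_LEVEL"] = "DEBUG"      # e.g. loaded from .env

print(setting("SCRAPECLAW_OUTPUT_DIR", config["output"]["dir"]))   # ./output
print(setting("SCRAPECLAW_LOG_LEVEL", config["logging"]["level"])) # DEBUG
```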
Make sure your `.env` file exists and contains your key:

```bash
cp .env.example .env
nano .env   # or any text editor
```

Add: `ANTHROPIC_API_KEY=sk-ant-api03-your-key-here`

- The page might load content via JavaScript. Try `--playwright`.
- The page might be very large. Claude works with the first ~12,000 characters of text.
- Try being more specific in your field descriptions: instead of `"price"`, use `"Price as a number without currency symbol"`.
Your system Python is managed by Homebrew/apt. Use the virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

- Increase `rate_limit.delay_sec` to be more polite (3-5 seconds).
- Some sites block automated requests. Try adding custom headers in your site config.
- Use `--playwright` for a more browser-like request.
Make sure openpyxl is installed (it should be if you ran `pip install -e .`). The output is `.xlsx` format, compatible with Excel, Google Sheets, and Numbers.

Check the `logs/` directory for detailed logs:

```bash
ls logs/
cat logs/scrapeclaw_20260227.log
```

Set `SCRAPECLAW_LOG_LEVEL=DEBUG` in `.env` for verbose output.