
Scraping Guide

Skill Seekers v3.1.4. Complete guide to all scraping options.


Overview

Skill Seekers can extract knowledge from four types of sources:

| Source        | Command           | Best For                      |
|---------------|-------------------|-------------------------------|
| Documentation | create <url>      | Web docs, tutorials, API refs |
| GitHub        | create <repo>     | Source code, issues, releases |
| PDF           | create <file.pdf> | Manuals, papers, reports      |
| Local         | create <./path>   | Your projects, internal code  |

Documentation Scraping

Basic Usage

# Auto-detect and scrape
skill-seekers create https://docs.react.dev/

# With custom name
skill-seekers create https://docs.react.dev/ --name react-docs

# With description
skill-seekers create https://docs.react.dev/ \
  --description "React JavaScript library documentation"

Using Preset Configs

# List available presets
skill-seekers estimate --all

# Use preset
skill-seekers create --config react
skill-seekers create --config django
skill-seekers create --config fastapi

Available presets: see the configs/ directory in the repository.
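
A quick way to see what is available, assuming you are working from a checkout of the repository:

```shell
# List bundled preset configs; falls back to a message outside a checkout
ls configs/*.json 2>/dev/null || echo "no configs/ directory in the current checkout"
```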

Custom Configuration

All configs must use the unified format with a sources array (since v2.11.0):

# Create config file
cat > configs/my-docs.json << 'EOF'
{
  "name": "my-framework",
  "description": "My framework documentation",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.example.com/",
      "max_pages": 200,
      "rate_limit": 0.5,
      "selectors": {
        "main_content": "article",
        "title": "h1"
      },
      "url_patterns": {
        "include": ["/docs/", "/api/"],
        "exclude": ["/blog/", "/search"]
      }
    }
  ]
}
EOF

# Use config
skill-seekers create --config configs/my-docs.json

Note: Omit main_content from selectors to let Skill Seekers auto-detect the best content element (main, article, div[role="main"], etc.).
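
For example, a minimal source entry that relies on auto-detection might look like this (the URL is a placeholder):

```json
{
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.example.com/",
      "selectors": {
        "title": "h1"
      }
    }
  ]
}
```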

See Config Format for all options.

Advanced Options

# Limit pages (for testing)
skill-seekers create <url> --max-pages 50

# Adjust rate limit
skill-seekers create <url> --rate-limit 1.0

# Parallel workers (faster)
skill-seekers create <url> --workers 5 --async

# Dry run (preview)
skill-seekers create <url> --dry-run

# Resume interrupted
skill-seekers create <url> --resume

# Fresh start (ignore cache)
skill-seekers create <url> --fresh

GitHub Repository Scraping

Basic Usage

# By repo name
skill-seekers create facebook/react

# With explicit flag
skill-seekers github --repo facebook/react

# With custom name
skill-seekers github --repo facebook/react --name react-source

With GitHub Token

# Set token for higher rate limits
export GITHUB_TOKEN=ghp_...

# Use token
skill-seekers github --repo facebook/react

Benefits of token:

  • 5000 requests/hour vs 60
  • Access to private repos
  • Higher GraphQL limits
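
Before a large scrape, it can be worth confirming the token is actually exported; a minimal sketch:

```shell
# Warn early if GITHUB_TOKEN is missing; unauthenticated calls are capped at 60/hour
if [ -z "${GITHUB_TOKEN:-}" ]; then
  echo "GITHUB_TOKEN is not set; you will be limited to 60 requests/hour" >&2
else
  echo "GITHUB_TOKEN is set (${#GITHUB_TOKEN} characters)"
fi
```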

What Gets Extracted

| Data        | Fetched by Default | Flag to Disable |
|-------------|--------------------|-----------------|
| Source code | ✓                  | --scrape-only   |
| README      | ✓                  | -               |
| Issues      | ✓                  | --no-issues     |
| Releases    | ✓                  | --no-releases   |
| Changelog   | ✓                  | --no-changelog  |

Control What to Fetch

# Skip issues (faster)
skill-seekers github --repo facebook/react --no-issues

# Limit issues
skill-seekers github --repo facebook/react --max-issues 50

# Scrape only (no build)
skill-seekers github --repo facebook/react --scrape-only

# Non-interactive (CI/CD)
skill-seekers github --repo facebook/react --non-interactive
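
The unified config format shown earlier can presumably describe a GitHub source as well. The field names below mirror the CLI flags and are assumptions, not confirmed config keys; check the Config Format reference before relying on them:

```json
{
  "name": "react-source",
  "sources": [
    {
      "type": "github",
      "repo": "facebook/react",
      "max_issues": 50
    }
  ]
}
```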

PDF Extraction

Basic Usage

# Direct file
skill-seekers create manual.pdf --name product-manual

# With explicit command
skill-seekers pdf --pdf manual.pdf --name docs

OCR for Scanned PDFs

# Enable OCR
skill-seekers pdf --pdf scanned.pdf --enable-ocr

Requirements:

pip install skill-seekers[pdf-ocr]
# Also requires: tesseract-ocr (system package)
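
Since OCR depends on a system binary, a quick check before running may save a failed extraction; a minimal sketch:

```shell
# Verify the tesseract binary is on PATH before enabling OCR
if command -v tesseract >/dev/null 2>&1; then
  tesseract --version 2>&1 | head -n 1
else
  echo "tesseract not found; install the tesseract-ocr system package first"
fi
```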

Password-Protected PDFs

# In config file
{
  "name": "secure-docs",
  "pdf_path": "protected.pdf",
  "password": "secret123"
}

Page Range

# Extract specific pages (via config)
{
  "pdf_path": "manual.pdf",
  "page_range": [1, 100]
}
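
The password and page-range options can be combined in a single config:

```json
{
  "name": "secure-manual",
  "pdf_path": "protected.pdf",
  "password": "secret123",
  "page_range": [1, 100]
}
```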

Local Codebase Analysis

Basic Usage

# Current directory
skill-seekers create .

# Specific directory
skill-seekers create ./my-project

# With explicit command
skill-seekers analyze --directory ./my-project

Analysis Presets

# Quick analysis (1-2 min)
skill-seekers analyze --directory ./my-project --preset quick

# Standard analysis (5-10 min) - default
skill-seekers analyze --directory ./my-project --preset standard

# Comprehensive (20-60 min)
skill-seekers analyze --directory ./my-project --preset comprehensive

What Gets Analyzed

| Feature         | Quick | Standard | Comprehensive |
|-----------------|-------|----------|---------------|
| Code structure  | ✓     | ✓        | ✓             |
| API extraction  | ✓     | ✓        | ✓             |
| Comments        | -     | ✓        | ✓             |
| Patterns        | -     | ✓        | ✓             |
| Test examples   | -     | -        | ✓             |
| How-to guides   | -     | -        | ✓             |
| Config patterns | -     | -        | ✓             |

Language Filtering

# Specific languages
skill-seekers analyze --directory ./my-project \
  --languages Python,JavaScript

# File patterns
skill-seekers analyze --directory ./my-project \
  --file-patterns "*.py,*.js"
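
If you are unsure what a pattern list like *.py,*.js will cover, you can preview the matching files first. This sketch assumes glob-style matching against file names, which is the usual convention for such flags:

```shell
# Build a throwaway tree and show which files the glob patterns would match
dir=$(mktemp -d)
touch "$dir/app.py" "$dir/main.js" "$dir/notes.txt"
find "$dir" -type f \( -name '*.py' -o -name '*.js' \) | sort
rm -rf "$dir"
```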

Skip Features

# Skip heavy features
skill-seekers analyze --directory ./my-project \
  --skip-dependency-graph \
  --skip-patterns \
  --skip-test-examples

Common Scraping Patterns

Pattern 1: Test First

# Dry run to preview
skill-seekers create <source> --dry-run

# Small test scrape
skill-seekers create <source> --max-pages 10

# Full scrape
skill-seekers create <source>

Pattern 2: Iterative Development

# Scrape without enhancement (fast)
skill-seekers create <source> --enhance-level 0

# Review output
ls output/my-skill/
cat output/my-skill/SKILL.md

# Enhance later
skill-seekers enhance output/my-skill/

Pattern 3: Parallel Processing

# Fast async scraping
skill-seekers create <url> --async --workers 5

# Even faster (be careful with rate limits)
skill-seekers create <url> --async --workers 10 --rate-limit 0.2

Pattern 4: Resume Capability

# Start scraping
skill-seekers create <source>
# ...interrupted...

# Resume later
skill-seekers resume --list
skill-seekers resume <job-id>

Troubleshooting Scraping

"No content extracted"

Problem: Wrong CSS selectors

Solution:

# First, try without a main_content selector (auto-detection)
# The scraper tries: main, div[role="main"], article, .content, etc.
skill-seekers create <url> --dry-run

# If auto-detection fails, find the correct selector:
curl -s <url> | grep -i 'article\|main\|content'

# Then specify it in your config's source:
{
  "sources": [{
    "type": "documentation",
    "base_url": "https://...",
    "selectors": {
      "main_content": "div.content"
    }
  }]
}

"Rate limit exceeded"

Problem: Too many requests

Solution:

# Slow down
skill-seekers create <url> --rate-limit 2.0

# Or use GitHub token for GitHub repos
export GITHUB_TOKEN=ghp_...

"Too many pages"

Problem: Site is larger than expected

Solution:

# Estimate first
skill-seekers estimate configs/my-config.json

# Limit pages
skill-seekers create <url> --max-pages 100

# Adjust URL patterns
{
  "url_patterns": {
    "exclude": ["/blog/", "/archive/", "/search"]
  }
}

"Memory error"

Problem: Site too large for memory

Solution:

# Use streaming mode
skill-seekers create <url> --streaming

# Or smaller chunks
skill-seekers create <url> --chunk-tokens 500

Performance Tips

| Tip              | Command             | Impact               |
|------------------|---------------------|----------------------|
| Use presets      | --config react      | Faster setup         |
| Async mode       | --async --workers 5 | 3-5x faster          |
| Skip enhancement | --enhance-level 0   | Saves ~60 sec        |
| Use cache        | --skip-scrape       | Instant rebuild      |
| Resume           | --resume            | Continue interrupted |

Next Steps