Smart Documentation Scraper for AI Development
Transform any documentation site into AI-ready Markdown files for enhanced coding with LLMs
When coding with Cursor, Claude, or any LLM, you want the most up-to-date documentation at your fingertips. Instead of letting your AI assistant fetch outdated information from the web, Aimdoc gives you:
- Fresh, local documentation that your LLM can reference instantly
- Optimized Markdown designed specifically for AI consumption
- Smart content extraction that focuses on what matters for developers
- Organized file structure that makes sense to both humans and AI
Aimdoc is a pure Python CLI tool that runs completely locally - no server required!
- Intelligent sitemap discovery - Automatically finds and parses sitemaps (a rough sketch follows this feature list)
- Smart content extraction - Uses universal CSS selectors to grab the right content
- Chapter organization - Automatically structures docs into logical sections
- Robust error handling - Handles failed pages gracefully with detailed diagnostics
- Interactive setup - Just run `aimdoc scrape` and follow the prompts
- Live progress tracking with elegant progress bars and spinners powered by Rich
- Smart defaults - Automatically detects project names and creates organized folders
- 100% local execution - No API server, no network dependencies beyond scraping
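To make the sitemap discovery idea concrete, here is a rough, standard-library-only sketch of the general approach: look for a `Sitemap:` entry in `robots.txt`, then fall back to the conventional `/sitemap.xml` location. This illustrates the concept only and is not Aimdoc's actual code.

```python
# Illustrative sketch of sitemap discovery (not Aimdoc's implementation).
from urllib.parse import urljoin
from urllib.request import urlopen
import xml.etree.ElementTree as ET

def discover_sitemap_urls(base_url):
    """Return page URLs listed in the site's sitemap, if one can be found."""
    candidates = []
    try:
        robots = urlopen(urljoin(base_url, "/robots.txt")).read().decode("utf-8", "ignore")
        # robots.txt may advertise one or more "Sitemap: <url>" lines.
        candidates += [line.split(":", 1)[1].strip()
                       for line in robots.splitlines()
                       if line.lower().startswith("sitemap:")]
    except OSError:
        pass
    candidates.append(urljoin(base_url, "/sitemap.xml"))  # common default location

    for sitemap_url in candidates:
        try:
            tree = ET.fromstring(urlopen(sitemap_url).read())
        except (OSError, ET.ParseError):
            continue
        # Page URLs (and nested sitemap URLs) live in <loc> elements.
        urls = [el.text for el in tree.iter() if el.tag.endswith("loc") and el.text]
        if urls:
            return urls
    return []
```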
- Python 3.8+ with pip
- Basic familiarity with command line
```bash
# Clone and run setup script
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc
./setup.sh
```
```bash
# Install directly from PyPI (when published)
pip install aimdoc

# Or install from source manually
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc
pip install -e .
```
```bash
# Interactive mode - just follow the prompts!
aimdoc scrape

# Or specify everything upfront
aimdoc scrape https://docs.example.com --name "Example Docs" --output-dir ./my-docs
```
That's it! Your documentation will be downloaded as clean, AI-ready Markdown files. No server setup required!
```bash
# Next.js documentation
aimdoc scrape https://nextjs.org/docs

# React documentation
aimdoc scrape https://react.dev

# Tailwind CSS docs
aimdoc scrape https://tailwindcss.com/docs

# FastAPI documentation
aimdoc scrape https://fastapi.tiangolo.com

# Custom project name and output directory
aimdoc scrape https://docs.python.org --name "Python Official" --output-dir ./references

# See all available commands
aimdoc --help

# Check version
aimdoc version
```
Under the hood, the scraping pipeline looks like this:

```mermaid
graph TB
    CLI[CLI Tool] --> Engine[Scrapy Engine]
    Engine --> Discover[Sitemap Discovery]
    Engine --> Extract[Content Extraction]
    Engine --> Convert[Markdown Conversion]
    Engine --> Organize[File Organization]

    Discover --> Robots[robots.txt]
    Discover --> Sitemap[sitemap.xml]
    Extract --> CSS[CSS Selectors]
    Extract --> Clean[HTML Cleaning]
    Convert --> Markdown[Optimized MD]
    Convert --> Structure[Chapter Structure]
```
- AimdocSpider: Intelligent web crawler with sitemap discovery (a simplified sketch follows this list)
- Markdown Pipeline: Converts HTML to clean, LLM-optimized Markdown
- Progress Tracker: Real-time CLI progress with Rich UI components
- CLI Interface: Beautiful command-line experience with Typer
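To give a feel for how the spider side fits together, here is a stripped-down example built on Scrapy's stock `SitemapSpider`. It is only a sketch; the real `AimdocSpider` in `aimdoc/spiders/aimdoc.py` adds discovery fallbacks, filtering, and diagnostics.

```python
# Simplified, illustrative spider (not the actual AimdocSpider).
from scrapy.spiders import SitemapSpider

class DocsSketchSpider(SitemapSpider):
    name = "docs_sketch"
    sitemap_urls = ["https://docs.example.com/sitemap.xml"]  # hypothetical target

    def parse(self, response):
        # Hand the raw page to the item pipelines for cleaning and conversion.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "html": response.text,
        }
```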
- Clean Markdown: Removes navigation, ads, and irrelevant content (illustrated in the sketch after this list)
- Consistent formatting: Standardized headings, code blocks, and links
- Logical structure: Organized into chapters and sections
- README generation: Auto-creates navigation index
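As a minimal sketch of the cleaning-and-conversion step, assuming the Beautiful Soup and Markdownify libraries credited in the acknowledgments (the real pipeline in `aimdoc/pipelines/optimized_html_markdown.py` is more thorough):

```python
# Illustrative cleaning + conversion, not the actual pipeline code.
from bs4 import BeautifulSoup
from markdownify import ATX, markdownify as md

def html_to_clean_markdown(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop page chrome that adds nothing for an LLM: nav, scripts, footers, asides.
    for tag in soup(["nav", "script", "style", "footer", "aside"]):
        tag.decompose()
    # Prefer the main content region when the page exposes one.
    content = soup.find("main") or soup.find("article") or soup.body or soup
    return md(str(content), heading_style=ATX)  # "#"-style headings
```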
- Concurrent scraping: Process multiple pages simultaneously
- Intelligent throttling: Respects rate limits and robots.txt
- HTTP caching: Avoids re-downloading unchanged content (see the settings sketch after this list)
- Error recovery: Continues scraping even when some pages fail
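These behaviours correspond to standard Scrapy settings. The snippet below shows the kind of knobs involved, with illustrative values; the setting names are Scrapy's own, but the exact configuration Aimdoc ships with may differ.

```python
# Example Scrapy settings for politeness and caching (values are illustrative).
ROBOTSTXT_OBEY = True                 # respect robots.txt

AUTOTHROTTLE_ENABLED = True           # back off automatically when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

HTTPCACHE_ENABLED = True              # cache responses so re-runs skip unchanged pages
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24
```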
- Interactive CLI: No need to memorize commands or flags
- Real-time feedback: Beautiful progress bars and spinners with Rich (example after this list)
- Smart defaults: Works great out of the box
- Local execution: No server setup or management required
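As a taste of the kind of feedback Rich makes possible (this is not Aimdoc's actual progress code, which lives in `aimdoc/cli/progress.py`):

```python
# Minimal Rich progress demo with a spinner, a bar, and a page counter.
import time
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn

columns = [
    SpinnerColumn(),
    TextColumn("[bold]{task.description}"),
    BarColumn(),
    TextColumn("{task.completed}/{task.total} pages"),
]
with Progress(*columns) as progress:
    task = progress.add_task("Scraping docs", total=120)  # hypothetical page count
    for _ in range(120):
        time.sleep(0.01)  # stand-in for fetching and converting a page
        progress.advance(task)
```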
- Framework agnostic: Works with any documentation site
- Sitemap discovery: Automatically finds all documentation pages
- Flexible selectors: Adapts to different site structures (sketched after this list)
- Robust parsing: Handles various HTML layouts
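One plausible way to implement flexible selectors is to try a prioritized list of common documentation-content selectors and keep the first match. The selector list below is invented for illustration and is not necessarily what Aimdoc uses.

```python
# Illustrative selector fallback using Scrapy's Selector (parsel).
from scrapy import Selector

CANDIDATE_SELECTORS = ["main", "article", "div[role=main]", "body"]

def pick_content_html(page_html):
    sel = Selector(text=page_html)
    for css in CANDIDATE_SELECTORS:
        match = sel.css(css).get()  # first matching element as HTML, or None
        if match:
            return match
    return page_html  # fall back to the whole page
```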
The project is organized as follows:

```
aimdoc/
├── aimdoc/                             # Main Python package
│   ├── __main__.py                     # CLI entry point
│   ├── cli/                            # CLI components
│   │   ├── commands.py                 # Main commands (scrape, version)
│   │   ├── progress.py                 # Rich-based progress tracking
│   │   └── utils.py                    # CLI utilities
│   ├── spiders/
│   │   └── aimdoc.py                   # Scrapy spider with smart discovery
│   ├── pipelines/
│   │   ├── optimized_html_markdown.py  # HTML → Markdown conversion
│   │   ├── progress_tracker.py         # Progress tracking pipeline
│   │   └── assemble.py                 # File organization
│   ├── settings.py                     # Scrapy configuration
│   └── items.py                        # Scrapy items
├── setup.py                            # Package installation
├── requirements.txt                    # Dependencies
└── README.md                           # This file
```
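The CLI layer in `aimdoc/cli/commands.py` is built with Typer. Below is a minimal, hypothetical sketch of how a command of this shape is typically wired; the real option names, defaults, and prompts may differ.

```python
# Hypothetical Typer wiring, for illustration only.
from typing import Optional
import typer

app = typer.Typer(help="Scrape documentation sites into AI-ready Markdown.")

@app.command()
def scrape(
    url: Optional[str] = typer.Argument(None, help="Documentation root URL (prompted for if omitted)"),
    name: Optional[str] = typer.Option(None, "--name", help="Project name used for the output folder"),
    output_dir: str = typer.Option("./docs", "--output-dir", help="Where to write the Markdown"),
):
    """Run the scraper against URL."""
    typer.echo(f"Scraping {url} into {output_dir} ...")

if __name__ == "__main__":
    app()
```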
You can customize the scraping behavior by modifying `aimdoc/settings.py`:
```python
# Increase concurrency for faster scraping
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Adjust delays
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_START_DELAY = 0.5

# Enable more verbose logging
LOG_LEVEL = 'INFO'
```
```bash
# Run with custom output directory
aimdoc scrape https://docs.example.com --output-dir ~/Documentation

# Specify project name explicitly
aimdoc scrape https://docs.example.com --name "My Project Docs"

# Combine options
aimdoc scrape https://docs.example.com --name "Docs" --output-dir ./references
```
We love contributions! Here's how to get started:
- Python 3.8+ with pip
- Git
```bash
# Fork and clone the repo
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc

# Run setup script (recommended)
./setup.sh

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

# Verify installation
aimdoc version
aimdoc scrape --help
```
```bash
# Test with a simple documentation site
aimdoc scrape https://typer.tiangolo.com --name "Test" --output-dir ./test-output

# Test interactive mode
aimdoc scrape
```
```bash
# Run with verbose logging to debug issues
# Edit aimdoc/settings.py and set LOG_LEVEL = 'DEBUG'
aimdoc scrape https://docs.example.com --name "Debug Test" --output-dir ./debug-output

# Test with various documentation structures
aimdoc scrape https://fastapi.tiangolo.com --name "FastAPI" --output-dir ./test-fastapi
aimdoc scrape https://tailwindcss.com/docs --name "Tailwind" --output-dir ./test-tailwind
```
- Python: Follow PEP 8, use `black` for formatting
- Commit messages: Use conventional commits format
- Test your changes with real documentation sites before submitting
❌ "No sitemap found" or "NO URLS FOUND TO SCRAPE"

```bash
# Some sites don't have sitemaps - this is normal behavior
# The scraper attempts multiple discovery methods automatically
# Check the logs for more details about what URLs were tried
```
❌ "Permission denied"

```bash
# Make sure you have write permissions to the output directory
chmod +w ./docs

# Or choose a different output directory you own
aimdoc scrape https://docs.example.com --output-dir ~/Documents/docs
```
❌ "Command not found: aimdoc"

```bash
# Make sure you installed the package correctly
pip install -e .

# Or run directly with Python
python -m aimdoc --help
```
If you're still stuck, you can:

- Report bugs
- Request features
- Check the source code for more details
On the roadmap:

- Enhanced site discovery - better URL detection algorithms
- Plugin system for custom content extractors
- Multiple output formats (JSON, YAML, etc.)
- Incremental updates - only scrape changed pages
- Batch processing for multiple documentation sites
- Integration with AI coding assistants (direct Claude/Cursor plugins)
This project is licensed under the MIT License - see the LICENSE file for details.
- Scrapy - The powerful and flexible web scraping framework that powers our engine
- Rich - Beautiful terminal formatting and progress bars for the CLI experience
- Typer - Modern Python CLI framework for building the command interface
- Beautiful Soup & Markdownify - HTML parsing and Markdown conversion libraries
Made with ❤️ for the AI development community
⭐ Star this repo • 🐛 Report Bug • 💡 Request Feature