Smart Documentation Scraper for AI Development
Transform any documentation site into AI-ready Markdown files for enhanced coding with LLMs
When coding with Cursor, Claude, or any LLM, you want the most up-to-date documentation at your fingertips. Instead of letting your AI assistant fetch outdated information from the web, Aimdoc gives you:
- Fresh, local documentation that your LLM can reference instantly
- Optimized Markdown designed specifically for AI consumption
- Smart content extraction that focuses on what matters for developers
- Organized file structure that makes sense to both humans and AI
Aimdoc is a pure Python CLI tool that runs completely locally - no server required!
- Intelligent sitemap discovery - Automatically finds and parses sitemaps (a rough sketch follows this feature list)
- Smart content extraction - Uses universal CSS selectors to grab the right content
- Chapter organization - Automatically structures docs into logical sections
- Robust error handling - Handles failed pages gracefully with detailed diagnostics
- Interactive setup - Just run `aimdoc scrape` and follow the prompts
- Live progress tracking with elegant progress bars and spinners powered by Rich
- Smart defaults - Automatically detects project names and creates organized folders
- 100% local execution - No API server, no network dependencies beyond scraping
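To make the sitemap discovery idea concrete, here is a rough, standard-library-only sketch of the general approach: look for a `Sitemap:` entry in `robots.txt`, then fall back to the conventional `/sitemap.xml` location. This illustrates the concept only and is not Aimdoc's actual code.

```python
# Illustrative sketch of sitemap discovery (not Aimdoc's implementation).
from urllib.parse import urljoin
from urllib.request import urlopen
import xml.etree.ElementTree as ET

def discover_sitemap_urls(base_url):
    """Return page URLs listed in the site's sitemap, if one can be found."""
    candidates = []
    try:
        robots = urlopen(urljoin(base_url, "/robots.txt")).read().decode("utf-8", "ignore")
        # robots.txt may advertise one or more "Sitemap: <url>" lines.
        candidates += [line.split(":", 1)[1].strip()
                       for line in robots.splitlines()
                       if line.lower().startswith("sitemap:")]
    except OSError:
        pass
    candidates.append(urljoin(base_url, "/sitemap.xml"))  # common default location

    for sitemap_url in candidates:
        try:
            tree = ET.fromstring(urlopen(sitemap_url).read())
        except (OSError, ET.ParseError):
            continue
        # Page URLs (and nested sitemap URLs) live in <loc> elements.
        urls = [el.text for el in tree.iter() if el.tag.endswith("loc") and el.text]
        if urls:
            return urls
    return []
```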
- Python 3.8+ with pip
- Basic familiarity with command line
```bash
# Clone and run setup script
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc
./setup.sh
```
```bash
# Install directly from PyPI (when published)
pip install aimdoc

# Or install from source manually
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc
pip install -e .
```
```bash
# Interactive mode - just follow the prompts!
aimdoc scrape

# Or specify everything upfront
aimdoc scrape https://docs.example.com --name "Example Docs" --output-dir ./my-docs
```
That's it! Your documentation will be downloaded as clean, AI-ready Markdown files. No server setup required!
```bash
# Next.js documentation
aimdoc scrape https://nextjs.org/docs

# React documentation
aimdoc scrape https://react.dev

# Tailwind CSS docs
aimdoc scrape https://tailwindcss.com/docs

# FastAPI documentation
aimdoc scrape https://fastapi.tiangolo.com

# Custom project name and output directory
aimdoc scrape https://docs.python.org --name "Python Official" --output-dir ./references

# See all available commands
aimdoc --help

# Check version
aimdoc version
```
Under the hood, the scraping pipeline looks like this:

```mermaid
graph TB
    CLI[CLI Tool] --> Engine[Scrapy Engine]
    Engine --> Discover[Sitemap Discovery]
    Engine --> Extract[Content Extraction]
    Engine --> Convert[Markdown Conversion]
    Engine --> Organize[File Organization]

    Discover --> Robots[robots.txt]
    Discover --> Sitemap[sitemap.xml]
    Extract --> CSS[CSS Selectors]
    Extract --> Clean[HTML Cleaning]
    Convert --> Markdown[Optimized MD]
    Convert --> Structure[Chapter Structure]
```
- AimdocSpider: Intelligent web crawler with sitemap discovery (a simplified sketch follows this list)
- Markdown Pipeline: Converts HTML to clean, LLM-optimized Markdown
- Progress Tracker: Real-time CLI progress with Rich UI components
- CLI Interface: Beautiful command-line experience with Typer
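To give a feel for how the spider side fits together, here is a stripped-down example built on Scrapy's stock `SitemapSpider`. It is only a sketch; the real `AimdocSpider` in `aimdoc/spiders/aimdoc.py` adds discovery fallbacks, filtering, and diagnostics.

```python
# Simplified, illustrative spider (not the actual AimdocSpider).
from scrapy.spiders import SitemapSpider

class DocsSketchSpider(SitemapSpider):
    name = "docs_sketch"
    sitemap_urls = ["https://docs.example.com/sitemap.xml"]  # hypothetical target

    def parse(self, response):
        # Hand the raw page to the item pipelines for cleaning and conversion.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "html": response.text,
        }
```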
- Clean Markdown: Removes navigation, ads, and irrelevant content (illustrated in the sketch after this list)
- Consistent formatting: Standardized headings, code blocks, and links
- Logical structure: Organized into chapters and sections
- README generation: Auto-creates navigation index
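As a minimal sketch of the cleaning-and-conversion step, assuming the Beautiful Soup and Markdownify libraries credited in the acknowledgments (the real pipeline in `aimdoc/pipelines/optimized_html_markdown.py` is more thorough):

```python
# Illustrative cleaning + conversion, not the actual pipeline code.
from bs4 import BeautifulSoup
from markdownify import ATX, markdownify as md

def html_to_clean_markdown(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop page chrome that adds nothing for an LLM: nav, scripts, footers, asides.
    for tag in soup(["nav", "script", "style", "footer", "aside"]):
        tag.decompose()
    # Prefer the main content region when the page exposes one.
    content = soup.find("main") or soup.find("article") or soup.body or soup
    return md(str(content), heading_style=ATX)  # "#"-style headings
```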
- Concurrent scraping: Process multiple pages simultaneously
- Intelligent throttling: Respects rate limits and robots.txt
- HTTP caching: Avoids re-downloading unchanged content (see the settings sketch after this list)
- Error recovery: Continues scraping even when some pages fail
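These behaviours correspond to standard Scrapy settings. The snippet below shows the kind of knobs involved, with illustrative values; the setting names are Scrapy's own, but the exact configuration Aimdoc ships with may differ.

```python
# Example Scrapy settings for politeness and caching (values are illustrative).
ROBOTSTXT_OBEY = True                 # respect robots.txt

AUTOTHROTTLE_ENABLED = True           # back off automatically when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

HTTPCACHE_ENABLED = True              # cache responses so re-runs skip unchanged pages
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24
```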
- Interactive CLI: No need to memorize commands or flags
- Real-time feedback: Beautiful progress bars and spinners with Rich (example after this list)
- Smart defaults: Works great out of the box
- Local execution: No server setup or management required
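As a taste of the kind of feedback Rich makes possible (this is not Aimdoc's actual progress code, which lives in `aimdoc/cli/progress.py`):

```python
# Minimal Rich progress demo with a spinner, a bar, and a page counter.
import time
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn

columns = [
    SpinnerColumn(),
    TextColumn("[bold]{task.description}"),
    BarColumn(),
    TextColumn("{task.completed}/{task.total} pages"),
]
with Progress(*columns) as progress:
    task = progress.add_task("Scraping docs", total=120)  # hypothetical page count
    for _ in range(120):
        time.sleep(0.01)  # stand-in for fetching and converting a page
        progress.advance(task)
```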
- Framework agnostic: Works with any documentation site
- Sitemap discovery: Automatically finds all documentation pages
- Flexible selectors: Adapts to different site structures (sketched after this list)
- Robust parsing: Handles various HTML layouts
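One plausible way to implement flexible selectors is to try a prioritized list of common documentation-content selectors and keep the first match. The selector list below is invented for illustration and is not necessarily what Aimdoc uses.

```python
# Illustrative selector fallback using Scrapy's Selector (parsel).
from scrapy import Selector

CANDIDATE_SELECTORS = ["main", "article", "div[role=main]", "body"]

def pick_content_html(page_html):
    sel = Selector(text=page_html)
    for css in CANDIDATE_SELECTORS:
        match = sel.css(css).get()  # first matching element as HTML, or None
        if match:
            return match
    return page_html  # fall back to the whole page
```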
The project is organized as follows:

```
aimdoc/
├── aimdoc/                             # Main Python package
│   ├── __main__.py                     # CLI entry point
│   ├── cli/                            # CLI components
│   │   ├── commands.py                 # Main commands (scrape, version)
│   │   ├── progress.py                 # Rich-based progress tracking
│   │   └── utils.py                    # CLI utilities
│   ├── spiders/
│   │   └── aimdoc.py                   # Scrapy spider with smart discovery
│   ├── pipelines/
│   │   ├── optimized_html_markdown.py  # HTML → Markdown conversion
│   │   ├── progress_tracker.py         # Progress tracking pipeline
│   │   └── assemble.py                 # File organization
│   ├── settings.py                     # Scrapy configuration
│   └── items.py                        # Scrapy items
├── setup.py                            # Package installation
├── requirements.txt                    # Dependencies
└── README.md                           # This file
```
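The CLI layer in `aimdoc/cli/commands.py` is built with Typer. Below is a minimal, hypothetical sketch of how a command of this shape is typically wired; the real option names, defaults, and prompts may differ.

```python
# Hypothetical Typer wiring, for illustration only.
from typing import Optional
import typer

app = typer.Typer(help="Scrape documentation sites into AI-ready Markdown.")

@app.command()
def scrape(
    url: Optional[str] = typer.Argument(None, help="Documentation root URL (prompted for if omitted)"),
    name: Optional[str] = typer.Option(None, "--name", help="Project name used for the output folder"),
    output_dir: str = typer.Option("./docs", "--output-dir", help="Where to write the Markdown"),
):
    """Run the scraper against URL."""
    typer.echo(f"Scraping {url} into {output_dir} ...")

if __name__ == "__main__":
    app()
```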
You can customize the scraping behavior by modifying `aimdoc/settings.py`:
```python
# Increase concurrency for faster scraping
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Adjust delays
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_START_DELAY = 0.5

# Enable more verbose logging
LOG_LEVEL = 'INFO'
```
```bash
# Run with custom output directory
aimdoc scrape https://docs.example.com --output-dir ~/Documentation

# Specify project name explicitly
aimdoc scrape https://docs.example.com --name "My Project Docs"

# Combine options
aimdoc scrape https://docs.example.com --name "Docs" --output-dir ./references
```
We love contributions! Here's how to get started:
- Python 3.8+ with pip
- Git
```bash
# Fork and clone the repo
git clone https://github.com/clemeverger/aimdoc.git
cd aimdoc

# Run setup script (recommended)
./setup.sh

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

# Verify installation
aimdoc version
aimdoc scrape --help
```
```bash
# Test with a simple documentation site
aimdoc scrape https://typer.tiangolo.com --name "Test" --output-dir ./test-output

# Test interactive mode
aimdoc scrape
```
```bash
# Run with verbose logging to debug issues
# Edit aimdoc/settings.py and set LOG_LEVEL = 'DEBUG'
aimdoc scrape https://docs.example.com --name "Debug Test" --output-dir ./debug-output

# Test with various documentation structures
aimdoc scrape https://fastapi.tiangolo.com --name "FastAPI" --output-dir ./test-fastapi
aimdoc scrape https://tailwindcss.com/docs --name "Tailwind" --output-dir ./test-tailwind
```
- Python: Follow PEP 8, use `black` for formatting
- Commit messages: Use conventional commits format
- Test your changes with real documentation sites before submitting
❌ "No sitemap found" or "NO URLS FOUND TO SCRAPE"

```bash
# Some sites don't have sitemaps - this is normal behavior
# The scraper attempts multiple discovery methods automatically
# Check the logs for more details about what URLs were tried
```
❌ "Permission denied"

```bash
# Make sure you have write permissions to the output directory
chmod +w ./docs

# Or choose a different output directory you own
aimdoc scrape https://docs.example.com --output-dir ~/Documents/docs
```
❌ "Command not found: aimdoc"

```bash
# Make sure you installed the package correctly
pip install -e .

# Or run directly with Python
python -m aimdoc --help
```
If you're still stuck, you can:

- Report bugs
- Request features
- Check the source code for more details
On the roadmap:

- Enhanced site discovery - better URL detection algorithms
- Plugin system for custom content extractors
- Multiple output formats (JSON, YAML, etc.)
- Incremental updates - only scrape changed pages
- Batch processing for multiple documentation sites
- Integration with AI coding assistants (direct Claude/Cursor plugins)
This project is licensed under the MIT License - see the LICENSE file for details.
- Scrapy - The powerful and flexible web scraping framework that powers our engine
- Rich - Beautiful terminal formatting and progress bars for the CLI experience
- Typer - Modern Python CLI framework for building the command interface
- Beautiful Soup & Markdownify - HTML parsing and Markdown conversion libraries
Made with ❤️ for the AI development community
⭐ Star this repo • 🐛 Report Bug • 💡 Request Feature