AWS Documentation Reference

This repository contains crawled API documentation from multiple sources:

  • Terraform AWS Provider
  • Pulumi AWS Provider
  • Boto3 SDK
  • Crawl4AI Documentation

Directory Structure

output/
  terraform/           # Terraform AWS Provider docs
    resources/
    data-sources/
    guides/
  pulumi/             # Pulumi AWS Provider docs
    resources/
    api/
    guides/
  boto3/              # AWS SDK for Python docs
    services/
    guides/
  crawl4ai/           # Crawl4AI documentation
    core/
    api/
    guides/

Each documentation source follows a consistent format:

  • Markdown files for human reading
  • JSON files for machine consumption
  • Original HTML (optional, for debugging)
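
For a single crawled page, the parallel outputs look roughly like this (paths and file names are illustrative):

output/terraform/resources/s3_bucket.md             # human-readable markdown
output/json_reference/terraform_aws/s3_bucket.json  # machine-readable JSON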

Workflow

  1. Crawl Documentation

    # Crawl a specific source
    python crawler.py terraform_aws
    
    # Or crawl multiple sources
    python crawler.py pulumi_aws boto3 terraform_aws

    This creates:

    • Human-readable markdown in output/<source>/
    • Machine-readable JSON in output/json_reference/<source>/
  2. Use Documentation Programmatically

    from doc_loader import APIDocLoader
    
    # Initialize with crawled docs
    loader = APIDocLoader("output/json_reference")
    
    # Search across all documentation
    results = loader.semantic_search(
        "How to create an S3 bucket with versioning?"
    )
    
    # Get examples for a specific service
    examples = loader.get_api_examples(
        service_name="s3",
        source="pulumi_aws"
    )

Structure

  • docs/
    • terraform/ # Terraform AWS Provider docs
    • pulumi/ # Pulumi AWS Provider docs
    • boto3/ # Boto3 SDK docs
    • crawl4ai/ # Crawl4AI documentation
  • output/json_reference/ # Structured JSON for programmatic use

Using the Documentation Loader

The doc_loader.py module provides programmatic access to the crawled documentation:

from doc_loader import APIDocLoader, format_for_llm

# Initialize the loader
loader = APIDocLoader("output/json_reference")

# 1. Semantic Search
# Find relevant documentation based on natural language queries
results = loader.semantic_search(
    query="How to create an S3 bucket with versioning?",
    top_k=3,  # Return top 3 results
    source_filter="pulumi_aws"  # Optional: filter by source
)

# 2. Service-Specific Documentation
# Get all documentation for a service
s3_docs = loader.get_service_docs(
    service_name="s3",
    source="boto3"  # Optional: specify source
)

# 3. Code Examples
# Extract code examples for a service/method
examples = loader.get_api_examples(
    service_name="s3",
    method_name="create_bucket",  # Optional: filter by method
    source="terraform_aws"  # Optional: specify source
)

# 4. Format for LLM Consumption
# Format results in a way that's optimal for LLMs
formatted_docs = format_for_llm(results)

Features

  1. Semantic Search

    • Uses sentence embeddings for meaning-based search
    • Supports filtering by documentation source
    • Returns ranked results by relevance
  2. Service Documentation

    • Retrieve all documentation for specific services
    • Filter by source (e.g., Terraform, Pulumi, Boto3)
    • Access structured content including overview, API reference, and examples
  3. Code Examples

    • Extract relevant code examples
    • Filter by service, method, or source
    • Includes language information and context
  4. LLM Integration

    • Format documentation for optimal LLM consumption
    • Includes metadata, overview, API reference, and examples
    • Structured for easy parsing and context understanding
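
    As a rough sketch of this integration (assuming format_for_llm returns a plain string; the prompt wiring below is illustrative, not part of doc_loader):

    from doc_loader import APIDocLoader, format_for_llm

    loader = APIDocLoader("output/json_reference")

    # Retrieve and format context for the model
    results = loader.semantic_search(
        "How to enable versioning on an S3 bucket?",
        top_k=3
    )
    context = format_for_llm(results)

    # Wrap the formatted documentation in a prompt
    prompt = (
        "Answer using only the documentation below.\n\n"
        f"{context}\n\n"
        "Question: How do I enable versioning on an S3 bucket?"
    )
    # Send `prompt` to whichever LLM client you use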

Guide: Analyzing API Documentation Sites for crawl4ai

1. Initial Analysis Tools

# Chrome DevTools shortcuts
F12 or Cmd+Opt+I (Mac)  # Open DevTools
Cmd+Shift+C (Mac)       # Enable element selector

2. Key Elements to Identify

  1. Main Content Container:

    # Common patterns to look for
    selectors = [
        "main",                    # Modern sites often use semantic HTML
        "article",                 # Documentation articles
        ".content",               # Content class
        ".documentation",         # Documentation class
        "[role='main']",         # ARIA role
        ".markdown-body"         # Documentation body (e.g., GitHub)
    ]
    
    # Example crawl4ai config
    config = CrawlerRunConfig(
        wait_for="css:main",        # Wait for main content
        css_selector="main",        # Extract main content
        wait_until="load"           # Wait for page load
    )
  2. Navigation Elements:

    # Common navigation patterns
    nav_selectors = [
        "nav",                     # Semantic nav element
        ".sidebar",               # Sidebar navigation
        ".toc",                  # Table of contents
        ".menu",                 # Menu container
        "[role='navigation']"    # ARIA navigation
    ]
    
    # Example link extraction
    links = result.links['internal']  # crawl4ai's native link extraction
  3. Code Blocks:

    # Common code block patterns
    code_selectors = [
        "pre",                    # Preformatted code
        "code",                   # Inline code
        ".highlight",            # Syntax highlighting
        ".code-block"           # Code block container
    ]
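
    One way to pull these code blocks out of a crawl result is to run the selectors over result.html with BeautifulSoup (a sketch; this helper is not a crawl4ai built-in):

    from bs4 import BeautifulSoup

    def extract_code_blocks(html):
        # Collect text from the common code-block containers listed above
        soup = BeautifulSoup(html, "html.parser")
        blocks = []
        for selector in ["pre", ".highlight", ".code-block"]:
            for node in soup.select(selector):
                text = node.get_text().strip()
                if text and text not in blocks:
                    blocks.append(text)
        return blocks

    # Usage: extract_code_blocks(result.html)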

3. Analysis Process

  1. URL Structure Analysis:

    # Example URL patterns to look for
    patterns = {
        "section_urls": "/docs/section/",
        "api_urls": "/api-reference/",
        "anchors": "#section-name",
        "versioned": "/v2/docs/"
    }
    
    # crawl4ai configuration
    config = CrawlerRunConfig(
        # Skip certain patterns
        url_filter=lambda url: not url.endswith('.png')
    )
  2. Content Loading:

    # Check if content is:
    # 1. Static HTML
    config = CrawlerRunConfig(
        wait_until="domcontentloaded"  # Faster for static content
    )
    
    # 2. Dynamic JavaScript
    config = CrawlerRunConfig(
        wait_until="networkidle",     # Wait for dynamic content
        wait_for="css:.content"      # Wait for specific element
    )
  3. Site Structure Mapping:

    # Common documentation structures
    structures = {
        "nested": {
            "url": "base_url/section/subsection",
            "config": CrawlerRunConfig(
                wait_for="css:main",
                css_selector="main"
            )
        },
        "flat": {
            "url": "base_url/page-name",
            "config": CrawlerRunConfig(
                wait_for="css:article",
                css_selector="article"
            )
        },
        "anchored": {
            "url": "base_url/page#section",
            "config": CrawlerRunConfig(
                wait_for="css:.section",
                css_selector=".section"
            )
        }
    }

4. crawl4ai Configuration Examples

  1. Basic Static Site:

    config = CrawlerRunConfig(
        wait_for="css:main",
        wait_until="domcontentloaded",
        process_iframes=False,
        only_text=False,
        css_selector="main"
    )
  2. Dynamic Documentation:

    config = CrawlerRunConfig(
        wait_for="css:.content",
        wait_until="networkidle",
        process_iframes=True,
        only_text=False,
        css_selector=".content"
    )
  3. API Reference:

    config = CrawlerRunConfig(
        wait_for="css:.api-content",
        wait_until="networkidle",
        process_iframes=False,
        only_text=False,
        css_selector=".api-content",
        url_filter=lambda url: "/api/" in url
    )

5. Common Challenges and Solutions

  1. Dynamic Content:

    # Wait for dynamic content to load
    config = CrawlerRunConfig(
        wait_until="networkidle",
        timeout=30000  # 30 seconds
    )
  2. Infinite Scrolling:

    # Handle pagination or infinite scroll
    config = CrawlerRunConfig(
        wait_for="css:.content",
        scroll_to_bottom=True,
        scroll_timeout=5000
    )
  3. Authentication:

    # Handle authenticated content
    browser_config = BrowserConfig(
        headless=True,
        storage_state="auth.json"  # Save authenticated state
    )
  4. Rate Limiting:

    # Implement rate limiting
    rate_limits = {
        "docs.example.com": 2,  # 2 requests per second
    }
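
    A minimal sketch of enforcing those per-domain limits between requests (the throttle helper is illustrative, not a crawl4ai feature):

    import asyncio
    import time
    from urllib.parse import urlparse

    rate_limits = {"docs.example.com": 2}  # requests per second
    _last_request = {}

    async def throttle(url):
        # Sleep just long enough to respect the domain's requests-per-second limit
        host = urlparse(url).netloc
        rps = rate_limits.get(host)
        if not rps:
            return
        min_interval = 1.0 / rps
        elapsed = time.monotonic() - _last_request.get(host, 0.0)
        if elapsed < min_interval:
            await asyncio.sleep(min_interval - elapsed)
        _last_request[host] = time.monotonic()

    # Call `await throttle(url)` before each crawler.arun(url=url, ...)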

6. Testing Your Selectors

  1. Chrome DevTools Console:

    // Test CSS selectors
    document.querySelector('main')
    document.querySelectorAll('nav a')
    
    // Test content extraction
    document.querySelector('main').textContent
  2. Python Interactive Testing:

    # Quick selector testing
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    async def test_selectors(url, selector):
        config = CrawlerRunConfig(
            wait_for=f"css:{selector}",
            css_selector=selector
        )
        # Create the crawler here so the snippet runs standalone
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url, config=config)
        print(f"Found content: {bool(result.html)}")
        return result
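
    Run it from a script, for example (URL is illustrative):

    import asyncio

    asyncio.run(test_selectors("https://docs.example.com/start", "main"))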

7. Best Practices

  1. Start Simple:

    # Begin with basic selectors
    config = CrawlerRunConfig(
        wait_for="css:main",
        css_selector="main"
    )
  2. Iterate and Refine:

    # Add more specific selectors
    config = CrawlerRunConfig(
        wait_for="css:main",
        css_selector="main article",
        url_filter=lambda url: "/docs/" in url
    )
  3. Document Your Analysis:

    # Comment your findings
    sources = {
        "example_docs": {
            "url": "https://docs.example.com",
            "selector": "main",  # Main content wrapper
            "link_selector": "nav a",  # Navigation links
            "code_selector": "pre code"  # Code blocks
        }
    }

Usage

This repository is designed to be used as a git submodule in your projects:

git submodule add https://github.com/your-org/aws-docs-reference .docs/aws
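
From the parent project, the loader can then be pointed at the submodule checkout (a sketch, assuming doc_loader is importable there and the submodule lives at .docs/aws):

from doc_loader import APIDocLoader

# Path is the submodule location chosen above
loader = APIDocLoader(".docs/aws/output/json_reference")
results = loader.semantic_search("How to create an S3 bucket with versioning?")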
