
AI Documentation Pipeline

A Python-based automation pipeline that transforms API metadata into structured Markdown documentation using large language model (LLM) APIs. The project demonstrates an automated documentation engineering workflow with built-in quality validation and testing.

Python 3.9+ License: MIT

Overview

This pipeline loads API metadata, transforms it into structured prompts, calls a large language model via API, and renders validated Markdown documentation. The system includes basic quality tests to ensure consistent, repeatable output suitable for documentation engineering workflows.

Key features:

  • Automated pipeline: End-to-end documentation generation from API specs
  • Multiple input formats: Supports OpenAPI JSON/YAML, CSV datasets, and raw API descriptions
  • LLM integration: Optimized prompts for GPT-4, GPT-3.5, and other models
  • Quality validation: Comprehensive validation with error detection and suggestions
  • Structured output: Clean, consistent Markdown with proper formatting
  • Testing and optimization: Full test suite with mock API calls
  • Async support: Parallel generation for large API specifications

Architecture

Source Input (OpenAPI/JSON/CSV)
        ↓
Python Automation (loader.py)
        ↓
Prompt Optimization (transformer.py)
        ↓
LLM API Call (generator.py)
        ↓
Structured Markdown (renderer.py)
        ↓
Quality Validation (validator.py)

Project structure

ai_docs_pipeline/
├── pipeline/
│   ├── loader.py        # Load API data (JSON/YAML/CSV)
│   ├── transformer.py   # Prep prompts & normalize input
│   ├── generator.py     # OpenAI API calls & retry logic
│   ├── renderer.py      # Markdown output & formatting
│   └── validator.py     # Quality checks & validation
├── samples/
│   ├── sample_api.json  # User management API example
│   └── sample_api.yaml  # E-commerce API example
├── tests/
│   └── test_pipeline.py # Comprehensive test suite
├── output/              # Generated documentation
├── cli.py              # Command-line interface
└── __init__.py         # Package exports

Installation

Prerequisites

  • Python 3.9 or higher
  • OpenAI API key (or compatible LLM API)

Install dependencies

# Clone the repository
git clone <repository-url>
cd ai-pipeline

# Install in development mode
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt

Set up API key

# Option 1: Environment variable (recommended)
export OPENAI_API_KEY="sk-your-api-key-here"

# Option 2: Pass via command line
docs-pipeline input.yaml output.md --api-key sk-your-key

Quick start

Basic usage

# Generate documentation from OpenAPI spec
docs-pipeline samples/sample_api.json output/api_docs.md

# Use specific model with validation
docs-pipeline samples/sample_api.yaml output/ecommerce_docs.md \
    --model gpt-4o-mini \
    --validate

# Generate separate files for each endpoint (recommended for large APIs)
docs-pipeline samples/customer_api.yaml output/customer_docs \
    --model gpt-4o-mini \
    --separate-files

# Async generation for large APIs
docs-pipeline large_api.json output/docs.md --async

Python API usage

from pathlib import Path

from ai_docs_pipeline import APILoader, DocGenerator, MarkdownRenderer
from ai_docs_pipeline.pipeline.generator import GenerationConfig

# Load API specification
loader = APILoader()
spec = loader.load_from_file("samples/sample_api.json")

# Configure generation
config = GenerationConfig(
    model="gpt-4-turbo-preview",
    max_tokens=2000,
    temperature=0.3
)

# Generate documentation
generator = DocGenerator("your-api-key", config)
overview = generator.generate_overview_docs(spec)
endpoint_docs = [
    generator.generate_endpoint_docs(endpoint, spec.title)
    for endpoint in spec.endpoints
]

# Render final documentation
renderer = MarkdownRenderer()
final_doc = renderer.render_complete_documentation(
    overview,
    endpoint_docs,
    metadata={
        'title': spec.title,
        'version': spec.version,
        'base_url': spec.base_url
    }
)

# Save to file
output_path = Path("output/generated_docs.md")
renderer.save_to_file(final_doc, output_path)

Quick reference: setup steps

  1. Create and activate a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On macOS/Linux
    # or on Windows: venv\Scripts\activate
  2. Install the project in development mode:

    pip install -e .

    Or alternatively, install dependencies directly:

    pip install -r requirements.txt
  3. Set your OpenAI API key:

    export OPENAI_API_KEY="sk-your-api-key-here"
  4. Run the application using the CLI with a sample API:

    # Generate single combined file
    docs-pipeline samples/sample_api.yaml output/sample_docs.md --model gpt-4o-mini
    
    # Generate separate files for each endpoint (recommended)
    docs-pipeline samples/customer_api.yaml output/customer_docs --model gpt-4o-mini --separate-files

    Alternatively, you can use the Python module syntax:

    python -m ai_docs_pipeline.cli samples/customer_api.yaml output/test_docs.md --model gpt-4o-mini --validate

Output formats

The pipeline supports two output formats:

Single combined file (default)

Generates one large markdown file containing all documentation:

docs-pipeline input.yaml output.md --model gpt-4o-mini
# Creates: output.md (single file with all content)

Separate files per endpoint (recommended)

Generates individual files for better organization and navigation:

docs-pipeline input.yaml output/my_docs --model gpt-4o-mini --separate-files
# Creates directory: output/my_docs/
# ├── index.md          # Main index with links to all files
# ├── overview.md       # API overview and general info
# ├── get-customers.md  # Individual endpoint files
# ├── post-customer.md
# ├── put-customer.md
# └── generation_report.md
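The per-endpoint filenames shown above can be derived with a small slug helper. This is a sketch of one plausible scheme; the real naming logic lives in renderer.py and may differ:

```python
import re

def endpoint_filename(method: str, path: str) -> str:
    """Build a per-endpoint Markdown filename like 'get-customers.md'."""
    # Drop {id}-style path parameters, then kebab-case the remainder.
    slug = re.sub(r"\{[^}]*\}", "", path)
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", slug)
    slug = slug.strip("-").lower()
    return f"{method.lower()}-{slug}.md"
```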

Documentation engineering features

Quality validation

The pipeline includes comprehensive validation to ensure documentation quality:

from ai_docs_pipeline.pipeline.validator import DocumentValidator

validator = DocumentValidator()
result = validator.validate_document(generated_content, "complete")

# Check validation results
if result.has_errors:
    print(f"Errors found: {result.error_count}")
    for issue in result.issues:
        print(f"  • {issue.message}")

# Quality metrics
print(f"Word count: {result.metrics['word_count']:,}")
print(f"Readability score: {result.metrics['readability_score']:.1f}/10")
print(f"Code examples: {result.metrics['code_block_count']}")

Validation checks

  • Structure: Main titles, section hierarchy, content organization
  • Content quality: Placeholder detection, minimum content length, code examples
  • Markdown formatting: Table structure, code blocks, link validation
  • API coverage: Endpoint documentation completeness
  • Readability: Automated readability scoring
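As one illustration, a 0-10 readability score can be approximated from average sentence length. This is a heuristic sketch only; the validator's actual formula may differ:

```python
import re

def readability_score(text: str) -> float:
    """Score readability on a 0-10 scale from average sentence length.

    Rough heuristic: shorter sentences score higher; 10 at <=10 words
    per sentence, tapering to 0 at >=40 words.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    return max(0.0, min(10.0, 10.0 * (40 - avg_words) / 30))
```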

Error handling and retry logic

config = GenerationConfig(
    retry_attempts=3,
    retry_delay=1.0,      # Exponential backoff
    timeout=30
)

# Automatic retry on API failures
result = generator.generate_endpoint_docs(endpoint)
if not result.success:
    print(f"Generation failed: {result.metadata['error']}")
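A simplified sketch of what retry with exponential backoff looks like (generator.py's actual logic also distinguishes error types and enforces the timeout):

```python
import time

def call_with_retry(fn, retry_attempts=3, retry_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff.

    The delay doubles on each attempt: retry_delay, 2x, 4x, ...
    """
    last_error = None
    for attempt in range(retry_attempts):
        try:
            return fn()
        except Exception as exc:  # real code would catch specific API errors
            last_error = exc
            if attempt < retry_attempts - 1:
                time.sleep(retry_delay * (2 ** attempt))
    raise last_error
```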

Cost estimation

# Estimate API costs before generation
cost_estimate = generator.estimate_cost(api_spec)
print(f"Estimated cost: ${cost_estimate['estimated_cost_usd']:.4f}")
print(f"Input tokens: ~{cost_estimate['estimated_input_tokens']:,}")
print(f"Output tokens: ~{cost_estimate['estimated_output_tokens']:,}")
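Under the hood, an estimate like this typically relies on the rough four-characters-per-token rule of thumb. A sketch with placeholder prices follows (check your provider's current rates; `estimate_cost` here is illustrative, not the package method):

```python
def estimate_cost(prompt_text: str, expected_output_tokens: int,
                  input_price_per_1k: float = 0.01,
                  output_price_per_1k: float = 0.03) -> dict:
    """Rough cost estimate using ~4 characters per token.

    The per-1k-token prices are placeholders, not real rates.
    """
    input_tokens = len(prompt_text) // 4
    cost = (input_tokens / 1000) * input_price_per_1k \
         + (expected_output_tokens / 1000) * output_price_per_1k
    return {
        "estimated_input_tokens": input_tokens,
        "estimated_output_tokens": expected_output_tokens,
        "estimated_cost_usd": round(cost, 4),
    }
```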

Advanced usage

Async generation for performance

import asyncio

async def generate_large_api_docs():
    # Generate all documentation concurrently
    results = await generator.generate_all_docs_async(api_spec)

    # Process results
    successful = sum(1 for r in results.values() if r.success)
    print(f"Generated {successful}/{len(results)} sections successfully")

asyncio.run(generate_large_api_docs())

Custom prompt templates

from jinja2 import Template

from ai_docs_pipeline.pipeline.transformer import PromptTransformer

# Custom endpoint template
custom_template = Template('''
You are creating API documentation for developers.

Endpoint: {{ endpoint.method }} {{ endpoint.path }}
Description: {{ endpoint.description }}

Generate comprehensive documentation including:
1. Clear endpoint description
2. Authentication requirements
3. Request/response examples
4. Error handling information
''')

transformer = PromptTransformer()
transformer.endpoint_template = custom_template

Batch processing

# Process multiple API specs
for spec in api_specs/*.json api_specs/*.yaml; do
    [ -e "$spec" ] || continue   # skip unmatched globs
    echo "Processing $spec..."
    name=$(basename "$spec")
    docs-pipeline "$spec" "output/${name%.*}.md"
done
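The same batch loop can be driven from Python with pathlib (a sketch; `collect_spec_files` is illustrative, not part of the package):

```python
from pathlib import Path

def collect_spec_files(spec_dir: str) -> list:
    """Find all JSON/YAML API specs under spec_dir, sorted for stable order."""
    root = Path(spec_dir)
    return sorted(p for p in root.iterdir()
                  if p.suffix in {".json", ".yaml", ".yml"})
```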

Testing

The project includes comprehensive tests covering all pipeline components:

# Run all tests
pytest

# Run with coverage
pytest --cov=ai_docs_pipeline --cov-report=html

# Run specific test categories
pytest tests/test_pipeline.py::TestAPILoader
pytest tests/test_pipeline.py::TestDocGenerator
pytest tests/test_pipeline.py::TestDocumentValidator

Test coverage

  • Unit tests: All pipeline components with mocked LLM calls
  • Integration tests: End-to-end pipeline validation
  • Edge cases: Error handling, malformed inputs, API failures
  • Quality validation: Documentation structure and content validation
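The mocked-LLM pattern the unit tests rely on can be sketched generically with unittest.mock; the suite's real fixtures in test_pipeline.py will differ:

```python
from unittest.mock import Mock

def generate_with_mock():
    """Sketch of the test suite's mocking pattern: the LLM client is
    replaced with a Mock so tests run offline and deterministically."""
    fake_client = Mock()
    fake_client.complete.return_value = "## GET /users\nRetrieve users."

    # Exercise the same call path the generator would use, against the mock.
    doc = fake_client.complete(prompt="Document GET /users")
    fake_client.complete.assert_called_once_with(prompt="Document GET /users")
    return doc
```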

Configuration

Environment variables

# OpenAI API configuration
OPENAI_API_KEY=sk-your-api-key-here
OPENAI_ORG_ID=org-your-org-id  # Optional

# Pipeline configuration
DOCS_PIPELINE_MODEL=gpt-4-turbo-preview
DOCS_PIPELINE_MAX_TOKENS=2000
DOCS_PIPELINE_TEMPERATURE=0.3
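These variables could be read into pipeline settings roughly like so (a sketch; `load_config_from_env` is a hypothetical helper, not part of the package):

```python
import os

def load_config_from_env() -> dict:
    """Read pipeline settings from the environment, falling back to the
    defaults shown above. Returns a plain dict for illustration."""
    return {
        "model": os.environ.get("DOCS_PIPELINE_MODEL", "gpt-4-turbo-preview"),
        "max_tokens": int(os.environ.get("DOCS_PIPELINE_MAX_TOKENS", "2000")),
        "temperature": float(os.environ.get("DOCS_PIPELINE_TEMPERATURE", "0.3")),
    }
```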

CLI configuration

# Full configuration example
docs-pipeline input.yaml output.md \
    --model gpt-4-turbo-preview \
    --max-tokens 3000 \
    --temperature 0.2 \
    --async \
    --validate

Examples

Sample input (OpenAPI JSON)

{
  "openapi": "3.0.3",
  "info": {
    "title": "User Management API",
    "version": "1.2.0",
    "description": "Comprehensive API for managing user accounts"
  },
  "paths": {
    "/users": {
      "get": {
        "summary": "List all users",
        "parameters": [
          {
            "name": "page",
            "in": "query",
            "schema": {"type": "integer"},
            "description": "Page number for pagination"
          }
        ]
      }
    }
  }
}
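Loading a spec like this amounts to walking its paths and methods. A minimal sketch of the kind of extraction loader.py performs (`extract_endpoints` is illustrative, not the package's API; field names follow the OpenAPI 3 structure):

```python
def extract_endpoints(spec: dict) -> list:
    """Flatten an OpenAPI dict into (METHOD, path, summary) tuples."""
    http_methods = {"get", "post", "put", "patch", "delete"}
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            if method in http_methods:
                endpoints.append((method.upper(), path, op.get("summary", "")))
    return endpoints
```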

Sample output (generated Markdown)

# User Management API

| Field | Value |
|-------|-------|
| Version | 1.2.0 |
| Generated | 2024-01-15 10:30:45 |
| Base URL | `https://api.example.com/v1` |

Comprehensive API for managing user accounts and profiles in web applications.

## Table of contents

- [API endpoints](#api-endpoints)
- [GET /users](#get-users)

## API endpoints

### GET /users

Retrieve a paginated list of all users in the system with filtering options.

#### Parameters

| Name | Type | Required | Location | Description |
|------|------|----------|----------|-------------|
| page | integer | false | query | Page number for pagination |

#### Example request

```bash
curl -X GET "https://api.example.com/v1/users?page=1" \
     -H "Authorization: Bearer your-token"
```

#### Example response

```json
{
  "users": [...],
  "pagination": {
    "page": 1,
    "total_pages": 10
  }
}
```

Development setup

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run quality checks
black ai_docs_pipeline/ tests/
ruff check ai_docs_pipeline/ tests/
mypy ai_docs_pipeline/

Troubleshooting

Common issues

API key errors

Error: OpenAI API key required
# Solution: Set OPENAI_API_KEY environment variable
export OPENAI_API_KEY="sk-your-key"

Rate limiting

# Increase retry delays for rate limits
config = GenerationConfig(
    retry_attempts=5,
    retry_delay=2.0
)

Large API specifications

# Use async mode for better performance
docs-pipeline large_spec.yaml output.md --async --max-tokens 4000

Validation failures

# Check validation details
docs-pipeline input.yaml output.md --validate

# Skip validation if needed
docs-pipeline input.yaml output.md --no-validate

License

MIT License - see LICENSE file for details.

Sources

  • OpenAI: GPT models for documentation generation
  • Jinja2: Template engine for prompt optimization
  • Rich: CLI output and progress indicators
  • Pydantic: Data validation and serialization
  • PyYAML: YAML parsing support
