A Python-based automation pipeline that transforms API metadata into structured Markdown documentation using large language model APIs. The pipeline loads API metadata, transforms it into structured prompts, calls an LLM via API, and renders validated Markdown documentation. Built-in quality tests help ensure consistent, repeatable output suitable for documentation engineering workflows.
Key features:
- Automated pipeline: End-to-end documentation generation from API specs
- Multiple input formats: Supports OpenAPI JSON/YAML, CSV datasets, raw API descriptions
- LLM integration: Optimized prompts for GPT-4, GPT-3.5, and other models
- Quality validation: Comprehensive validation with error detection and suggestions
- Structured output: Clean, consistent Markdown with proper formatting
- Testing and optimization: Full test suite with mock API calls
- Async support: Parallel generation for large API specifications
```text
Source Input (OpenAPI/JSON/CSV)
        ↓
Python Automation (loader.py)
        ↓
Prompt Optimization (transformer.py)
        ↓
LLM API Call (generator.py)
        ↓
Structured Markdown (renderer.py)
        ↓
Quality Validation (validator.py)
```
```text
ai_docs_pipeline/
├── pipeline/
│   ├── loader.py          # Load API data (JSON/YAML/CSV)
│   ├── transformer.py     # Prep prompts & normalize input
│   ├── generator.py       # OpenAI API calls & retry logic
│   ├── renderer.py        # Markdown output & formatting
│   └── validator.py       # Quality checks & validation
├── samples/
│   ├── sample_api.json    # User management API example
│   └── sample_api.yaml    # E-commerce API example
├── tests/
│   └── test_pipeline.py   # Comprehensive test suite
├── output/                # Generated documentation
├── cli.py                 # Command-line interface
└── __init__.py            # Package exports
```
- Python 3.9 or higher
- OpenAI API key (or compatible LLM API)
```bash
# Clone the repository
git clone <repository-url>
cd ai-pipeline

# Install in development mode
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt
```

```bash
# Option 1: Environment variable (recommended)
export OPENAI_API_KEY="sk-your-api-key-here"

# Option 2: Pass via command line
docs-pipeline input.yaml output.md --api-key sk-your-key
```

```bash
# Generate documentation from OpenAPI spec
docs-pipeline samples/sample_api.json output/api_docs.md

# Use a specific model with validation
docs-pipeline samples/sample_api.yaml output/ecommerce_docs.md \
  --model gpt-4o-mini \
  --validate

# Generate separate files for each endpoint (recommended for large APIs)
docs-pipeline samples/customer_api.yaml output/customer_docs \
  --model gpt-4o-mini \
  --separate-files

# Async generation for large APIs
docs-pipeline large_api.json output/docs.md --async
```

```python
from pathlib import Path

from ai_docs_pipeline import APILoader, DocGenerator, MarkdownRenderer
from ai_docs_pipeline.pipeline.generator import GenerationConfig

# Load API specification
loader = APILoader()
spec = loader.load_from_file("samples/sample_api.json")

# Configure generation
config = GenerationConfig(
    model="gpt-4-turbo-preview",
    max_tokens=2000,
    temperature=0.3
)

# Generate documentation
generator = DocGenerator("your-api-key", config)
overview = generator.generate_overview_docs(spec)
endpoint_docs = [
    generator.generate_endpoint_docs(endpoint, spec.title)
    for endpoint in spec.endpoints
]

# Render final documentation
renderer = MarkdownRenderer()
final_doc = renderer.render_complete_documentation(
    overview,
    endpoint_docs,
    metadata={
        'title': spec.title,
        'version': spec.version,
        'base_url': spec.base_url
    }
)

# Save to file
output_path = Path("output/generated_docs.md")
renderer.save_to_file(final_doc, output_path)
```
- Create and activate a Python virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate   # On macOS/Linux
  # On Windows: venv\Scripts\activate
  ```

- Install the project in development mode:

  ```bash
  pip install -e .
  ```

  Or install dependencies directly:

  ```bash
  pip install -r requirements.txt
  ```

- Set your OpenAI API key:

  ```bash
  export OPENAI_API_KEY="sk-your-api-key-here"
  ```

- Run the application using the CLI with a sample API:

  ```bash
  # Generate single combined file
  docs-pipeline samples/sample_api.yaml output/sample_docs.md --model gpt-4o-mini

  # Generate separate files for each endpoint (recommended)
  docs-pipeline samples/customer_api.yaml output/customer_docs --model gpt-4o-mini --separate-files
  ```

  Alternatively, you can use the Python module syntax:

  ```bash
  python -m ai_docs_pipeline.cli samples/customer_api.yaml output/test_docs.md --model gpt-4o-mini --validate
  ```
The pipeline supports two output formats.

**Single file** generates one large Markdown file containing all documentation:

```bash
docs-pipeline input.yaml output.md --model gpt-4o-mini
# Creates: output.md (single file with all content)
```

**Separate files** generates individual files for better organization and navigation:

```bash
docs-pipeline input.yaml output/my_docs --model gpt-4o-mini --separate-files
# Creates directory: output/my_docs/
# ├── index.md             # Main index with links to all files
# ├── overview.md          # API overview and general info
# ├── get-customers.md     # Individual endpoint files
# ├── post-customer.md
# ├── put-customer.md
# └── generation_report.md
```

The pipeline includes comprehensive validation to ensure documentation quality:
```python
from ai_docs_pipeline.pipeline.validator import DocumentValidator

validator = DocumentValidator()
result = validator.validate_document(generated_content, "complete")

# Check validation results
if result.has_errors:
    print(f"Errors found: {result.error_count}")
    for issue in result.issues:
        print(f"  • {issue.message}")

# Quality metrics
print(f"Word count: {result.metrics['word_count']:,}")
print(f"Readability score: {result.metrics['readability_score']:.1f}/10")
print(f"Code examples: {result.metrics['code_block_count']}")
```

Validation covers:

- Structure: Main titles, section hierarchy, content organization
- Content quality: Placeholder detection, minimum content length, code examples
- Markdown formatting: Table structure, code blocks, link validation
- API coverage: Endpoint documentation completeness
- Readability: Automated readability scoring
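The placeholder-detection check listed above can be sketched as a standalone function. This is an illustrative sketch, not the project's actual `validator.py` code, and the phrase list is an assumption:

```python
import re

# Hypothetical placeholder patterns a validator might flag (illustrative list)
PLACEHOLDER_PATTERNS = [
    r"\bTODO\b",
    r"\bTBD\b",
    r"\blorem ipsum\b",
    r"<insert [^>]+>",
]

def find_placeholders(markdown: str) -> list[str]:
    """Return any placeholder phrases left in generated documentation."""
    hits = []
    for pattern in PLACEHOLDER_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, markdown, re.IGNORECASE))
    return hits
```

A document containing "TODO: describe auth" would be flagged, while clean output yields an empty list, so a validator can treat any non-empty result as an error.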
```python
config = GenerationConfig(
    retry_attempts=3,
    retry_delay=1.0,  # Base delay; grows with exponential backoff
    timeout=30
)

# Automatic retry on API failures
result = generator.generate_endpoint_docs(endpoint)
if not result.success:
    print(f"Generation failed: {result.metadata['error']}")
```

```python
# Estimate API costs before generation
cost_estimate = generator.estimate_cost(api_spec)
print(f"Estimated cost: ${cost_estimate['estimated_cost_usd']:.4f}")
print(f"Input tokens: ~{cost_estimate['estimated_input_tokens']:,}")
print(f"Output tokens: ~{cost_estimate['estimated_output_tokens']:,}")
```

```python
import asyncio

async def generate_large_api_docs():
    # Generate all documentation concurrently
    results = await generator.generate_all_docs_async(api_spec)

    # Process results
    successful = sum(1 for r in results.values() if r.success)
    print(f"Generated {successful}/{len(results)} sections successfully")

asyncio.run(generate_large_api_docs())
```

```python
from jinja2 import Template

# Custom endpoint template
custom_template = Template('''
You are creating API documentation for developers.

Endpoint: {{ endpoint.method }} {{ endpoint.path }}
Description: {{ endpoint.description }}

Generate comprehensive documentation including:
1. Clear endpoint description
2. Authentication requirements
3. Request/response examples
4. Error handling information
''')

transformer = PromptTransformer()
transformer.endpoint_template = custom_template
```

```bash
# Process multiple API specs
for spec in api_specs/*.{json,yaml}; do
    echo "Processing $spec..."
    # Strip whichever extension the spec has before naming the output
    docs-pipeline "$spec" "output/$(basename "${spec%.*}").md"
done
```

The project includes comprehensive tests covering all pipeline components:
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=ai_docs_pipeline --cov-report=html

# Run specific test categories
pytest tests/test_pipeline.py::TestAPILoader
pytest tests/test_pipeline.py::TestDocGenerator
pytest tests/test_pipeline.py::TestDocumentValidator
```

- Unit tests: All pipeline components with mocked LLM calls
- Integration tests: End-to-end pipeline validation
- Edge cases: Error handling, malformed inputs, API failures
- Quality validation: Documentation structure and content validation
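The mocked-LLM unit tests described above isolate the code under test from the network. A minimal, self-contained sketch of the pattern (the `generate_summary` function and its client interface are hypothetical, not the project's actual API):

```python
from unittest.mock import MagicMock

# Hypothetical function under test: delegates text generation to an injected client
def generate_summary(client, prompt: str) -> str:
    response = client.complete(prompt)
    return response.strip()

def test_generate_summary_with_mocked_client():
    mock_client = MagicMock()
    mock_client.complete.return_value = "  ## GET /users\nReturns a paginated list.  "

    result = generate_summary(mock_client, "Document GET /users")

    # The mock records the call, so we can assert on it without any network I/O
    mock_client.complete.assert_called_once_with("Document GET /users")
    assert result.startswith("## GET /users")

test_generate_summary_with_mocked_client()
```

Injecting the client (rather than constructing it inside the function) is what makes this kind of test cheap and deterministic.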
```bash
# OpenAI API configuration
OPENAI_API_KEY=sk-your-api-key-here
OPENAI_ORG_ID=org-your-org-id  # Optional

# Pipeline configuration
DOCS_PIPELINE_MODEL=gpt-4-turbo-preview
DOCS_PIPELINE_MAX_TOKENS=2000
DOCS_PIPELINE_TEMPERATURE=0.3
```

```bash
# Full configuration example
docs-pipeline input.yaml output.md \
  --model gpt-4-turbo-preview \
  --max-tokens 3000 \
  --temperature 0.2 \
  --async \
  --validate
```

Example input (OpenAPI JSON):

```json
{
  "openapi": "3.0.3",
  "info": {
    "title": "User Management API",
    "version": "1.2.0",
    "description": "Comprehensive API for managing user accounts"
  },
  "paths": {
    "/users": {
      "get": {
        "summary": "List all users",
        "parameters": [
          {
            "name": "page",
            "in": "query",
            "schema": {"type": "integer"},
            "description": "Page number for pagination"
          }
        ]
      }
    }
  }
}
```

Example output (generated Markdown):

````markdown
# User Management API

| Field | Value |
|-------|-------|
| Version | 1.2.0 |
| Generated | 2024-01-15 10:30:45 |
| Base URL | `https://api.example.com/v1` |

Comprehensive API for managing user accounts and profiles in web applications.

## Table of contents

- [API endpoints](#api-endpoints)
  - [GET /users](#get-users)

## API endpoints

### GET /users

Retrieve a paginated list of all users in the system with filtering options.

#### Parameters

| Name | Type | Required | Location | Description |
|------|------|----------|----------|-------------|
| page | integer | false | query | Page number for pagination |

#### Example request

```bash
curl -X GET "https://api.example.com/v1/users?page=1" \
  -H "Authorization: Bearer your-token"
```

#### Example response

```json
{
  "users": [...],
  "pagination": {
    "page": 1,
    "total_pages": 10
  }
}
```
````

```bash
# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run quality checks
black ai_docs_pipeline/ tests/
ruff ai_docs_pipeline/ tests/
mypy ai_docs_pipeline/
```

**API key errors**

```
Error: OpenAI API key required
```

```bash
# Solution: Set OPENAI_API_KEY environment variable
export OPENAI_API_KEY="sk-your-key"
```

**Rate limiting**

```python
# Increase retry delays for rate limits
config = GenerationConfig(
    retry_attempts=5,
    retry_delay=2.0
)
```

**Large API specifications**

```bash
# Use async mode for better performance
docs-pipeline large_spec.yaml output.md --async --max-tokens 4000
```

**Validation failures**

```bash
# Check validation details
docs-pipeline input.yaml output.md --validate

# Skip validation if needed
docs-pipeline input.yaml output.md --no-validate
```

MIT License - see LICENSE file for details.
- OpenAI: GPT models for documentation generation
- Jinja2: Template engine for prompt optimization
- Rich: CLI output and progress indicators
- Pydantic: Data validation and serialization
- PyYAML: YAML parsing support