
Schema-Driven Documentation: Pydantic Models + Automated CI/Hook Enforcement #110

@Anselmoo


Documentation Infrastructure: Schema-Driven Validation with Pydantic & Automated Enforcement

🎯 Vision

Create a unified, schema-driven documentation pipeline where:

  1. JSON Schemas (docs/schemas/) define the single source of truth
  2. Pydantic models are generated from schemas for type-safe Python validation
  3. Scripts (scripts/) use Pydantic models for consistent validation
  4. Pre-commit hooks enforce compliance on every commit
  5. CI pipeline validates on every PR

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                    Schema-Driven Validation Pipeline                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  docs/schemas/                    scripts/                    Enforcement    │
│  ┌──────────────────┐            ┌──────────────────┐        ┌────────────┐  │
│  │ docstring-       │  generates │ Pydantic Models  │  used  │ Pre-commit │  │
│  │ schema.json      │───────────▶│ (type-safe)     │───────▶│ Hooks      │  │
│  ├──────────────────┤            ├──────────────────┤        ├────────────┤  │
│  │ vitepress-       │            │ validate_*.py    │        │ CI/CD      │  │
│  │ mapping.json     │            │ generate_*.py    │        │ Pipeline   │  │
│  ├──────────────────┤            │ check_*.py       │        └────────────┘  │
│  │ default-         │            └──────────────────┘                        │
│  │ mapping.json     │                                                        │
│  └──────────────────┘                                                        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

πŸ“ Current Assets

JSON Schemas (docs/schemas/)

| File | Purpose | Lines |
|------|---------|-------|
| docstring-schema.json | Defines COCO/BBOB docstring structure (AlgorithmMetadata, Args, Attributes, etc.) | 722 |
| vitepress-mapping-schema.json | Maps docstring sections → VitePress rendering rules | 372 |
| default-mapping.json | Default configuration + transformation rules | 503 |

Scripts (scripts/) - Currently Independent

| Script | Purpose | Uses Schema? |
|--------|---------|--------------|
| validate_optimizer_docs.py | COCO/BBOB compliance | ❌ Regex-based |
| check_google_docstring_inline_descriptions.py | Inline format | ❌ Regex-based |
| batch_update_docstrings.py | Generate templates | ❌ Hardcoded |
| generate_docs.py | VitePress generation | ❌ Partial |
| fix_docstring_indentation.py | Fix indentation | ❌ Regex-based |
| fix_multiline_returns.py | Fix Returns format | ❌ Regex-based |

Problem: Scripts duplicate validation logic instead of using the schema as the single source of truth.
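To make the single-source-of-truth idea concrete (the schema fragment below is illustrative, not the real contents of docs/schemas/docstring-schema.json), scripts can read constraints out of the schema at runtime instead of each re-stating them as hard-coded regexes:

```python
import json
import re

# Illustrative fragment standing in for docs/schemas/docstring-schema.json
SCHEMA_JSON = json.dumps({
    "properties": {
        "acronym": {"type": "string", "pattern": "^[A-Z][A-Z0-9-]*$"}
    }
})

# Every script compiles the pattern from the schema, so the rule is
# defined in exactly one place.
schema = json.loads(SCHEMA_JSON)
acronym_re = re.compile(schema["properties"]["acronym"]["pattern"])

print(bool(acronym_re.fullmatch("CMA-ES")))  # True
print(bool(acronym_re.fullmatch("cma-es")))  # False
```

If the schema's pattern changes, every consumer picks up the new rule automatically, which is exactly what the regex-based scripts cannot do today.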


🎯 Target Architecture

1. Pydantic Models from JSON Schema

Generate type-safe Python models from docstring-schema.json:

```python
# opt/docstring_models.py (generated from schema)
from pydantic import BaseModel, Field
from typing import Literal

class AlgorithmMetadata(BaseModel):
    algorithm_name: str
    acronym: str = Field(pattern=r"^[A-Z][A-Z0-9-]*$")
    year_introduced: int = Field(ge=1900, le=2100)
    authors: str
    algorithm_class: Literal[
        "Swarm Intelligence", "Evolutionary", "Gradient-Based",
        "Classical", "Metaheuristic", "Physics-Inspired",
        "Probabilistic", "Social-Inspired", "Constrained", "Multi-Objective"
    ]
    complexity: str  # LaTeX notation
    properties: list[str]
    implementation: str = "Python 3.10+"
    coco_compatible: bool

class COCOBBOBSettings(BaseModel):
    search_space: str
    evaluation_budget: str
    default_dimensions: list[int]
    performance_metrics: list[str]

class DocstringSchema(BaseModel):
    """Root model - single source of truth for validation."""
    summary: str = Field(max_length=80)
    algorithm_metadata: AlgorithmMetadata
    coco_bbob_benchmark_settings: COCOBBOBSettings
    args: dict[str, "ArgDefinition"]
    attributes: dict[str, "AttributeDefinition"]
    # ... etc
```
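For intuition, the constraints those Field declarations encode (acronym pattern, year bounds) behave like the stdlib checks below, which today's regex scripts each re-implement by hand; Pydantic runs the equivalent checks automatically on `model_validate`. This sketch is illustrative only:

```python
import re

ACRONYM_RE = re.compile(r"^[A-Z][A-Z0-9-]*$")

def metadata_errors(acronym: str, year_introduced: int) -> list[str]:
    """Mimic two AlgorithmMetadata constraints (illustrative sketch only)."""
    errors = []
    if not ACRONYM_RE.fullmatch(acronym):
        errors.append(f"acronym {acronym!r} must match {ACRONYM_RE.pattern}")
    if not 1900 <= year_introduced <= 2100:
        errors.append(f"year_introduced {year_introduced} outside [1900, 2100]")
    return errors

print(metadata_errors("PSO", 1995))       # []
print(len(metadata_errors("pso", 1889)))  # 2
```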

2. Scripts Use Pydantic Models

```python
# scripts/validate_optimizer_docs.py
from opt.docstring_models import DocstringSchema
from pydantic import ValidationError

def validate_docstring(parsed_docstring: dict) -> list[str]:
    """Validate using Pydantic model (schema-driven)."""
    try:
        DocstringSchema.model_validate(parsed_docstring)
        return []
    except ValidationError as e:
        return [str(err) for err in e.errors()]
```

3. Pre-Commit Hook Integration

```yaml
# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: validate-docstrings-pydantic
      name: Validate docstrings against schema (Pydantic)
      entry: python scripts/validate_optimizer_docs.py
      language: python
      files: ^opt/(classical|constrained|...|swarm_intelligence)/.*\.py$
      additional_dependencies: [pydantic>=2.0]
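For context on how the entry script is invoked: pre-commit passes the staged files matching `files` as command-line arguments and treats any non-zero exit status as a failed hook, which blocks the commit. A self-contained stdlib sketch of that contract (the file paths and the `.py`-suffix check are invented for illustration):

```python
import sys

def main(argv: list[str]) -> int:
    """Validate each staged file; return 0 (hook passes) or 1 (commit blocked)."""
    errors: list[str] = []
    for path in argv:
        # A real hook would parse and Pydantic-validate the docstring here.
        if not path.endswith(".py"):
            errors.append(f"{path}: not a Python file")
    if errors:
        print("\n".join(errors), file=sys.stderr)
        return 1
    return 0

# pre-commit effectively runs: python scripts/validate_optimizer_docs.py file1 file2 ...
print(main(["opt/classical/nelder_mead.py"]))  # 0
print(main(["docs/schemas/readme.txt"]))       # 1
```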

4. CI Pipeline

```yaml
# .github/workflows/docs-validation.yml
jobs:
  validate-docstrings:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
      - name: Install dependencies
        run: uv sync
      - name: Validate all optimizer docstrings
        run: uv run python scripts/validate_optimizer_docs.py --all
      - name: Generate docs (dry-run)
        run: uv run python scripts/generate_docs.py --dry-run
```

✅ Acceptance Criteria

Phase 1: Pydantic Model Generation

  • Install datamodel-code-generator, or write the Pydantic models by hand
  • Generate opt/docstring_models.py from docstring-schema.json
  • Add Pydantic as project dependency (pyproject.toml)
  • Validate models match schema definitions

Phase 2: Script Integration

  • Refactor validate_optimizer_docs.py to use Pydantic models
  • Refactor check_google_docstring_inline_descriptions.py to use Pydantic
  • Refactor generate_docs.py to use VitePress mapping schema
  • Add DocstringParser class to convert raw docstring β†’ dict β†’ Pydantic model

Phase 3: Pre-Commit Enforcement

  • Update .pre-commit-config.yaml with Pydantic-based validators
  • Consolidate overlapping hooks into unified validator
  • Add schema validation to pre-commit

Phase 4: CI Pipeline

  • Create .github/workflows/docs-validation.yml
  • Add job: Validate all optimizer docstrings
  • Add job: Generate docs (dry-run validation)
  • Add job: Schema consistency check

Phase 5: Documentation

  • Document Pydantic model usage in scripts/README.md
  • Add schema update workflow documentation
  • Document how to extend schema for new fields

🔧 Implementation Tasks

Task 1: Generate Pydantic Models

```bash
# Option A: Use datamodel-code-generator
uv add --dev datamodel-code-generator
datamodel-codegen --input docs/schemas/docstring-schema.json --output opt/docstring_models.py

# Option B: Manual implementation (more control)
# Create opt/docstring_models.py manually based on schema
```

Task 2: Create DocstringParser

```python
# scripts/docstring_parser.py
import ast
import re

from opt.docstring_models import DocstringSchema

class DocstringParser:
    """Parse a Python docstring into a validated Pydantic model."""

    def parse_file(self, filepath: str) -> DocstringSchema:
        """Extract and validate the class docstring from a file."""
        with open(filepath) as f:
            tree = ast.parse(f.read())
        # ... parse docstring sections into parsed_dict
        return DocstringSchema.model_validate(parsed_dict)
```

Task 3: Unified Pre-Commit Hook

```python
# scripts/unified_validator.py
"""Unified docstring validator using Pydantic schemas."""
import sys

from pydantic import ValidationError

from scripts.docstring_parser import DocstringParser

def format_errors(file: str, e: ValidationError) -> list[str]:
    """Render each validation error as 'file: location: message'."""
    return [f"{file}: {err['loc']}: {err['msg']}" for err in e.errors()]

def main(files: list[str]) -> int:
    parser = DocstringParser()
    errors = []
    for file in files:
        try:
            parser.parse_file(file)  # Pydantic validation happens here
        except ValidationError as e:
            errors.extend(format_errors(file, e))
    if errors:
        print("\n".join(errors))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

📊 Progress Tracking

| Phase | Status |
|-------|--------|
| Phase 1: Pydantic Models | 🚧 Not Started |
| Phase 2: Script Integration | 🚧 Not Started |
| Phase 3: Pre-Commit | ✅ Partial (hooks exist, need Pydantic) |
| Phase 4: CI Pipeline | 🚧 Not Started |
| Phase 5: Documentation | 🚧 Not Started |

Existing Work (Reference)


🔗 Related PRs

| PR | Description |
|----|-------------|
| #91 | Batch docstring update script |
| #94 | Pre-commit hooks for validation |
| #100-#112 | Category docstring updates (all 10 categories) |

📚 References


🏷️ Priority

High - Foundation for consistent documentation across 120 algorithms.

⏱️ Estimated Effort

| Task | Time |
|------|------|
| Pydantic model generation | 2-3 hours |
| Script refactoring | 4-6 hours |
| Pre-commit integration | 1-2 hours |
| CI pipeline | 2-3 hours |
| Documentation | 2-3 hours |
| Total | 11-17 hours |
