
Schema-Driven Documentation: Pydantic Models + Automated CI/Hook Enforcement #110

@Anselmoo


Documentation Infrastructure: Schema-Driven Validation with Pydantic & Automated Enforcement

🎯 Vision

Create a unified, schema-driven documentation pipeline where:

  1. JSON Schemas (docs/schemas/) define the single source of truth
  2. Pydantic models are generated from schemas for type-safe Python validation
  3. Scripts (scripts/) use Pydantic models for consistent validation
  4. Pre-commit hooks enforce compliance on every commit
  5. CI pipeline validates on every PR

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                    Schema-Driven Validation Pipeline                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  docs/schemas/                    scripts/                    Enforcement    │
│  ┌──────────────────┐            ┌──────────────────┐        ┌────────────┐  │
│  │ docstring-       │  generates │ Pydantic Models  │  used  │ Pre-commit │  │
│  │ schema.json      │───────────▶│ (type-safe)     │───────▶│ Hooks      │  │
│  ├──────────────────┤            ├──────────────────┤        ├────────────┤  │
│  │ vitepress-       │            │ validate_*.py    │        │ CI/CD      │  │
│  │ mapping.json     │            │ generate_*.py    │        │ Pipeline   │  │
│  ├──────────────────┤            │ check_*.py       │        └────────────┘  │
│  │ default-         │            └──────────────────┘                        │
│  │ mapping.json     │                                                        │
│  └──────────────────┘                                                        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

πŸ“ Current Assets

JSON Schemas (docs/schemas/)

| File | Purpose | Lines |
|------|---------|-------|
| docstring-schema.json | Defines COCO/BBOB docstring structure (AlgorithmMetadata, Args, Attributes, etc.) | 722 |
| vitepress-mapping-schema.json | Maps docstring sections → VitePress rendering rules | 372 |
| default-mapping.json | Default configuration + transformation rules | 503 |

Scripts (scripts/) - Currently Independent

| Script | Purpose | Uses Schema? |
|--------|---------|--------------|
| validate_optimizer_docs.py | COCO/BBOB compliance | ❌ Regex-based |
| check_google_docstring_inline_descriptions.py | Inline format | ❌ Regex-based |
| batch_update_docstrings.py | Generate templates | ❌ Hardcoded |
| generate_docs.py | VitePress generation | ❌ Partial |
| fix_docstring_indentation.py | Fix indentation | ❌ Regex-based |
| fix_multiline_returns.py | Fix Returns format | ❌ Regex-based |

Problem: Scripts duplicate validation logic instead of using the schema as the single source of truth.
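To make the single-source-of-truth idea concrete (the schema fragment below is illustrative, not the real contents of docs/schemas/docstring-schema.json), scripts can read constraints out of the schema at runtime instead of each re-stating them as hard-coded regexes:

```python
import json
import re

# Illustrative fragment standing in for docs/schemas/docstring-schema.json
SCHEMA_JSON = json.dumps({
    "properties": {
        "acronym": {"type": "string", "pattern": "^[A-Z][A-Z0-9-]*$"}
    }
})

# Every script compiles the pattern from the schema, so the rule is
# defined in exactly one place.
schema = json.loads(SCHEMA_JSON)
acronym_re = re.compile(schema["properties"]["acronym"]["pattern"])

print(bool(acronym_re.fullmatch("CMA-ES")))  # True
print(bool(acronym_re.fullmatch("cma-es")))  # False
```

If the schema's pattern changes, every consumer picks up the new rule automatically, which is exactly what the regex-based scripts cannot do today.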


🎯 Target Architecture

1. Pydantic Models from JSON Schema

Generate type-safe Python models from docstring-schema.json:

```python
# opt/docstring_models.py (generated from schema)
from pydantic import BaseModel, Field
from typing import Literal

class AlgorithmMetadata(BaseModel):
    algorithm_name: str
    acronym: str = Field(pattern=r"^[A-Z][A-Z0-9-]*$")
    year_introduced: int = Field(ge=1900, le=2100)
    authors: str
    algorithm_class: Literal[
        "Swarm Intelligence", "Evolutionary", "Gradient-Based",
        "Classical", "Metaheuristic", "Physics-Inspired",
        "Probabilistic", "Social-Inspired", "Constrained", "Multi-Objective"
    ]
    complexity: str  # LaTeX notation
    properties: list[str]
    implementation: str = "Python 3.10+"
    coco_compatible: bool

class COCOBBOBSettings(BaseModel):
    search_space: str
    evaluation_budget: str
    default_dimensions: list[int]
    performance_metrics: list[str]

class DocstringSchema(BaseModel):
    """Root model - single source of truth for validation."""
    summary: str = Field(max_length=80)
    algorithm_metadata: AlgorithmMetadata
    coco_bbob_benchmark_settings: COCOBBOBSettings
    args: dict[str, "ArgDefinition"]
    attributes: dict[str, "AttributeDefinition"]
    # ... etc
```
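For intuition, the constraints those Field declarations encode (acronym pattern, year bounds) behave like the stdlib checks below, which today's regex scripts each re-implement by hand; Pydantic runs the equivalent checks automatically on `model_validate`. This sketch is illustrative only:

```python
import re

ACRONYM_RE = re.compile(r"^[A-Z][A-Z0-9-]*$")

def metadata_errors(acronym: str, year_introduced: int) -> list[str]:
    """Mimic two AlgorithmMetadata constraints (illustrative sketch only)."""
    errors = []
    if not ACRONYM_RE.fullmatch(acronym):
        errors.append(f"acronym {acronym!r} must match {ACRONYM_RE.pattern}")
    if not 1900 <= year_introduced <= 2100:
        errors.append(f"year_introduced {year_introduced} outside [1900, 2100]")
    return errors

print(metadata_errors("PSO", 1995))       # []
print(len(metadata_errors("pso", 1889)))  # 2
```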

2. Scripts Use Pydantic Models

```python
# scripts/validate_optimizer_docs.py
from opt.docstring_models import DocstringSchema
from pydantic import ValidationError

def validate_docstring(parsed_docstring: dict) -> list[str]:
    """Validate using Pydantic model (schema-driven)."""
    try:
        DocstringSchema.model_validate(parsed_docstring)
        return []
    except ValidationError as e:
        return [str(err) for err in e.errors()]
```

3. Pre-Commit Hook Integration

```yaml
# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: validate-docstrings-pydantic
      name: Validate docstrings against schema (Pydantic)
      entry: python scripts/validate_optimizer_docs.py
      language: python
      files: ^opt/(classical|constrained|...|swarm_intelligence)/.*\.py$
      additional_dependencies: [pydantic>=2.0]
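For context on how the entry script is invoked: pre-commit passes the staged files matching `files` as command-line arguments and treats any non-zero exit status as a failed hook, which blocks the commit. A self-contained stdlib sketch of that contract (the file paths and the `.py`-suffix check are invented for illustration):

```python
import sys

def main(argv: list[str]) -> int:
    """Validate each staged file; return 0 (hook passes) or 1 (commit blocked)."""
    errors: list[str] = []
    for path in argv:
        # A real hook would parse and Pydantic-validate the docstring here.
        if not path.endswith(".py"):
            errors.append(f"{path}: not a Python file")
    if errors:
        print("\n".join(errors), file=sys.stderr)
        return 1
    return 0

# pre-commit effectively runs: python scripts/validate_optimizer_docs.py file1 file2 ...
print(main(["opt/classical/nelder_mead.py"]))  # 0
print(main(["docs/schemas/readme.txt"]))       # 1
```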

4. CI Pipeline

```yaml
# .github/workflows/docs-validation.yml
jobs:
  validate-docstrings:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
      - name: Install dependencies
        run: uv sync
      - name: Validate all optimizer docstrings
        run: uv run python scripts/validate_optimizer_docs.py --all
      - name: Generate docs (dry-run)
        run: uv run python scripts/generate_docs.py --dry-run
```

✅ Acceptance Criteria

Phase 1: Pydantic Model Generation

  • Install datamodel-code-generator, or write the Pydantic models by hand
  • Generate opt/docstring_models.py from docstring-schema.json
  • Add Pydantic as project dependency (pyproject.toml)
  • Validate models match schema definitions

Phase 2: Script Integration

  • Refactor validate_optimizer_docs.py to use Pydantic models
  • Refactor check_google_docstring_inline_descriptions.py to use Pydantic
  • Refactor generate_docs.py to use VitePress mapping schema
  • Add DocstringParser class to convert raw docstring β†’ dict β†’ Pydantic model

Phase 3: Pre-Commit Enforcement

  • Update .pre-commit-config.yaml with Pydantic-based validators
  • Consolidate overlapping hooks into unified validator
  • Add schema validation to pre-commit

Phase 4: CI Pipeline

  • Create .github/workflows/docs-validation.yml
  • Add job: Validate all optimizer docstrings
  • Add job: Generate docs (dry-run validation)
  • Add job: Schema consistency check

Phase 5: Documentation

  • Document Pydantic model usage in scripts/README.md
  • Add schema update workflow documentation
  • Document how to extend schema for new fields

🔧 Implementation Tasks

Task 1: Generate Pydantic Models

```bash
# Option A: Use datamodel-code-generator
uv add --dev datamodel-code-generator
datamodel-codegen --input docs/schemas/docstring-schema.json --output opt/docstring_models.py

# Option B: Manual implementation (more control)
# Create opt/docstring_models.py manually based on schema
```

Task 2: Create DocstringParser

```python
# scripts/docstring_parser.py
import ast
import re

from opt.docstring_models import DocstringSchema

class DocstringParser:
    """Parse a Python docstring into a validated Pydantic model."""

    def parse_file(self, filepath: str) -> DocstringSchema:
        """Extract and validate the class docstring from a file."""
        with open(filepath) as f:
            tree = ast.parse(f.read())
        # ... parse docstring sections into parsed_dict
        return DocstringSchema.model_validate(parsed_dict)
```

Task 3: Unified Pre-Commit Hook

```python
# scripts/unified_validator.py
"""Unified docstring validator using Pydantic schemas."""
import sys

from pydantic import ValidationError

from scripts.docstring_parser import DocstringParser

def format_errors(file: str, e: ValidationError) -> list[str]:
    """Render each validation error as 'file: location: message'."""
    return [f"{file}: {err['loc']}: {err['msg']}" for err in e.errors()]

def main(files: list[str]) -> int:
    parser = DocstringParser()
    errors = []
    for file in files:
        try:
            parser.parse_file(file)  # Pydantic validation happens here
        except ValidationError as e:
            errors.extend(format_errors(file, e))
    if errors:
        print("\n".join(errors))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

📊 Progress Tracking

| Phase | Status |
|-------|--------|
| Phase 1: Pydantic Models | 🚧 Not Started |
| Phase 2: Script Integration | 🚧 Not Started |
| Phase 3: Pre-Commit | ✅ Partial (hooks exist, need Pydantic) |
| Phase 4: CI Pipeline | 🚧 Not Started |
| Phase 5: Documentation | 🚧 Not Started |

Existing Work (Reference)


🔗 Related PRs

| PR | Description |
|----|-------------|
| #91 | Batch docstring update script |
| #94 | Pre-commit hooks for validation |
| #100-#112 | Category docstring updates (all 10 categories) |

📚 References


🏷️ Priority

High - Foundation for consistent documentation across 120 algorithms.

⏱️ Estimated Effort

| Task | Time |
|------|------|
| Pydantic model generation | 2-3 hours |
| Script refactoring | 4-6 hours |
| Pre-commit integration | 1-2 hours |
| CI pipeline | 2-3 hours |
| Documentation | 2-3 hours |
| Total | 11-17 hours |
