# Code Generator Design

A code generator that produces documentation and code from Overture Maps Pydantic schema
definitions.

## Problem

Overture Maps schema definitions live in Pydantic models across theme packages. Each
model carries type annotations, field constraints, docstrings, and relationships
(inheritance, composition, discriminated unions). Generating documentation or code from
these models requires introspecting all of that structure and rendering it into output
formats.

Pydantic's internal representation is JSON-schema-oriented and discards the vocabulary
the code generator needs to preserve. `model_json_schema()` flattens `FeatureVersion` (a
NewType wrapping `int32` wrapping `Annotated[int, Field(ge=0, le=2**31 - 1)]`) to
`{"type": "integer", "minimum": 0}` -- the NewType names `FeatureVersion` and `int32`
are gone, custom constraint classes (`GeometryTypeConstraint`, `UniqueItemsConstraint`)
are gone, Python class references are gone, and constraint provenance (which NewType
contributed which bound) is gone. `FieldInfo.annotation` gives the raw annotation, but
Pydantic does not unwrap NewType chains or track multi-depth constraint provenance.

The schema's domain language -- custom primitives (`int32`, `float64`), semantic
NewTypes (`FeatureVersion`, `Sources`), and custom constraint classes -- needs to
survive extraction intact. A single field annotation like `NewType("Foo",
Annotated[list[SomeModel] | None, Field(ge=0)])` encodes optionality, collection type,
element type, constraints, and semantic naming in nested Python typing constructs. Type
definitions regularly nest `Annotated` inside `NewType` inside `Annotated` --
`FeatureVersion = NewType("FeatureVersion", int32)` where `int32 = NewType("int32",
Annotated[int, Field(ge=...)])` -- and constraints at each depth need to be tagged with
the NewType that contributed them.

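This nesting can be seen directly with `typing` introspection. A minimal sketch, using a
plain `Ge` marker class as a stand-in for pydantic's `Field` metadata:

```python
from typing import Annotated, NewType, get_args


# Stand-in constraint marker; the real schema attaches pydantic Field metadata.
class Ge:
    def __init__(self, bound):
        self.bound = bound


# The two-level nesting from above: FeatureVersion -> int32 -> Annotated[int, ...]
int32 = NewType("int32", Annotated[int, Ge(0)])
FeatureVersion = NewType("FeatureVersion", int32)

# Each NewType layer exposes what it wraps via __supertype__ ...
inner = FeatureVersion.__supertype__
assert inner.__name__ == "int32"

# ... and get_args() splits an Annotated type into (base, *metadata).
base, constraint = get_args(inner.__supertype__)
assert base is int and constraint.bound == 0
```

Nothing here records which NewType contributed which constraint -- that bookkeeping is
exactly what the extraction layer adds.
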
The code generator solves this by extracting type information once into a flat,
navigable representation (`TypeInfo`), then passing that to renderers that produce
output without touching Python's type system.

## Inputs and Outputs

**Inputs**: Pydantic `BaseModel` subclasses discovered via `overture.models` entry
points, plus example data from theme `pyproject.toml` files. Examples serve two
purposes: rendered examples in documentation pages, and a starting point for generating
tests that verify behavior of generated code.

**Current outputs**: Markdown documentation pages with field tables, cross-page links,
constraint descriptions, and examples.

**Planned outputs**: Arrow schemas, PySpark expressions.

## Architecture

Four layers with strict downward imports -- no layer references the one above it:

```text
Rendering      Output formatting, all presentation decisions
   ^
Output Layout  What to generate, where it goes, how outputs link
   ^
Extraction     TypeInfo, FieldSpec, ModelSpec, EnumSpec, ...
   ^
Discovery      discover_models() from overture-schema-core
```

`markdown_pipeline.py` orchestrates the pipeline without I/O: it expands feature trees,
collects supplementary types, builds placement registries, computes reverse references,
and calls renderers -- returning `RenderedPage` objects. The CLI (`cli.py`) is a thin
Click wrapper that calls `generate_markdown_pages()` and writes files to disk.

```mermaid
graph TD
    subgraph Discovery
        DM["discover_models()"]
    end

    DM -->|"dict[ModelKey, type]"| EX

    subgraph Extraction
        EX["type_analyzer / extractors"]
        EX -->|"ModelSpec, UnionSpec"| TREE["expand_model_tree()"]
    end

    TREE -->|"FeatureSpec[]"| OL

    subgraph "Output Layout"
        OL["type_collection"]
        OL -->|"SupplementarySpec{}"| PA["path_assignment"]
        PA -->|"dict[str, Path]"| LC["link_computation"]
        RR["reverse_references"]
    end

    subgraph Rendering
        R["markdown_renderer"]
        TR["type_registry"] -.->|"type name resolution"| R
    end

    subgraph Orchestration
        MP["markdown_pipeline"]
    end

    OL --> MP
    LC --> MP
    RR --> MP
    MP --> R
    R -->|"RenderedPage[]"| MP
    MP -->|"list[RenderedPage]"| CLI["cli.py → disk"]
```

## Extraction

### `analyze_type` -- iterative type unwrapping

`analyze_type(annotation)` is a single iterative function that peels type annotation
layers in a fixed order, accumulating information into an `_UnwrapState`:

1. **NewType**: Records the outermost name (user-facing semantic identity, e.g.
   `FeatureVersion`) and updates the "current" name (used for constraint provenance and
   as `base_type` at terminal)
2. **Annotated**: Collects constraints from metadata, each tagged with whichever NewType
   was most recently entered. Extracts `Field.description` when present
3. **Union**: Filters out `None` (marks optional), `Sentinel`, and `Literal` sentinel
   arms. If multiple concrete `BaseModel` arms remain, classifies as `UNION`; otherwise
   continues with the single remaining arm
4. **list / dict**: Sets collection flags, continues into element types
5. **Terminal**: Classifies as `PRIMITIVE`, `LITERAL`, `ENUM`, `MODEL`, or `UNION`

The result is `TypeInfo` -- a flat dataclass that fully describes the unwrapped type:
classification (`TypeKind`), optional/list/dict flags, accumulated constraints with
provenance, NewType names, source type, literal values, and (for UNION kind) the tuple
of concrete `BaseModel` member types. Dict types carry recursively analyzed `TypeInfo`
for their key and value types.

Multi-depth `Annotated` layers (common in practice, since NewTypes wrap `Annotated`
types that wrap further NewTypes) are handled naturally by the loop -- each iteration
processes the next wrapper. Constraints from each `Annotated` layer are tagged with the
NewType active at that depth.

### Extractors by domain

Extraction is split by entity kind:

- `model_extraction.py`: Pydantic model -> `ModelSpec` (fields in MRO-aware
  documentation order, alias-resolved names, model-level constraints)
- `enum_extraction.py`: Enum class -> `EnumSpec`
- `newtype_extraction.py`: NewType -> `NewTypeSpec`
- `union_extraction.py`: Discriminated union alias -> `UnionSpec`
- `primitive_extraction.py`: Numeric primitives -> `PrimitiveSpec`

Each calls `analyze_type()` for field types. Tree expansion (`expand_model_tree()`)
walks MODEL-kind fields to populate nested model references, with a shared cache and
cycle detection (`starts_cycle=True`).

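The cache-and-cycle-detection pattern can be sketched on a toy reference graph. The real
`expand_model_tree()` walks `ModelSpec` fields; the dict-based nodes here are purely
illustrative:

```python
def expand_tree(node, children, cache=None, in_progress=None):
    """Depth-first expansion with a shared cache and cycle detection."""
    cache = {} if cache is None else cache
    in_progress = set() if in_progress is None else in_progress
    if node in cache:  # shared cache: expand each type once
        return cache[node]
    if node in in_progress:  # back-edge: stop and mark the boundary
        return {"name": node, "starts_cycle": True, "children": []}
    in_progress.add(node)
    expanded = {
        "name": node,
        "starts_cycle": False,
        "children": [
            expand_tree(child, children, cache, in_progress)
            for child in children.get(node, [])
        ],
    }
    in_progress.discard(node)
    cache[node] = expanded
    return expanded


# Toy graph: Segment references Connector, which references Segment back.
graph = {"Segment": ["Connector"], "Connector": ["Segment"]}
tree = expand_tree("Segment", graph)
assert tree["children"][0]["children"][0]["starts_cycle"] is True
```
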
### Unions and the FeatureSpec protocol

Discriminated unions (e.g. `Segment = Annotated[Union[RoadSegment, ...],
Discriminator(...)]`) are type aliases, not classes. `UnionSpec` captures the union
structure: member types, discriminator field and value mapping, and a merged field list.
Fields shared across all variants appear once; fields present in some variants are
wrapped in `AnnotatedField` with `variant_sources` indicating which members contribute
them. The common base class is identified so shared fields can be deduplicated.

`FeatureSpec` is a `Protocol` satisfied by both `ModelSpec` and `UnionSpec`. Code that
operates on "any top-level feature" -- tree expansion, supplementary type collection,
rendering dispatch -- uses `FeatureSpec` rather than a concrete type, so union and model
features flow through the same pipeline.

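A trimmed sketch of the protocol pattern -- the attribute set shown here is hypothetical
and far smaller than the real specs:

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class FeatureSpec(Protocol):
    name: str
    fields: list


@dataclass
class ModelSpec:
    name: str
    fields: list = field(default_factory=list)


@dataclass
class UnionSpec:
    name: str
    fields: list = field(default_factory=list)
    discriminator: str = "subtype"


def page_title(spec: FeatureSpec) -> str:
    # Pipeline code depends only on the shared protocol surface, so model
    # and union features flow through the same functions.
    return spec.name.replace("_", " ").title()


assert isinstance(ModelSpec("connector"), FeatureSpec)
assert page_title(UnionSpec("road_segment")) == "Road Segment"
```

A `Protocol` (rather than a shared base class) keeps `ModelSpec` and `UnionSpec` free to
diverge structurally while still flowing through one pipeline.
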
### Constraints

Field-level constraints come from `Annotated` metadata -- `Ge`, `Le`, `Interval`, custom
constraint classes. Each is tagged with the NewType that contributed it via
`ConstraintSource`.

Model-level constraints come from decorators (`@require_any_of`, `@require_if`,
`@forbid_if`) and are extracted via `ModelConstraint.get_model_constraints()`.

## Output Layout

Determines the full set of artifacts to generate, where each lives on disk, and how they
reference each other.

### Supplementary type collection

`collect_all_supplementary_types()` walks the expanded field trees of all feature specs,
extracting enums, semantic NewTypes, and sub-models that need their own output. Returns
`dict[str, SupplementarySpec]`.

### Module-mirrored output paths

Output paths derive from the source Python module path relative to a computed schema
root (`compute_schema_root()` finds the longest common prefix of all entry point module
paths). `compute_output_dir()` maps a Python module to an output directory. Feature
models land in their module-derived directory. Supplementary types land at their own
module-derived path, with a `types/` segment inserted when they fall under a feature
directory.

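Assuming module paths are dotted strings, the mapping might be sketched as follows. The
function bodies and the module names in the example are illustrative, not the real
implementations:

```python
from pathlib import PurePosixPath


def compute_schema_root(module_paths):
    """Longest common dotted prefix of all entry-point module paths."""
    split = [path.split(".") for path in module_paths]
    root = []
    for parts in zip(*split):
        if len(set(parts)) != 1:  # segments diverge here
            break
        root.append(parts[0])
    return ".".join(root)


def compute_output_dir(module_path, schema_root):
    """Map a module to an output directory relative to the schema root."""
    relative = module_path[len(schema_root):].lstrip(".")
    return PurePosixPath(*relative.split("."))


root = compute_schema_root([
    "overture.schema.transportation.segment",
    "overture.schema.buildings.building",
])
assert root == "overture.schema"
```
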
### Link computation

`LinkContext` carries the current output's path and the full type-to-path registry. When
a renderer formats a type reference, it looks up the target in the registry and computes
a relative path. Links exist only for types with registry entries, avoiding broken
references to ungenerated outputs.

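The lookup-then-relativize step, sketched with a minimal stand-in for `LinkContext` (the
real class and the paths in the example differ):

```python
import posixpath
from dataclasses import dataclass


@dataclass
class LinkContext:
    current_path: str  # output path of the page being rendered
    registry: dict     # type name -> output path

    def link_for(self, type_name):
        target = self.registry.get(type_name)
        if target is None:
            return None  # no registry entry -> no link emitted
        here = posixpath.dirname(self.current_path)
        return posixpath.relpath(target, start=here)


ctx = LinkContext(
    current_path="transportation/segment.md",
    registry={"FeatureVersion": "types/feature_version.md"},
)
assert ctx.link_for("FeatureVersion") == "../types/feature_version.md"
assert ctx.link_for("UnknownType") is None  # ungenerated -> plain text, not a link
```
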
### Reverse references

`compute_reverse_references()` walks feature specs to build `dict[type_name,
list[UsedByEntry]]` for "Used By" sections.

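A sketch of the reverse-reference pass, modeling feature specs as
`(feature_name, referenced_type_names)` pairs purely for illustration:

```python
from collections import defaultdict


def compute_reverse_references(features):
    """Invert feature -> type references into type -> used-by lists."""
    used_by = defaultdict(list)
    for feature_name, referenced in features:
        for type_name in referenced:
            used_by[type_name].append(feature_name)
    return dict(used_by)


refs = compute_reverse_references([
    ("segment", ["FeatureVersion", "Sources"]),
    ("building", ["FeatureVersion"]),
])
assert refs["FeatureVersion"] == ["segment", "building"]
```
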
## Rendering

Renderers consume specs and own all presentation decisions -- formatting, casing, link
syntax. Extraction and the type registry carry no presentation logic.

### Type registry

`type_registry.py` maps type names to per-target string representations via
`TypeMapping`. `format_type_string()` wraps the resolved name with list/optional
qualifiers. `is_semantic_newtype()` distinguishes NewTypes that deserve their own
identity (like `FeatureVersion` wrapping `int32`) from pass-through aliases to
registered primitives.

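A sketch of what the mapping table might look like -- the column names, registry shape,
and rendered strings here are assumptions, not the real `TypeMapping`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeMapping:
    markdown: str  # display name on documentation pages
    arrow: str     # planned target from "Extension Points"


REGISTRY = {
    "int32": TypeMapping(markdown="integer", arrow="int32"),
    "float64": TypeMapping(markdown="number", arrow="float64"),
}


def format_type_string(name, target, *, is_list=False, optional=False):
    """Wrap the resolved per-target name with list/optional qualifiers."""
    text = getattr(REGISTRY[name], target)
    if is_list:
        text = f"list of {text}"
    if optional:
        text = f"{text} (optional)"
    return text


assert format_type_string("int32", "markdown", is_list=True) == "list of integer"
```

Adding an output target then means adding one column, which is why the extraction layer
never needs to change.
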
### Markdown renderer

Jinja2 templates render feature, enum, NewType, primitive, and geometry pages.
`render_feature()` expands MODEL-kind fields inline with dot-notation (e.g.,
`sources[].dataset`), stopping at cycle boundaries. `format_type()` in
`markdown_type_format.py` converts `TypeInfo` into link-aware display strings using
`LinkContext`.

### Constraint prose

`field_constraint_description.py` and `model_constraint_description.py` convert
constraint objects into human-readable descriptions. Field constraints produce inline
text. Model constraints produce section-level descriptions and per-field notes, with
consolidation for related conditional constraints (`require_if` / `forbid_if` grouped by
trigger).

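A sketch of the field-constraint prose step, with stand-in `Ge`/`Le` classes and
invented wording -- the real descriptions may read differently:

```python
from dataclasses import dataclass


@dataclass
class Ge:  # stand-in for the collected constraint objects
    bound: int


@dataclass
class Le:
    bound: int


def describe_field_constraint(constraint):
    """Map a constraint object to inline prose for the field table."""
    if isinstance(constraint, Ge):
        return f"must be at least {constraint.bound}"
    if isinstance(constraint, Le):
        return f"must be at most {constraint.bound}"
    return "see schema definition"  # fallback for unrecognized constraints


assert describe_field_constraint(Ge(0)) == "must be at least 0"
```
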
### Example loader

Loads example data from theme `pyproject.toml` files, validates against Pydantic models,
and flattens to dot-notation rows for display in feature pages. Also provides a starting
point for generated test data.

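The dot-notation flattening can be sketched as follows. The `sources[].dataset`
convention matches the feature pages; the function itself and the sample data are
illustrative:

```python
def flatten_example(value, prefix=""):
    """Flatten nested example data into (dot-notation path, value) rows."""
    rows = []
    if isinstance(value, dict):
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            rows.extend(flatten_example(sub, path))
    elif isinstance(value, list):
        for item in value:
            rows.extend(flatten_example(item, f"{prefix}[]"))  # mark list traversal
    else:
        rows.append((prefix, value))
    return rows


rows = flatten_example({"version": 3, "sources": [{"dataset": "osm"}]})
assert rows == [("version", 3), ("sources[].dataset", "osm")]
```
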
## Extension Points

**Adding a new output target** (Arrow schemas next, PySpark expressions after): Add a
column to `TypeMapping` in `type_registry.py` for type-name resolution. Write a new
renderer module that consumes specs and the type registry. The extraction layer and
output layout are target-independent.

**Adding a new type kind**: Add a variant to `TypeKind` in `type_analyzer.py`. Handle it
in the terminal classification of `analyze_type()`. Add an extraction function and spec
dataclass if needed. Update renderers to handle the new kind.

**Adding a new constraint type**: The iterative unwrapper collects it automatically (any
`Annotated` metadata becomes a `ConstraintSource`). Add a case to
`describe_field_constraint()` for the prose representation.