
Commit dafd3d7

docs(codegen): add design doc, walkthrough, and README
Design doc covers the four-layer architecture, analyze_type(), domain-specific extractors, and extension points for new output targets. Walkthrough traces Segment through the full pipeline module-by-module in dependency order, with FeatureVersion as a secondary example for constraint provenance in the type analyzer. README describes the problem (Pydantic flattens domain vocabulary), the "unwrap once, render many" approach, CLI usage, architecture overview, and programmatic API.
1 parent 8cdcdd1 commit dafd3d7

File tree

3 files changed: +1070 −8 lines

Lines changed: 80 additions & 8 deletions

# Overture Schema Codegen

Generates documentation from Overture Maps Pydantic schema definitions.

Pydantic's `model_json_schema()` flattens the schema's domain vocabulary into JSON
Schema primitives. NewType names disappear, constraint provenance is lost (which NewType
contributed which bound), custom constraint classes lose their identity (a
`GeometryTypeConstraint` becomes an anonymous `enum` array), and discriminated union
structure collapses into `anyOf` arrays with duplicated fields.

Navigating Python's type annotation machinery -- NewType chains, nested `Annotated`
wrappers, union filtering, generic resolution -- is complex. The codegen does it once.
`analyze_type()` unwraps annotations into `TypeInfo`, a flat target-independent
representation. Extractors build specs from `TypeInfo`. Renderers consume specs without
touching the type system. New output targets (Arrow schemas, PySpark expressions) add
renderers, not extraction logic.

## Usage

```bash
# Generate markdown documentation for all themes
overture-codegen generate --format markdown --output-dir docs/schema/reference

# Generate for a single theme
overture-codegen generate --format markdown --theme buildings --output-dir out/

# List discovered models
overture-codegen list
```

The generator discovers models via `overture.models` entry points (provided by theme
packages like `overture-schema-buildings-theme`), extracts type information, and renders
output pages with cross-page links, constraint descriptions, and validated examples.

## Architecture

Four layers with strict downward imports -- no layer references the one above it:

```text
Rendering        Output formatting, all presentation decisions
    ^
Output Layout    What to generate, where it goes, how outputs link
    ^
Extraction       TypeInfo, FieldSpec, ModelSpec, UnionSpec
    ^
Discovery        discover_models() from overture-schema-core
```

**Discovery** loads registered Pydantic models via entry points. The returned dict
includes both concrete `BaseModel` subclasses (like `Building`) and discriminated union
type aliases (like `Segment`). Both satisfy the `FeatureSpec` protocol and flow through
the same pipeline.

**Extraction** unwraps type annotations into specs. `analyze_type()` is the central
function -- a single iterative loop that peels NewType, Annotated, Union, and container
wrappers, accumulating constraints tagged with the NewType that contributed them.
Domain-specific extractors (`model_extraction`, `union_extraction`, `enum_extraction`,
`newtype_extraction`, `primitive_extraction`) call `analyze_type()` for field types and
produce spec dataclasses.

**Output Layout** determines what artifacts to generate and where they go. Supplementary
type collection walks expanded feature trees to find referenced enums, NewTypes, and
sub-models. Path assignment maps every type to an output file path mirroring the Python
module structure. Link computation and reverse references enable cross-page navigation.

**Rendering** consumes specs and owns all presentation decisions. Markdown output uses
Jinja2 templates for feature pages (with field tables, constraint sections, and
examples), enum pages, NewType pages, and aggregate primitive/geometry reference pages.

`markdown_pipeline.py` orchestrates the full pipeline without I/O, returning
`list[RenderedPage]`. The CLI writes files to disk with Docusaurus frontmatter.

## Programmatic use

```python
from overture.schema.codegen.type_analyzer import analyze_type, TypeKind

info = analyze_type(some_annotation)  # e.g. a FeatureVersion-typed field annotation
assert info.kind == TypeKind.PRIMITIVE
assert info.base_type == "int32"
assert info.newtype_name == "FeatureVersion"
# Constraints carry provenance:
for cs in info.constraints:
    print(f"{cs.constraint} from {cs.source}")
```

## Further reading

- [Design document](docs/design.md) -- architecture, extension points, data flow
  diagrams
- [Walkthrough](docs/walkthrough.md) -- module-by-module narrative tracing Segment
  through the full pipeline
Lines changed: 254 additions & 0 deletions

# Code Generator Design

Code generator that produces documentation and code from Overture Maps Pydantic schema
definitions.

## Problem

Overture Maps schema definitions live in Pydantic models across theme packages. Each
model carries type annotations, field constraints, docstrings, and relationships
(inheritance, composition, discriminated unions). Generating documentation or code from
these models requires introspecting all of that structure and rendering it into output
formats.

Pydantic's internal representation is JSON-schema-oriented and discards the vocabulary
the code generator needs to preserve. `model_json_schema()` flattens `FeatureVersion` (a
NewType wrapping `int32` wrapping `Annotated[int, Field(ge=0, le=2^31-1)]`) to
`{"type": "integer", "minimum": 0}` -- the NewType names `FeatureVersion` and `int32`
are gone, custom constraint classes (`GeometryTypeConstraint`, `UniqueItemsConstraint`)
are gone, Python class references are gone, and constraint provenance (which NewType
contributed which bound) is gone. `FieldInfo.annotation` gives the raw annotation, but
Pydantic does not unwrap NewType chains or track multi-depth constraint provenance.

The schema's domain language -- custom primitives (`int32`, `float64`), semantic
NewTypes (`FeatureVersion`, `Sources`), and custom constraint classes -- needs to
survive extraction intact. A single field annotation like `NewType("Foo",
Annotated[list[SomeModel] | None, Field(ge=0)])` encodes optionality, collection type,
element type, constraints, and semantic naming in nested Python typing constructs. Type
definitions regularly nest `Annotated` inside `NewType` inside `Annotated` --
`FeatureVersion = NewType("FeatureVersion", int32)` where `int32 = NewType("int32",
Annotated[int, Field(ge=...)])` -- and constraints at each depth need to be tagged with
the NewType that contributed them.
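
That nesting, and the flattening it suffers, can be reproduced in a few lines (the
bounds follow the `FeatureVersion` example above; the `Feature` model is illustrative):

```python
from typing import Annotated, NewType
from pydantic import BaseModel, Field

# Two NewType layers over an Annotated int, as described above.
int32 = NewType("int32", Annotated[int, Field(ge=0, le=2**31 - 1)])
FeatureVersion = NewType("FeatureVersion", int32)

class Feature(BaseModel):
    version: FeatureVersion

schema = Feature.model_json_schema()
# The numeric bounds survive, but both NewType names are gone:
print(schema["properties"]["version"])
```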

The code generator solves this by extracting type information once into a flat,
navigable representation (`TypeInfo`), then passing that to renderers that produce
output without touching Python's type system.

## Inputs and Outputs

**Inputs**: Pydantic `BaseModel` subclasses discovered via `overture.models` entry
points, plus example data from theme `pyproject.toml` files. Examples serve two
purposes: rendered examples in documentation pages, and a starting point for generating
tests that verify behavior of generated code.

**Current outputs**: Markdown documentation pages with field tables, cross-page links,
constraint descriptions, and examples.

**Planned outputs**: Arrow schemas, PySpark expressions.

## Architecture

Four layers with strict downward imports -- no layer references the one above it:

```text
Rendering        Output formatting, all presentation decisions
    ^
Output Layout    What to generate, where it goes, how outputs link
    ^
Extraction       TypeInfo, FieldSpec, ModelSpec, EnumSpec, ...
    ^
Discovery        discover_models() from overture-schema-core
```

`markdown_pipeline.py` orchestrates the pipeline without I/O: it expands feature trees,
collects supplementary types, builds placement registries, computes reverse references,
and calls renderers -- returning `RenderedPage` objects. The CLI (`cli.py`) is a thin
Click wrapper that calls `generate_markdown_pages()` and writes files to disk.

```mermaid
graph TD
    subgraph Discovery
        DM["discover_models()"]
    end

    DM -->|"dict[ModelKey, type]"| EX

    subgraph Extraction
        EX["type_analyzer / extractors"]
        EX -->|"ModelSpec, UnionSpec"| TREE["expand_model_tree()"]
    end

    TREE -->|"FeatureSpec[]"| OL

    subgraph "Output Layout"
        OL["type_collection"]
        OL -->|"SupplementarySpec{}"| PA["path_assignment"]
        PA -->|"dict[str, Path]"| LC["link_computation"]
        RR["reverse_references"]
    end

    subgraph Rendering
        R["markdown_renderer"]
        TR["type_registry"] -.->|"type name resolution"| R
    end

    subgraph Orchestration
        MP["markdown_pipeline"]
    end

    OL --> MP
    LC --> MP
    RR --> MP
    MP --> R
    R -->|"RenderedPage[]"| MP
    MP -->|"list[RenderedPage]"| CLI["cli.py → disk"]
```

## Extraction

### `analyze_type` -- iterative type unwrapping

`analyze_type(annotation)` is a single iterative function that peels type annotation
layers in a fixed order, accumulating information into an `_UnwrapState`:

1. **NewType**: Records the outermost name (user-facing semantic identity, e.g.
   `FeatureVersion`) and updates the "current" name (used for constraint provenance and
   as `base_type` at the terminal)
2. **Annotated**: Collects constraints from metadata, each tagged with whichever NewType
   was most recently entered. Extracts `Field.description` when present
3. **Union**: Filters out `None` (marks optional), `Sentinel`, and `Literal` sentinel
   arms. If multiple concrete `BaseModel` arms remain, classifies as `UNION`; otherwise
   continues with the single remaining arm
4. **list / dict**: Sets collection flags, continues into element types
5. **Terminal**: Classifies as `PRIMITIVE`, `LITERAL`, `ENUM`, `MODEL`, or `UNION`

The result is `TypeInfo` -- a flat dataclass that fully describes the unwrapped type:
classification (`TypeKind`), optional/list/dict flags, accumulated constraints with
provenance, NewType names, source type, literal values, and (for the UNION kind) the
tuple of concrete `BaseModel` member types. Dict types carry recursively analyzed
`TypeInfo` for their key and value types.

Multi-depth `Annotated` layers (common in practice, since NewTypes wrap `Annotated`
types that wrap further NewTypes) are handled naturally by the loop -- each iteration
processes the next wrapper. Constraints from each `Annotated` layer are tagged with the
NewType active at that depth.
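
The NewType and `Annotated` steps can be illustrated with a stripped-down loop (a
sketch only -- the real `analyze_type()` also handles unions, containers, and terminal
classification):

```python
from typing import Annotated, NewType, get_args, get_origin

def unwrap(annotation):
    """Peel NewType and Annotated layers, tagging each constraint with
    the NewType most recently entered (constraint provenance)."""
    outer_name = None        # user-facing semantic identity
    current_name = None      # provenance tag for constraints
    constraints = []
    while True:
        supertype = getattr(annotation, "__supertype__", None)
        if supertype is not None:              # NewType layer
            if outer_name is None:
                outer_name = annotation.__name__
            current_name = annotation.__name__
            annotation = supertype
        elif get_origin(annotation) is Annotated:
            base, *metadata = get_args(annotation)
            constraints += [(m, current_name) for m in metadata]
            annotation = base
        else:                                  # terminal
            return outer_name, annotation, constraints

int32 = NewType("int32", Annotated[int, "ge=0"])
FeatureVersion = NewType("FeatureVersion", int32)
print(unwrap(FeatureVersion))
# → ('FeatureVersion', <class 'int'>, [('ge=0', 'int32')])
```

Note how the `"ge=0"` marker is tagged with `int32`, the NewType active at its depth,
while `FeatureVersion` remains the outermost semantic name.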

### Extractors by domain

Extraction is split by entity kind:

- `model_extraction.py`: Pydantic model -> `ModelSpec` (fields in MRO-aware
  documentation order, alias-resolved names, model-level constraints)
- `enum_extraction.py`: Enum class -> `EnumSpec`
- `newtype_extraction.py`: NewType -> `NewTypeSpec`
- `union_extraction.py`: Discriminated union alias -> `UnionSpec`
- `primitive_extraction.py`: Numeric primitives -> `PrimitiveSpec`

Each calls `analyze_type()` for field types. Tree expansion (`expand_model_tree()`)
walks MODEL-kind fields to populate nested model references, with a shared cache and
cycle detection (`starts_cycle=True`).

### Unions and the FeatureSpec protocol

Discriminated unions (e.g. `Segment = Annotated[Union[RoadSegment, ...],
Discriminator(...)]`) are type aliases, not classes. `UnionSpec` captures the union
structure: member types, discriminator field and value mapping, and a merged field list.
Fields shared across all variants appear once; fields present in only some variants are
wrapped in `AnnotatedField` with `variant_sources` indicating which members contribute
them. The common base class is identified so shared fields can be deduplicated.
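
A minimal discriminated-union alias looks like this (member classes and the `subtype`
discriminator values are illustrative; the real Segment union has more members and uses
`Discriminator(...)`):

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class RoadSegment(BaseModel):
    subtype: Literal["road"]

class RailSegment(BaseModel):
    subtype: Literal["rail"]

# A type alias, not a class: there is no Segment class to introspect.
Segment = Annotated[Union[RoadSegment, RailSegment], Field(discriminator="subtype")]

# Validation dispatches on the discriminator field.
segment = TypeAdapter(Segment).validate_python({"subtype": "rail"})
print(type(segment).__name__)  # → RailSegment
```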

`FeatureSpec` is a `Protocol` satisfied by both `ModelSpec` and `UnionSpec`. Code that
operates on "any top-level feature" -- tree expansion, supplementary type collection,
rendering dispatch -- uses `FeatureSpec` rather than a concrete type, so union and model
features flow through the same pipeline.
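
Structurally the pattern works like this sketch (the `name` attribute is an assumption;
the real `FeatureSpec` declares whatever shared surface the pipeline needs):

```python
from dataclasses import dataclass
from typing import Protocol

class FeatureSpec(Protocol):
    """Any top-level feature -- matched structurally, not by inheritance."""
    name: str

@dataclass
class ModelSpec:
    name: str

@dataclass
class UnionSpec:
    name: str
    members: tuple[str, ...] = ()

def page_title(spec: FeatureSpec) -> str:
    # One code path for models and unions; no isinstance dispatch.
    return f"# {spec.name}"

print(page_title(ModelSpec("Building")))  # → # Building
print(page_title(UnionSpec("Segment")))   # → # Segment
```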

### Constraints

Field-level constraints come from `Annotated` metadata -- `Ge`, `Le`, `Interval`, custom
constraint classes. Each is tagged with the NewType that contributed it via
`ConstraintSource`.

Model-level constraints come from decorators (`@require_any_of`, `@require_if`,
`@forbid_if`) and are extracted via `ModelConstraint.get_model_constraints()`.

## Output Layout

Determines the full set of artifacts to generate, where each lives on disk, and how they
reference each other.

### Supplementary type collection

`collect_all_supplementary_types()` walks the expanded field trees of all feature specs,
extracting enums, semantic NewTypes, and sub-models that need their own output. Returns
`dict[str, SupplementarySpec]`.

### Module-mirrored output paths

Output paths derive from the source Python module path relative to a computed schema
root (`compute_schema_root()` finds the longest common prefix of all entry point module
paths). `compute_output_dir()` maps a Python module to an output directory. Feature
models land in their module-derived directory. Supplementary types land at their own
module-derived path, with a `types/` segment inserted when they fall under a feature
directory.
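
The longest-common-prefix step might look like this (a simplified stand-in for the real
`compute_schema_root()`; the module paths are illustrative):

```python
def compute_schema_root(module_paths: list[str]) -> str:
    """Longest common dotted prefix of all entry point module paths."""
    split = [p.split(".") for p in module_paths]
    root = []
    for parts in zip(*split):
        if len(set(parts)) != 1:  # paths diverge here
            break
        root.append(parts[0])
    return ".".join(root)

print(compute_schema_root([
    "overture.schema.buildings.building",
    "overture.schema.transportation.segment",
]))  # → overture.schema
```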

### Link computation

`LinkContext` carries the current output's path and the full type-to-path registry. When
a renderer formats a type reference, it looks up the target in the registry and computes
a relative path. Links exist only for types with registry entries, avoiding broken
references to ungenerated outputs.

### Reverse references

`compute_reverse_references()` walks feature specs to build `dict[type_name,
list[UsedByEntry]]` for "Used By" sections.
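
The reverse-reference pass is essentially an inverted index (the input shape here is a
simplification of walking full feature specs):

```python
from collections import defaultdict

def compute_reverse_references(references: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert feature -> referenced types into type -> 'Used By' features."""
    used_by: dict[str, list[str]] = defaultdict(list)
    for feature, type_names in references.items():
        for type_name in type_names:
            used_by[type_name].append(feature)
    return dict(used_by)

print(compute_reverse_references({
    "Building": ["Sources", "FeatureVersion"],
    "Segment": ["Sources"],
}))
```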

## Rendering

Renderers consume specs and own all presentation decisions -- formatting, casing, link
syntax. Extraction and the type registry carry no presentation logic.

### Type registry

`type_registry.py` maps type names to per-target string representations via
`TypeMapping`. `format_type_string()` wraps the resolved name with list/optional
qualifiers. `is_semantic_newtype()` distinguishes NewTypes that deserve their own
identity (like `FeatureVersion` wrapping `int32`) from pass-through aliases to
registered primitives.
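
In sketch form (the column names, registry entries, and qualifier wording are
assumptions -- the real `TypeMapping` defines one column per supported target):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypeMapping:
    markdown: str
    arrow: str  # planned target; value here is purely illustrative

REGISTRY = {
    "int32": TypeMapping(markdown="int32", arrow="int32()"),
}

def format_type_string(name: str, *, is_list: bool = False,
                       optional: bool = False, target: str = "markdown") -> str:
    """Resolve the per-target name, then wrap with list/optional qualifiers."""
    rendered = getattr(REGISTRY[name], target)
    if is_list:
        rendered = f"list of {rendered}"
    if optional:
        rendered = f"{rendered} (optional)"
    return rendered

print(format_type_string("int32", is_list=True, optional=True))
# → list of int32 (optional)
```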

### Markdown renderer

Jinja2 templates for feature, enum, NewType, primitives, and geometry pages.
`render_feature()` expands MODEL-kind fields inline with dot-notation (e.g.,
`sources[].dataset`), stopping at cycle boundaries. `format_type()` in
`markdown_type_format.py` converts `TypeInfo` into link-aware display strings using
`LinkContext`.

### Constraint prose

`field_constraint_description.py` and `model_constraint_description.py` convert
constraint objects into human-readable descriptions. Field constraints produce inline
text. Model constraints produce section-level descriptions and per-field notes, with
consolidation for related conditional constraints (`require_if` / `forbid_if` grouped by
trigger).

### Example loader

Loads example data from theme `pyproject.toml` files, validates against Pydantic models,
and flattens to dot-notation rows for display in feature pages. Also provides a starting
point for generated test data.
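
The dot-notation flattening can be sketched as (the `sources[].dataset` shape follows
the renderer example above; the actual loader's row format may differ):

```python
def flatten(value, prefix=""):
    """Flatten nested example data into (dot-notation path, value) rows."""
    if isinstance(value, dict):
        return [row for k, v in value.items()
                for row in flatten(v, f"{prefix}.{k}" if prefix else k)]
    if isinstance(value, list):
        # List elements share a path with a [] marker.
        return [row for item in value for row in flatten(item, f"{prefix}[]")]
    return [(prefix, value)]

rows = flatten({"sources": [{"dataset": "osm"}], "version": 3})
print(rows)  # → [('sources[].dataset', 'osm'), ('version', 3)]
```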

## Extension Points

**Adding a new output target** (Arrow schemas next, PySpark expressions after): Add a
column to `TypeMapping` in `type_registry.py` for type-name resolution. Write a new
renderer module that consumes specs and the type registry. The extraction layer and
output layout are target-independent.

**Adding a new type kind**: Add a variant to `TypeKind` in `type_analyzer.py`. Handle it
in the terminal classification of `analyze_type()`. Add an extraction function and spec
dataclass if needed. Update renderers to handle the new kind.

**Adding a new constraint type**: The iterative unwrapper collects it automatically (any
`Annotated` metadata becomes a `ConstraintSource`). Add a case to
`describe_field_constraint()` for the prose representation.
