Add semantic model parsing architecture doc (docs/arch/3.3_Semantic_Models.md) #12765
Changes from 3 commits
cc7dd28
01a2690
36a9727
7e33d58
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,185 @@ | ||
| # Semantic Model Parsing | ||
|
|
||
| ## Overview | ||
|
|
||
| Semantic models are first-class resources in dbt-core that expose model data to MetricFlow for metric computation. They define the *entities*, *dimensions*, and *measures* of a model in terms the Semantic Layer can query. Parsing produces `SemanticModel` nodes in the manifest, which are later validated by `dbt_semantic_interfaces`. | ||
|
|
||
| ## Two Authoring Formats | ||
|
|
||
| dbt-core supports two YAML formats for defining semantic models. Understanding the distinction is essential when debugging parsing or partial parsing issues. | ||
|
|
||
| ### v1: Standalone (top-level `semantic_models:` key) | ||
|
|
||
| Defined as an independent entry under a top-level `semantic_models:` key in any schema YAML file: | ||
|
|
||
| ```yaml | ||
| semantic_models: | ||
| - name: revenue | ||
| model: ref('fct_revenue') | ||
| entities: | ||
| - name: transaction | ||
| type: primary | ||
| dimensions: | ||
| - name: ds | ||
| type: time | ||
| type_params: | ||
| time_granularity: day | ||
| measures: | ||
| - name: revenue | ||
| agg: sum | ||
| expr: amount | ||
| ``` | ||
|
|
||
| Parsed by `SemanticModelParser.parse()` in `schema_yaml_readers.py`. The semantic model is a fully independent entry in the YAML; its `model: ref('...')` field links it to the referenced model node via `depends_on`. | ||
|
Review comment on lines +15 to +35:
Contributor: Is v1 deprecated? I.e. do we want to no longer encourage the authoring of v1 metrics? If so we should probably note that in this file.
Contributor (Author): I'll update it with a note. The answer is that V2 YAML should be the default in all things going forward, but there are several specific situations where v1 supports things v2 does not, and we are not able to deprecate v1 at this time.
Contributor (Author): I added a line specifying this.
|
||
|
|
||
| ### v2: Inline (on the `models:` entry) | ||
|
|
||
| Defined directly on a model entry under the `models:` key, with column-level `dimension` and `entity` annotations: | ||
|
|
||
| ```yaml | ||
| models: | ||
| - name: fct_revenue | ||
| semantic_model: true # or a config dict: {name: custom_sm_name, enabled: true, ...} | ||
| agg_time_dimension: ds | ||
| columns: | ||
| - name: transaction_id | ||
| entity: | ||
| name: transaction | ||
| type: primary | ||
| - name: ds | ||
| granularity: day | ||
| dimension: | ||
| name: ds | ||
| type: time | ||
| - name: revenue | ||
| # no dimension/entity — becomes a measure candidate | ||
| ``` | ||
|
|
||
| The semantic model is **not** a standalone YAML entry. It is created as a side effect of model patching during `SchemaParser.patch_node_properties()` in `schemas.py`, which calls `SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()`. The v2 SM has no entry under `dict_from_yaml["semantic_models"]`. | ||
|
|
||
| **Key difference:** v1 SMs are elements of the `semantic_models:` key diff; v2 SMs are a byproduct of the `models:` key diff. This distinction matters for partial parsing (see below). | ||
|
|
||
| ## Key Files | ||
|
|
||
| | File | Role | | ||
| |---|---| | ||
| | `core/dbt/contracts/graph/unparsed.py` | `UnparsedSemanticModel` (v1 contract), `UnparsedSemanticModelConfig` / `UnparsedModelUpdate` (v2 contract) | | ||
| | `core/dbt/parser/schema_yaml_readers.py` | `SemanticModelParser` — `parse()` for v1, `parse_v2_semantic_model_from_dbt_model_patch()` for v2, shared `_parse_semantic_model_helper()` | | ||
| | `core/dbt/parser/schemas.py` | `SchemaParser.patch_node_properties()` — triggers v2 SM creation; `MetricParser.parse_v2_metrics_from_dbt_model_patch()` | | ||
| | `core/dbt/contracts/files.py` | `SchemaSourceFile` — tracks SM unique IDs and metrics per file | | ||
| | `core/dbt/parser/partial.py` | `PartialParsing` — handles SM lifecycle during incremental re-parse | | ||
| | `core/dbt/artifacts/resources/v1/semantic_layer_components.py` | `SemanticModel`, `Dimension`, `Entity`, `Measure` artifact definitions | | ||
|
|
||
| ## `SchemaSourceFile` Tracking Fields | ||
|
|
||
| `SchemaSourceFile` (in `files.py`) maintains per-file lists of parsed resource IDs. For semantic models and metrics: | ||
|
|
||
| - **`semantic_models: List[str]`** — unique IDs of all SMs in this file, both v1 and v2. v2 SM unique IDs are appended here when `_parse_semantic_model_helper()` runs. | ||
| - **`node_patches: List[str]`** (alias `ndp`) — unique IDs of model/seed/snapshot nodes patched by this file. A model with `semantic_model: true` will have its model node ID here. | ||
| - **`metrics_from_measures: Dict[str, List[str]]`** — auto-generated metrics keyed by semantic model name. Populated when `create_metric: true` (v1) or v2 simple metrics are generated from measures. | ||
| - **`metrics: List[str]`** — unique IDs of explicitly declared metrics in this file. | ||
| - **`generated_metrics: List[str]`** — legacy field; use `fix_metrics_from_measures()` to migrate to `metrics_from_measures`. | ||
|
|
||
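The tracking fields above can be pictured with a small stand-in class. This is a minimal sketch, not the real `SchemaSourceFile`: the class name `SchemaFileTracking`, its two helper methods, and the example unique IDs are all hypothetical, and only the field names come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaFileTracking:
    """Hypothetical stand-in for the SM/metric tracking fields on SchemaSourceFile."""

    semantic_models: List[str] = field(default_factory=list)  # v1 and v2 SM unique IDs
    node_patches: List[str] = field(default_factory=list)  # patched model/seed/snapshot IDs
    metrics: List[str] = field(default_factory=list)  # explicitly declared metric IDs
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)

    def record_semantic_model(self, unique_id: str) -> None:
        # Both v1 and v2 SMs end up in the same list; avoid duplicate entries.
        if unique_id not in self.semantic_models:
            self.semantic_models.append(unique_id)

    def record_metric_from_measure(self, sm_name: str, metric_id: str) -> None:
        # Auto-generated metrics are grouped by the name of their semantic model.
        self.metrics_from_measures.setdefault(sm_name, []).append(metric_id)


# Illustrative IDs only; the exact unique ID format is an assumption.
tracking = SchemaFileTracking()
tracking.node_patches.append("model.my_project.fct_revenue")
tracking.record_semantic_model("semantic_model.my_project.fct_revenue")
tracking.record_metric_from_measure("fct_revenue", "metric.my_project.revenue")
```

Keying `metrics_from_measures` by semantic model name is what lets cleanup drop every generated metric for one SM in a single step.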
| ## Parsing Flow | ||
|
|
||
| ### v1 Standalone | ||
|
|
||
| ``` | ||
| SchemaParser.parse_yaml() | ||
| └── SemanticModelParser.parse() | ||
| ├── reads UnparsedSemanticModel from YAML | ||
| ├── calls _parse_semantic_model_helper() | ||
| │ └── adds SemanticModel to manifest.semantic_models | ||
| │ └── appends unique_id to schema_file.semantic_models | ||
| └── optionally: MetricParser for create_metric measures | ||
| └── appends to schema_file.metrics_from_measures[sm_name] | ||
| ``` | ||
|
|
||
| ### v2 Inline | ||
|
|
||
| ``` | ||
| SchemaParser.parse_yaml() | ||
| └── ModelPatcher.parse_patch() | ||
| └── patch_node_properties(node, patch) [schemas.py] | ||
| ├── sets node.access, node.version, etc. | ||
| ├── if semantic_model_enabled: | ||
| │ SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch() | ||
| │ ├── _parse_v2_column_dimensions(patch.columns) | ||
| │ ├── _parse_v2_column_entities(patch.columns) | ||
| │ └── _parse_semantic_model_helper(model=f"ref('{patch.name}')", ...) | ||
| └── MetricParser.parse_v2_metrics_from_dbt_model_patch(patch) | ||
| ``` | ||
|
|
||
| The v2 SM's `model` field is always set to `f"ref('{model_name}')"` — this is the reliable way to identify which model a v2 SM was derived from. | ||
|
|
||
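Because the `model` field has a fixed shape, the model name can be recovered from it with a simple pattern match. The helpers below are a hedged sketch: `v2_model_ref` and `model_name_from_ref` are hypothetical names that do not exist in dbt-core, and the regex only handles the single-quoted form shown above.

```python
import re
from typing import Optional

# Matches exactly the single-quoted form: ref('<model_name>')
_REF_PATTERN = re.compile(r"^ref\('(?P<name>[^']+)'\)$")


def v2_model_ref(model_name: str) -> str:
    """Build the string stored in the SM's model field for a v2 SM."""
    return f"ref('{model_name}')"


def model_name_from_ref(model_field: str) -> Optional[str]:
    """Return the model name if the field has the ref('<name>') shape, else None."""
    match = _REF_PATTERN.match(model_field)
    return match.group("name") if match else None
```

A v1 SM can carry the same `ref(...)` string, so this match identifies the source model but does not by itself prove that an SM is v2.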
| ## Partial Parsing Considerations | ||
|
|
||
| ### v1 SMs — handled correctly | ||
|
|
||
| v1 SMs are diffed via the `semantic_models:` key in `handle_schema_file_changes()`. Added/changed/deleted v1 SM entries invoke `delete_schema_semantic_model()`, which removes the SM from the manifest and from `schema_file.semantic_models`, and cleans up `metrics_from_measures`. | ||
|
|
||
| ### v2 SMs — require special handling (DI-3697) | ||
|
|
||
| v2 SMs are **not** represented under `dict_from_yaml["semantic_models"]`, so the normal `semantic_models:` key diff never processes them. When a model entry is changed or deleted, `_delete_schema_mssa_links()` is called, which handles the model node and tests — but historically did not clean up the associated v2 SM. | ||
|
|
||
| **The fix (merged in DI-3697):** `_delete_schema_mssa_links()` now calls `_delete_v2_semantic_model_for_model()` for `dict_key == "models"`. This method: | ||
|
|
||
| 1. Computes `model_ref = f"ref('{model_name}')"` — the string `_parse_semantic_model_helper` stores in `sm.model` | ||
| 2. Collects names of v1 SMs from `schema_file.dict_from_yaml["semantic_models"]` to avoid touching them | ||
| 3. Iterates `schema_file.semantic_models`, finds entries where `sm.model == model_ref and sm.name not in v1_sm_names`, removes them and cleans up `metrics_from_measures` | ||
|
|
||
| **Distinguishing v1 from v2 SMs in the manifest:** An SM in `schema_file.semantic_models` is v2 if its name does **not** appear in `schema_file.dict_from_yaml.get("semantic_models", [])`. Its `sm.model` will also match `ref('<the_model_name>')`, but that match alone is not sufficient, because a v1 SM can reference the same model (which is why step 2 above excludes v1 names). | ||
|
|
||
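The three cleanup steps can be sketched with plain stand-in objects. This is an illustration under stated assumptions, not the merged implementation: `FakeSemanticModel`, `FakeSchemaFile`, and `delete_v2_semantic_models_for_model` are hypothetical names, the manifest is reduced to a dict keyed by unique ID, and metric cleanup is reduced to dropping the `metrics_from_measures` entry.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeSemanticModel:
    unique_id: str
    name: str
    model: str  # e.g. "ref('fct_revenue')"


@dataclass
class FakeSchemaFile:
    dict_from_yaml: Dict[str, Any] = field(default_factory=dict)
    semantic_models: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)


def delete_v2_semantic_models_for_model(
    schema_file: FakeSchemaFile,
    manifest_sms: Dict[str, FakeSemanticModel],
    model_name: str,
) -> List[str]:
    """Remove v2 SMs derived from model_name; return the removed unique IDs."""
    # Step 1: the string stored in sm.model for a v2 SM.
    model_ref = f"ref('{model_name}')"
    # Step 2: names of v1 SMs declared in this file, which must be left alone.
    v1_sm_names = {
        entry["name"] for entry in schema_file.dict_from_yaml.get("semantic_models", [])
    }
    # Step 3: iterate over a copy, since the list is mutated while scanning.
    removed: List[str] = []
    for unique_id in list(schema_file.semantic_models):
        sm = manifest_sms.get(unique_id)
        if sm is None:
            continue
        if sm.model == model_ref and sm.name not in v1_sm_names:
            del manifest_sms[unique_id]
            schema_file.semantic_models.remove(unique_id)
            schema_file.metrics_from_measures.pop(sm.name, None)
            removed.append(unique_id)
    return removed
```

The v1 name check matters when a file holds both a standalone SM and an inline SM for the same model: both carry the same `ref(...)` string, and only the inline one should be removed when the `models:` entry changes.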
| ### `_schedule_for_parsing` limitation | ||
|
|
||
| `schedule_nodes_for_parsing()` can schedule SMs for re-parse when their dependencies change (via `child_map`). However, it uses `_schedule_for_parsing("semantic_models", ...)` which looks up the SM in `schema_file.dict_from_yaml["semantic_models"]` — a lookup that silently fails for v2 SMs. If a v2 SM's children (e.g. saved queries) change and trigger a re-parse of the SM, this path will not find the SM to re-merge. This is a known limitation as of dbt 1.12. | ||
|
|
||
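The lookup miss described above can be reproduced with a toy example. This sketch does not reproduce `_schedule_for_parsing`; `find_yaml_entry` is a hypothetical helper that mimics a by-name lookup under one YAML key, and the example assumes the v2 SM is named after its model.

```python
from typing import Any, Dict, Optional


def find_yaml_entry(
    dict_from_yaml: Dict[str, Any], dict_key: str, name: str
) -> Optional[Dict[str, Any]]:
    """Return the entry called `name` under `dict_key`, or None if absent."""
    for entry in dict_from_yaml.get(dict_key, []):
        if entry.get("name") == name:
            return entry
    return None


# Illustrative parsed YAML: one v1 SM, plus one v2 SM declared inline on a model.
dict_from_yaml = {
    "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
    "models": [{"name": "fct_orders", "semantic_model": True}],
}

# The v1 SM is found under the semantic_models key.
v1_entry = find_yaml_entry(dict_from_yaml, "semantic_models", "revenue")
# The v2 SM has no entry there, so the lookup returns None without raising.
v2_entry = find_yaml_entry(dict_from_yaml, "semantic_models", "fct_orders")
# Its YAML actually lives under the models key.
v2_source = find_yaml_entry(dict_from_yaml, "models", "fct_orders")
```

Nothing raises on the miss, which is why the failure is silent: the caller receives nothing to re-merge.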
| ## Testing Patterns | ||
|
|
||
| ### Test locations | ||
|
|
||
| | Test type | Location | | ||
| |---|---| | ||
| | v1 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_parsing.py` | | ||
| | v1 partial parsing | `tests/functional/semantic_models/test_semantic_model_parsing.py` — `TestSemanticModelPartialParsing*` | | ||
| | v2 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` | | ||
| | v2 partial parsing | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` — `TestV2SemanticModel*PartialParsing*` | | ||
| | v2 column-level parsing | `tests/unit/parser/test_v2_column_semantic_parsing.py` | | ||
| | Partial parsing with metrics + SMs | `tests/functional/partial_parsing/test_pp_metrics.py` | | ||
|
|
||
| ### Functional test pattern for partial parsing | ||
|
|
||
| ```python | ||
| class TestV2SemanticModelPartialParsingChanged: | ||
| @pytest.fixture(scope="class") | ||
| def models(self): | ||
| return { | ||
| "schema.yml": some_v2_fixture_yml, | ||
| "fct_revenue.sql": fct_revenue_sql, | ||
| "metricflow_time_spine.sql": metricflow_time_spine_sql, | ||
| } | ||
|
|
||
| def test_partial_parsing_does_not_duplicate(self, project): | ||
| from dbt.tests.util import write_file | ||
|
|
||
| runner = dbtTestRunner() | ||
| result = runner.invoke(["parse"]) # full parse | ||
| assert result.success | ||
| assert len(result.result.semantic_models) == 1 | ||
|
|
||
| write_file(modified_yml, project.project_root, "models", "schema.yml") | ||
|
|
||
| result = runner.invoke(["parse"]) # partial parse | ||
| assert result.success | ||
| assert len(result.result.semantic_models) == 1 # not 2 | ||
| ``` | ||
|
|
||
| Key: the second `runner.invoke(["parse"])` uses the saved `partial_parse.msgpack` from the first run. Changing the YAML file on disk triggers partial parsing of that file's changed elements. | ||
|
|
||
| ### Fixtures | ||
|
|
||
| Shared YAML and SQL fixtures live in `tests/functional/semantic_models/fixtures.py`. v2 fixtures are named with the `_v2` suffix (e.g. `semantic_model_schema_yml_v2`, `base_schema_yml_v2`). The template fixture `semantic_model_schema_yml_v2_template_for_model_configs` uses a `{semantic_model_value}` placeholder for parameterizing the `semantic_model:` field value. | ||
|
|
||
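A placeholder-based template of this kind can be rendered once per test case. The sketch below assumes plain `str.format` substitution and uses a made-up template; `example_template` and `render_schema` are hypothetical and do not reproduce the real fixture's contents.

```python
# Hypothetical template; the real fixture in fixtures.py has different contents.
example_template = """
models:
  - name: fct_revenue
    semantic_model: {semantic_model_value}
    agg_time_dimension: ds
"""


def render_schema(semantic_model_value: str) -> str:
    """Fill the placeholder with a YAML value for the semantic_model: field."""
    return example_template.format(semantic_model_value=semantic_model_value)


# A bare boolean and an inline config dict, both as YAML text.
schema_enabled = render_schema("true")
schema_with_config = render_schema("{name: custom_sm_name, enabled: true}")
```

Braces in the substituted value are safe because `str.format` does not re-scan inserted text; braces in the template itself would need doubling.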
| ## See Also | ||
|
|
||
| - [Troubleshooting: Semantic Layer Parse Failures](../troubleshooting/semantic_layer_parse_failures.md) — common causes of `dbt parse` errors for semantic models and metrics, and how to improve the error messages they produce. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| # Troubleshooting: Semantic Layer Parse Failures | ||
|
|
||
| This document covers common causes of `dbt parse` failures related to semantic | ||
| models and metrics, and how to fix or improve the errors produced. | ||
|
|
||
| ## Extra fields on YAML config objects produce vague errors | ||
|
|
||
| When a user adds an unrecognised field to a YAML config object (e.g. inside | ||
| `semantic_model:`, a `dimension:`, or a `metric:`), dbt's JSON Schema validator | ||
| rejects it but the default error message is unhelpful — it names the whole | ||
| object rather than the offending key: | ||
|
|
||
| ``` | ||
| Invalid models config given in models/schema.yml @ models: {...} - at path | ||
| ['semantic_model']: {...} is not valid under any of the given schemas | ||
| ``` | ||
|
|
||
| **How to improve the error:** Add a `validate()` classmethod to the relevant | ||
| `Unparsed*` dataclass in `core/dbt/contracts/graph/unparsed.py`. Compare | ||
| `cls.__dataclass_fields__` against the incoming `data` dict before calling | ||
| `super().validate(data)`, and raise a `ValidationError` that names the unknown | ||
| field(s) and lists the valid ones. `UnparsedSemanticModelConfig.validate()` is | ||
| the reference implementation. | ||
|
|
||
| When adding a test for such an override, use `ContractTestCase.assert_fails_validation_with_message()` | ||
| (in `tests/unit/utils/__init__.py`) to assert both that validation fails *and* | ||
| that the error message is actionable. | ||
|
|
||
| For a worked example, see PR #12766. | ||
|
|
||
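The `validate()` override described above can be sketched on a standalone dataclass. This is a minimal illustration, not the reference implementation: `ExampleSemanticModelConfig` and the local `ValidationError` class are hypothetical stand-ins, and the real override would go on to call `super().validate(data)` after the check.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


class ValidationError(Exception):
    """Local stand-in for the validation error type raised by dbt."""


@dataclass
class ExampleSemanticModelConfig:
    """Hypothetical config object; not the real UnparsedSemanticModelConfig."""

    enabled: bool = True
    name: Optional[str] = None

    @classmethod
    def validate(cls, data: Dict[str, Any]) -> None:
        # Compare incoming keys against the declared dataclass fields first,
        # so the error names the offending key instead of the whole object.
        valid_fields = set(cls.__dataclass_fields__)
        unknown = sorted(set(data) - valid_fields)
        if unknown:
            raise ValidationError(
                f"Unknown field(s) {unknown} in {cls.__name__}; "
                f"valid fields are {sorted(valid_fields)}"
            )
        # In dbt-core the override would now delegate to super().validate(data).
```

Calling `ExampleSemanticModelConfig.validate({"enabled": True, "descripton": "..."})` raises an error that names `descripton` and lists `enabled` and `name` as the valid fields.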
| ## Union-typed fields produce even more vague errors | ||
|
|
||
| Several fields in `unparsed.py` use `Union[SomeConfig, bool, None]` (e.g. | ||
| `UnparsedModelUpdate.semantic_model`). When validation fails on the `SomeConfig` | ||
| branch, JSON Schema exhausts all branches of the `anyOf` and reports failure | ||
| against the union as a whole — giving no indication of which branch failed or | ||
| why: | ||
|
|
||
| ``` | ||
| at path ['semantic_model']: {'enabled': True, 'name': 'purchases', 'description': | ||
| '...'} is not valid under any of the given schemas | ||
| ``` | ||
|
|
||
| **How to improve the error:** The same `validate()` override approach works here. | ||
| By checking the sub-object's fields before `super().validate(data)` runs, the | ||
| specific error fires first and the opaque union failure is never reached. | ||
|
|
||
| ## Standalone simple metrics must be nested under the model entry | ||
|
|
||
| Simple v2 metrics must be written under the model entry (`models[].metrics`), | ||
| not as a top-level `metrics:` key. A top-level `metrics:` key is valid for | ||
| derived, conversion, and cumulative metrics — but **not** for simple ones. Using | ||
| it for a simple metric raises: | ||
|
|
||
| ``` | ||
| simple metrics in v2 YAML must be attached to semantic_model | ||
| ``` | ||
|
|
||
| To fix this, move metrics with `type: simple` into a `metrics:` list indented under the | ||
| model entry (at the same level as `columns:`): | ||
|
|
||
| ```yaml | ||
| # Wrong — top-level metrics: key | ||
| models: | ||
| - name: fct_revenue | ||
| semantic_model: true | ||
| columns: ... | ||
|
|
||
| metrics: | ||
| - name: total_revenue # fails: simple metric cannot be standalone | ||
| type: simple | ||
| agg: sum | ||
| expr: revenue | ||
|
|
||
| # Right — metrics nested under the model entry | ||
| models: | ||
| - name: fct_revenue | ||
| semantic_model: true | ||
| columns: ... | ||
| metrics: | ||
| - name: total_revenue | ||
| type: simple | ||
| agg: sum | ||
| expr: revenue | ||
| ``` |
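The placement rule can be checked on an already-parsed YAML dict before running `dbt parse`. This is a hedged sketch of a standalone lint helper: `misplaced_simple_metrics` is a hypothetical function, not part of dbt-core, and it only inspects the top-level `metrics:` key.

```python
from typing import Any, Dict, List


def misplaced_simple_metrics(schema: Dict[str, Any]) -> List[str]:
    """Return names of simple metrics declared under the top-level metrics: key."""
    top_level_metrics = schema.get("metrics") or []
    return [
        metric.get("name", "<unnamed>")
        for metric in top_level_metrics
        if metric.get("type") == "simple"
    ]


# Parsed form of the "Wrong" layout above, plus a derived metric that is allowed.
wrong_layout = {
    "models": [{"name": "fct_revenue", "semantic_model": True}],
    "metrics": [
        {"name": "total_revenue", "type": "simple", "agg": "sum", "expr": "revenue"},
        {"name": "revenue_growth", "type": "derived"},
    ],
}

# Parsed form of the "Right" layout: the simple metric is nested under the model.
right_layout = {
    "models": [
        {
            "name": "fct_revenue",
            "semantic_model": True,
            "metrics": [{"name": "total_revenue", "type": "simple"}],
        }
    ],
}

flagged = misplaced_simple_metrics(wrong_layout)
clean = misplaced_simple_metrics(right_layout)
```

Only `total_revenue` is flagged in the first layout; the derived metric is left alone because top-level placement is valid for it.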
Soon to be out of date 😂 No change needed here yet, just found it entertaining