Add semantic model parsing architecture doc (docs/arch/3.3_Semantic_Models.md) #12765
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,181 @@ | ||
| # Semantic Model Parsing | ||
|
|
||
| ## Overview | ||
|
|
||
| Semantic models are first-class resources in dbt-core that expose model data to MetricFlow for metric computation. They define the *entities*, *dimensions*, and *measures* of a model in terms the Semantic Layer can query. Parsing produces `SemanticModel` nodes in the manifest, which are later validated by `dbt_semantic_interfaces`. | ||
|
Contributor
Soon to be out of date 😂 No change needed here yet, just found it entertaining |
||
|
|
||
| ## Two Authoring Formats | ||
|
|
||
| dbt-core supports two YAML formats for defining semantic models. Understanding the distinction is essential when debugging parsing or partial parsing issues. | ||
|
|
||
| ### v1: Standalone (top-level `semantic_models:` key) | ||
|
|
||
| Defined as an independent entry under a top-level `semantic_models:` key in any schema YAML file: | ||
|
|
||
| ```yaml | ||
| semantic_models: | ||
| - name: revenue | ||
| model: ref('fct_revenue') | ||
| entities: | ||
| - name: transaction | ||
| type: primary | ||
| dimensions: | ||
| - name: ds | ||
| type: time | ||
| type_params: | ||
| time_granularity: day | ||
| measures: | ||
| - name: revenue | ||
| agg: sum | ||
| expr: amount | ||
| ``` | ||
|
|
||
| Parsed by `SemanticModelParser.parse()` in `schema_yaml_readers.py`. The semantic model is a fully independent entry in the YAML; its `model: ref('...')` field links it to the referenced model node via `depends_on`. | ||
|
Comment on lines +13 to +33
Contributor
Is v1 deprecated? I.e. do we want to no longer encourage the authoring of v1 metrics? If so we should probably note that in this file.
Contributor
Author
I'll update it with a note. The answer is that V2 YAML should be the default in all things going forward, but there are several specific situations where v1 supports things v2 does not, and we are not able to deprecate v1 at this time. |
||
|
|
||
| ### v2: Inline (on the `models:` entry) | ||
|
|
||
| Defined directly on a model entry under the `models:` key, with column-level `dimension` and `entity` annotations: | ||
|
|
||
| ```yaml | ||
| models: | ||
| - name: fct_revenue | ||
| semantic_model: true # or a config dict: {name: custom_sm_name, enabled: true, ...} | ||
| agg_time_dimension: ds | ||
| columns: | ||
| - name: transaction_id | ||
| entity: | ||
| name: transaction | ||
| type: primary | ||
| - name: ds | ||
| granularity: day | ||
| dimension: | ||
| name: ds | ||
| type: time | ||
| - name: revenue | ||
| # no dimension/entity — becomes a measure candidate | ||
| ``` | ||
|
|
||
| The semantic model is **not** a standalone YAML entry. It is created as a side effect of model patching during `SchemaParser.patch_node_properties()` in `schemas.py`, which calls `SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()`. The v2 SM has no entry under `dict_from_yaml["semantic_models"]`. | ||
|
|
||
| **Key difference:** v1 SMs are elements of the `semantic_models:` key diff; v2 SMs are a byproduct of the `models:` key diff. This distinction matters for partial parsing (see below). | ||
|
|
||
| ## Key Files | ||
|
|
||
| | File | Role | | ||
| |---|---| | ||
| | `core/dbt/contracts/graph/unparsed.py` | `UnparsedSemanticModel` (v1 contract), `UnparsedSemanticModelConfig` / `UnparsedModelUpdate` (v2 contract) | | ||
| | `core/dbt/parser/schema_yaml_readers.py` | `SemanticModelParser` — `parse()` for v1, `parse_v2_semantic_model_from_dbt_model_patch()` for v2, shared `_parse_semantic_model_helper()` | | ||
| | `core/dbt/parser/schemas.py` | `SchemaParser.patch_node_properties()` — triggers v2 SM creation; `MetricParser.parse_v2_metrics_from_dbt_model_patch()` | | ||
| | `core/dbt/contracts/files.py` | `SchemaSourceFile` — tracks SM unique IDs and metrics per file | | ||
| | `core/dbt/parser/partial.py` | `PartialParsing` — handles SM lifecycle during incremental re-parse | | ||
| | `core/dbt/artifacts/resources/v1/semantic_layer_components.py` | `SemanticModel`, `Dimension`, `Entity`, `Measure` artifact definitions | | ||
|
|
||
| ## `SchemaSourceFile` Tracking Fields | ||
|
|
||
| `SchemaSourceFile` (in `files.py`) maintains per-file lists of parsed resource IDs. For semantic models and metrics: | ||
|
|
||
| - **`semantic_models: List[str]`** — unique IDs of all SMs in this file, both v1 and v2. v2 SM unique IDs are appended here when `_parse_semantic_model_helper()` runs. | ||
| - **`node_patches: List[str]`** (alias `ndp`) — unique IDs of model/seed/snapshot nodes patched by this file. A model with `semantic_model: true` will have its model node ID here. | ||
| - **`metrics_from_measures: Dict[str, List[str]]`** — auto-generated metrics keyed by semantic model name. Populated when `create_metric: true` (v1) or v2 simple metrics are generated from measures. | ||
| - **`metrics: List[str]`** — unique IDs of explicitly declared metrics in this file. | ||
| - **`generated_metrics: List[str]`** — legacy field; use `fix_metrics_from_measures()` to migrate to `metrics_from_measures`. | ||
|
|
||
| ## Parsing Flow | ||
|
|
||
| ### v1 Standalone | ||
|
|
||
| ``` | ||
| SchemaParser.parse_yaml() | ||
| └── SemanticModelParser.parse() | ||
| ├── reads UnparsedSemanticModel from YAML | ||
| ├── calls _parse_semantic_model_helper() | ||
| │ └── adds SemanticModel to manifest.semantic_models | ||
| │ └── appends unique_id to schema_file.semantic_models | ||
| └── optionally: MetricParser for create_metric measures | ||
| └── appends to schema_file.metrics_from_measures[sm_name] | ||
| ``` | ||
|
|
||
| ### v2 Inline | ||
|
|
||
| ``` | ||
| SchemaParser.parse_yaml() | ||
| └── ModelPatcher.parse_patch() | ||
| └── patch_node_properties(node, patch) [schemas.py] | ||
| ├── sets node.access, node.version, etc. | ||
| ├── if semantic_model_enabled: | ||
| │ SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch() | ||
| │ ├── _parse_v2_column_dimensions(patch.columns) | ||
| │ ├── _parse_v2_column_entities(patch.columns) | ||
| │ └── _parse_semantic_model_helper(model=f"ref('{patch.name}')", ...) | ||
| └── MetricParser.parse_v2_metrics_from_dbt_model_patch(patch) | ||
| ``` | ||
|
|
||
| The v2 SM's `model` field is always set to `f"ref('{model_name}')"` — this is the reliable way to identify which model a v2 SM was derived from. | ||
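As a rough illustration of that convention, the sketch below builds the ref string and recovers the model name from it. Both helpers (`v2_model_ref`, `model_name_from_ref`) are hypothetical names invented for this example and are not part of dbt-core. The pattern only recognizes the exact single-quoted, single-argument form described above.

```python
import re
from typing import Optional

# Matches only the exact form written for v2 SMs: ref('<model_name>')
_V2_REF_PATTERN = re.compile(r"^ref\('([^']+)'\)$")


def v2_model_ref(model_name: str) -> str:
    """Build the ref string stored in a v2 SM's `model` field (hypothetical helper)."""
    return f"ref('{model_name}')"


def model_name_from_ref(model_field: str) -> Optional[str]:
    """Recover the model name from a v2-style ref string, or None if it does not match."""
    match = _V2_REF_PATTERN.match(model_field)
    return match.group(1) if match else None


ref_string = v2_model_ref("fct_revenue")
recovered_name = model_name_from_ref(ref_string)
```

A hand-written v1 `model:` value may use other quoting or a package-qualified ref, so a helper like this should not be assumed to parse arbitrary v1 entries.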
|
|
||
| ## Partial Parsing Considerations | ||
|
|
||
| ### v1 SMs — handled correctly | ||
|
|
||
| v1 SMs are diffed via the `semantic_models:` key in `handle_schema_file_changes()`. Added/changed/deleted v1 SM entries invoke `delete_schema_semantic_model()`, which removes the SM from the manifest and from `schema_file.semantic_models`, and cleans up `metrics_from_measures`. | ||
|
|
||
| ### v2 SMs — require special handling (DI-3697) | ||
|
|
||
| v2 SMs are **not** represented under `dict_from_yaml["semantic_models"]`, so the normal `semantic_models:` key diff never processes them. When a model entry is changed or deleted, `_delete_schema_mssa_links()` is called, which handles the model node and tests — but historically did not clean up the associated v2 SM. | ||
|
|
||
| **The fix (merged in DI-3697):** `_delete_schema_mssa_links()` now calls `_delete_v2_semantic_model_for_model()` for `dict_key == "models"`. This method: | ||
|
|
||
| 1. Computes `model_ref = f"ref('{model_name}')"` — the string `_parse_semantic_model_helper` stores in `sm.model` | ||
| 2. Collects names of v1 SMs from `schema_file.dict_from_yaml["semantic_models"]` to avoid touching them | ||
| 3. Iterates `schema_file.semantic_models`, finds entries where `sm.model == model_ref and sm.name not in v1_sm_names`, removes them and cleans up `metrics_from_measures` | ||
|
|
||
| **Distinguishing v1 from v2 SMs in the manifest:** A SM in `schema_file.semantic_models` is v2 if its name does **not** appear among the entries of `schema_file.dict_from_yaml.get("semantic_models", [])`. Its `sm.model` will also match `ref('<the_model_name>')`, but that check alone is not sufficient, because a v1 SM can reference the same model. | |
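The three cleanup steps and the name-based v1/v2 check can be sketched roughly as follows. This is not the dbt-core implementation: `SemanticModelStub`, `SchemaFileStub`, and `delete_v2_semantic_models_for_model` are hypothetical stand-ins reduced to the fields this document mentions, the unique IDs are illustrative, and the real method may do additional manifest cleanup.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticModelStub:
    """Hypothetical stand-in for a parsed SemanticModel node."""
    unique_id: str
    name: str
    model: str  # e.g. "ref('fct_revenue')"


@dataclass
class SchemaFileStub:
    """Hypothetical stand-in for SchemaSourceFile, reduced to the fields used here."""
    dict_from_yaml: dict
    semantic_models: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)


def delete_v2_semantic_models_for_model(
    schema_file: SchemaFileStub,
    manifest_sms: Dict[str, SemanticModelStub],
    model_name: str,
) -> List[str]:
    """Remove v2 SMs derived from `model_name`; return the removed unique IDs."""
    # Step 1: the ref string stored in sm.model for a v2 SM.
    model_ref = f"ref('{model_name}')"
    # Step 2: names of v1 SMs declared under the top-level key, which must be left alone.
    v1_sm_names = {
        entry["name"] for entry in schema_file.dict_from_yaml.get("semantic_models", [])
    }
    # Step 3: drop SMs that point at the model and are not v1 entries.
    removed = []
    for unique_id in list(schema_file.semantic_models):
        sm = manifest_sms.get(unique_id)
        if sm is None:
            continue
        if sm.model == model_ref and sm.name not in v1_sm_names:
            schema_file.semantic_models.remove(unique_id)
            schema_file.metrics_from_measures.pop(sm.name, None)
            del manifest_sms[unique_id]
            removed.append(unique_id)
    return removed


# One file holding a v1 SM and a v2 SM that both reference fct_revenue.
schema_file = SchemaFileStub(
    dict_from_yaml={
        "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
        "models": [{"name": "fct_revenue", "semantic_model": True}],
    },
    semantic_models=["semantic_model.test.revenue", "semantic_model.test.fct_revenue"],
    metrics_from_measures={"fct_revenue": ["metric.test.total_revenue"]},
)
manifest_sms = {
    "semantic_model.test.revenue": SemanticModelStub(
        "semantic_model.test.revenue", "revenue", "ref('fct_revenue')"
    ),
    "semantic_model.test.fct_revenue": SemanticModelStub(
        "semantic_model.test.fct_revenue", "fct_revenue", "ref('fct_revenue')"
    ),
}
removed = delete_v2_semantic_models_for_model(schema_file, manifest_sms, "fct_revenue")
```

In this example both SMs share the same `model` value, so only the name check against the YAML's `semantic_models:` entries keeps the v1 SM from being removed.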
|
|
||
| ### `_schedule_for_parsing` limitation | ||
|
|
||
| `schedule_nodes_for_parsing()` can schedule SMs for re-parse when their dependencies change (via `child_map`). However, it uses `_schedule_for_parsing("semantic_models", ...)` which looks up the SM in `schema_file.dict_from_yaml["semantic_models"]` — a lookup that silently fails for v2 SMs. If a v2 SM's children (e.g. saved queries) change and trigger a re-parse of the SM, this path will not find the SM to re-merge. This is a known limitation as of dbt 1.12. | ||
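The failed lookup can be illustrated with a small sketch. `find_yaml_entry` is a hypothetical helper that mimics a name lookup under a YAML key; it is not the `_schedule_for_parsing` code itself. The example also assumes the v2 SM takes the model's name, which may differ when a custom name is configured.

```python
from typing import Optional


def find_yaml_entry(dict_from_yaml: dict, dict_key: str, name: str) -> Optional[dict]:
    """Return the YAML entry called `name` under `dict_key`, or None if absent."""
    for entry in dict_from_yaml.get(dict_key, []):
        if entry.get("name") == name:
            return entry
    return None


# v1: the SM has its own entry under the top-level semantic_models: key.
v1_file = {"semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}]}
# v2: the SM exists only as a flag on the models: entry.
v2_file = {"models": [{"name": "fct_revenue", "semantic_model": True}]}

v1_hit = find_yaml_entry(v1_file, "semantic_models", "revenue")
v2_miss = find_yaml_entry(v2_file, "semantic_models", "fct_revenue")
v2_via_models = find_yaml_entry(v2_file, "models", "fct_revenue")
```

Here `v2_miss` is `None` because the file has no `semantic_models:` key at all, while the same name resolves under `models:`. A fix would presumably need to fall back to the owning model entry, though the document does not describe one.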
|
|
||
| ## Testing Patterns | ||
|
|
||
| ### Test locations | ||
|
|
||
| | Test type | Location | | ||
| |---|---| | ||
| | v1 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_parsing.py` | | ||
| | v1 partial parsing | `tests/functional/semantic_models/test_semantic_model_parsing.py` — `TestSemanticModelPartialParsing*` | | ||
| | v2 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` | | ||
| | v2 partial parsing | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` — `TestV2SemanticModel*PartialParsing*` | | ||
| | v2 column-level parsing | `tests/unit/parser/test_v2_column_semantic_parsing.py` | | ||
| | Partial parsing with metrics + SMs | `tests/functional/partial_parsing/test_pp_metrics.py` | | ||
|
|
||
| ### Functional test pattern for partial parsing | ||
|
|
||
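| ```python | |
| import pytest | |
| from tests.functional.assertions.test_runner import dbtTestRunner  # assumed location of the test helper | |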
| class TestV2SemanticModelPartialParsingChanged: | ||
| @pytest.fixture(scope="class") | ||
| def models(self): | ||
| return { | ||
| "schema.yml": some_v2_fixture_yml, | ||
| "fct_revenue.sql": fct_revenue_sql, | ||
| "metricflow_time_spine.sql": metricflow_time_spine_sql, | ||
| } | ||
|
|
||
| def test_partial_parsing_does_not_duplicate(self, project): | ||
| from dbt.tests.util import write_file | ||
|
|
||
| runner = dbtTestRunner() | ||
| result = runner.invoke(["parse"]) # full parse | ||
| assert result.success | ||
| assert len(result.result.semantic_models) == 1 | ||
|
|
||
| write_file(modified_yml, project.project_root, "models", "schema.yml") | ||
|
|
||
| result = runner.invoke(["parse"]) # partial parse | ||
| assert result.success | ||
| assert len(result.result.semantic_models) == 1 # not 2 | ||
| ``` | ||
|
|
||
| Key: the second `runner.invoke(["parse"])` uses the saved `partial_parse.msgpack` from the first run. Changing the YAML file on disk triggers partial parsing of that file's changed elements. | ||
|
|
||
| ### Fixtures | ||
|
|
||
| Shared YAML and SQL fixtures live in `tests/functional/semantic_models/fixtures.py`. v2 fixtures are named with the `_v2` suffix (e.g. `semantic_model_schema_yml_v2`, `base_schema_yml_v2`). The template fixture `semantic_model_schema_yml_v2_template_for_model_configs` uses a `{semantic_model_value}` placeholder for parameterizing the `semantic_model:` field value. | ||