diff --git a/AGENTS.md b/AGENTS.md
index 2abd1954c56..661bd9ef391 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -59,6 +59,19 @@ Import order should follow isort conventions:
 3. Third-party
 4. dbt-internal (`dbt`, `dbt_common`, `dbt_adapters`, `dbt_extractor`, `dbt_semantic_interfaces`)
 
+## Architecture Documentation
+
+Before investigating parsing bugs or adding new resource types, read the relevant doc in `docs/arch/`:
+
+| Doc | Covers |
+|---|---|
+| `3_Parsing.md` | Full parse flow, `ManifestLoader`, `SchemaParser`, parser hierarchy |
+| `3.1_Partial_Parsing.md` | Partial parse internals, `PartialParsing` class, file diff and change detection |
+| `3.2_Deferral.md` | State-based deferral |
+| `3.3_Semantic_Models.md` | Semantic model parsing (v1 standalone vs v2 inline), partial parsing edge cases, key files |
+
+These docs describe where things live and how they connect; read them before doing exploratory code search.
+
 ## Key Architectural Conventions
 
 ### Artifact Resources: Import from `dbt.artifacts.resources`, Not Versioned Paths
diff --git a/docs/arch/3.3_Semantic_Models.md b/docs/arch/3.3_Semantic_Models.md
new file mode 100644
index 00000000000..c11f44898ea
--- /dev/null
+++ b/docs/arch/3.3_Semantic_Models.md
@@ -0,0 +1,187 @@
+# Semantic Model Parsing
+
+## Overview
+
+Semantic models are first-class resources in dbt-core that expose model data to MetricFlow for metric computation. They define the *entities*, *dimensions*, and *measures* of a model in terms the Semantic Layer can query. Parsing produces `SemanticModel` nodes in the manifest, which are later validated by `dbt_semantic_interfaces`.
+
+## Two Authoring Formats
+
+dbt-core supports two YAML formats for defining semantic models. Understanding the distinction is essential when debugging parsing or partial parsing issues.
+
+v2 is the preferred format for defining your Semantic Layer, but v1 cannot be fully deprecated yet because it supports several use cases that v2 does not.
+
+### v1: Standalone (top-level `semantic_models:` key)
+
+Defined as an independent entry under a top-level `semantic_models:` key in any schema YAML file:
+
+```yaml
+semantic_models:
+  - name: revenue
+    model: ref('fct_revenue')
+    entities:
+      - name: transaction
+        type: primary
+    dimensions:
+      - name: ds
+        type: time
+        type_params:
+          time_granularity: day
+    measures:
+      - name: revenue
+        agg: sum
+        expr: amount
+```
+
+Parsed by `SemanticModelParser.parse()` in `schema_yaml_readers.py`. The semantic model is a fully independent entry in the YAML; its `model: ref('...')` field links it to the referenced model node via `depends_on`.
+
+### v2: Inline (on the `models:` entry)
+
+Defined directly on a model entry under the `models:` key, with column-level `dimension` and `entity` annotations:
+
+```yaml
+models:
+  - name: fct_revenue
+    semantic_model: true  # or a config dict: {name: custom_sm_name, enabled: true, ...}
+    agg_time_dimension: ds
+    columns:
+      - name: transaction_id
+        entity:
+          name: transaction
+          type: primary
+      - name: ds
+        granularity: day
+        dimension:
+          name: ds
+          type: time
+      - name: revenue
+        # no dimension/entity: becomes a measure candidate
+```
+
+The semantic model is **not** a standalone YAML entry. It is created as a side effect of model patching during `SchemaParser.patch_node_properties()` in `schemas.py`, which calls `SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()`. The v2 SM has no entry under `dict_from_yaml["semantic_models"]`.
+
+**Key difference:** v1 SMs are elements of the `semantic_models:` key diff; v2 SMs are a byproduct of the `models:` key diff. This distinction matters for partial parsing (see below).
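
To make the key difference concrete, here is a minimal sketch that sorts the semantic models declared in one parsed schema file into the two formats. The function name `classify_semantic_models` is hypothetical and not part of dbt-core, and the rule that a config dict without `enabled: false` switches the inline SM on is an assumption made for illustration.

```python
from typing import Any, Dict, List, Tuple


def classify_semantic_models(dict_from_yaml: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (v1 SM names, names of models that declare an inline v2 SM)."""
    # v1: every entry under the top-level `semantic_models:` key.
    v1_names = [sm["name"] for sm in dict_from_yaml.get("semantic_models", [])]

    # v2: model entries whose `semantic_model:` field is truthy.
    v2_models = []
    for model in dict_from_yaml.get("models", []):
        sm_value = model.get("semantic_model")
        # Assumption: `true`, or a config dict that does not set `enabled: false`.
        if sm_value is True or (isinstance(sm_value, dict) and sm_value.get("enabled", True)):
            v2_models.append(model["name"])
    return v1_names, v2_models


schema = {
    "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
    "models": [
        {"name": "fct_revenue", "semantic_model": True},
        {"name": "dim_customers"},
    ],
}
v1, v2 = classify_semantic_models(schema)
print(v1, v2)
```

Only the first list corresponds to entries that a `semantic_models:` key diff can see; the second list is reachable only through the `models:` key.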
+
+## Key Files
+
+| File | Role |
+|---|---|
+| `core/dbt/contracts/graph/unparsed.py` | `UnparsedSemanticModel` (v1 contract), `UnparsedSemanticModelConfig` / `UnparsedModelUpdate` (v2 contract) |
+| `core/dbt/parser/schema_yaml_readers.py` | `SemanticModelParser`: `parse()` for v1, `parse_v2_semantic_model_from_dbt_model_patch()` for v2, shared `_parse_semantic_model_helper()` |
+| `core/dbt/parser/schemas.py` | `SchemaParser.patch_node_properties()`: triggers v2 SM creation; `MetricParser.parse_v2_metrics_from_dbt_model_patch()` |
+| `core/dbt/contracts/files.py` | `SchemaSourceFile`: tracks SM unique IDs and metrics per file |
+| `core/dbt/parser/partial.py` | `PartialParsing`: handles SM lifecycle during incremental re-parse |
+| `core/dbt/artifacts/resources/v1/semantic_layer_components.py` | `SemanticModel`, `Dimension`, `Entity`, `Measure` artifact definitions |
+
+## `SchemaSourceFile` Tracking Fields
+
+`SchemaSourceFile` (in `files.py`) maintains per-file lists of parsed resource IDs. For semantic models and metrics:
+
+- **`semantic_models: List[str]`**: unique IDs of all SMs in this file, both v1 and v2. v2 SM unique IDs are appended here when `_parse_semantic_model_helper()` runs.
+- **`node_patches: List[str]`** (alias `ndp`): unique IDs of model/seed/snapshot nodes patched by this file. A model with `semantic_model: true` will have its model node ID here.
+- **`metrics_from_measures: Dict[str, List[str]]`**: auto-generated metrics keyed by semantic model name. Populated when `create_metric: true` (v1) or v2 simple metrics are generated from measures.
+- **`metrics: List[str]`**: unique IDs of explicitly declared metrics in this file.
+- **`generated_metrics: List[str]`**: legacy field; use `fix_metrics_from_measures()` to migrate to `metrics_from_measures`.
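
The bookkeeping described by these fields can be pictured with a small stand-in class. This is a sketch only: `SchemaFileSketch` and its two methods are hypothetical names, not the real `SchemaSourceFile` API, and the unique IDs are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaFileSketch:
    """Hypothetical stand-in for the tracking fields on SchemaSourceFile."""

    semantic_models: List[str] = field(default_factory=list)
    node_patches: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)

    def record_semantic_model(self, unique_id: str, sm_name: str, generated: List[str]) -> None:
        # v1 and v2 SMs are tracked in the same list.
        self.semantic_models.append(unique_id)
        if generated:
            # Auto-generated metrics are keyed by SM name, not by unique ID.
            self.metrics_from_measures.setdefault(sm_name, []).extend(generated)

    def forget_semantic_model(self, unique_id: str, sm_name: str) -> List[str]:
        # Returns the auto-generated metric IDs that would also need removing.
        self.semantic_models.remove(unique_id)
        return self.metrics_from_measures.pop(sm_name, [])


schema_file = SchemaFileSketch(node_patches=["model.my_project.fct_revenue"])
schema_file.record_semantic_model(
    "semantic_model.my_project.revenue", "revenue", ["metric.my_project.revenue"]
)
orphaned = schema_file.forget_semantic_model("semantic_model.my_project.revenue", "revenue")
print(orphaned)
```

The point of the sketch is that removing an SM touches two fields at once, while the patched model node in `node_patches` is tracked separately.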
+
+## Parsing Flow
+
+### v1 Standalone
+
+```
+SchemaParser.parse_yaml()
+  └── SemanticModelParser.parse()
+        ├── reads UnparsedSemanticModel from YAML
+        ├── calls _parse_semantic_model_helper()
+        │     ├── adds SemanticModel to manifest.semantic_models
+        │     └── appends unique_id to schema_file.semantic_models
+        └── optionally: MetricParser for create_metric measures
+              └── appends to schema_file.metrics_from_measures[sm_name]
+```
+
+### v2 Inline
+
+```
+SchemaParser.parse_yaml()
+  └── ModelPatcher.parse_patch()
+        └── patch_node_properties(node, patch)  [schemas.py]
+              ├── sets node.access, node.version, etc.
+              ├── if semantic_model_enabled:
+              │     SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()
+              │       ├── _parse_v2_column_dimensions(patch.columns)
+              │       ├── _parse_v2_column_entities(patch.columns)
+              │       └── _parse_semantic_model_helper(model=f"ref('{patch.name}')", ...)
+              └── MetricParser.parse_v2_metrics_from_dbt_model_patch(patch)
+```
+
+The v2 SM's `model` field is always set to `f"ref('{model_name}')"`; this is the reliable way to identify which model a v2 SM was derived from.
+
+## Partial Parsing Considerations
+
+### v1 SMs: handled correctly
+
+v1 SMs are diffed via the `semantic_models:` key in `handle_schema_file_changes()`. Added/changed/deleted v1 SM entries invoke `delete_schema_semantic_model()`, which removes the SM from the manifest and from `schema_file.semantic_models`, and cleans up `metrics_from_measures`.
+
+### v2 SMs: require special handling (DI-3697)
+
+v2 SMs are **not** represented under `dict_from_yaml["semantic_models"]`, so the normal `semantic_models:` key diff never processes them. When a model entry is changed or deleted, `_delete_schema_mssa_links()` is called, which handles the model node and tests, but historically did not clean up the associated v2 SM.
+
+**The fix (merged in DI-3697):** `_delete_schema_mssa_links()` now calls `_delete_v2_semantic_model_for_model()` for `dict_key == "models"`. This method:
+
+1. 
Computes `model_ref = f"ref('{model_name}')"`, the string `_parse_semantic_model_helper` stores in `sm.model`
+2. Collects names of v1 SMs from `schema_file.dict_from_yaml["semantic_models"]` to avoid touching them
+3. Iterates `schema_file.semantic_models`, finds entries where `sm.model == model_ref and sm.name not in v1_sm_names`, removes them and cleans up `metrics_from_measures`
+
+**Distinguishing v1 from v2 SMs in the manifest:** An SM in `schema_file.semantic_models` is v2 if its name does **not** match the `name` of any entry in `schema_file.dict_from_yaml.get("semantic_models", [])`. A v2 SM's `sm.model` will also match `ref('<model_name>')` for the model entry it was derived from.
+
+### `_schedule_for_parsing` limitation
+
+`schedule_nodes_for_parsing()` can schedule SMs for re-parse when their dependencies change (via `child_map`). However, it uses `_schedule_for_parsing("semantic_models", ...)`, which looks up the SM in `schema_file.dict_from_yaml["semantic_models"]`, a lookup that silently fails for v2 SMs. If a v2 SM's children (e.g. saved queries) change and trigger a re-parse of the SM, this path will not find the SM to re-merge. This is a known limitation as of dbt 1.12.
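
The three-step cleanup described above for `_delete_v2_semantic_model_for_model()` can be sketched as follows. This is an illustration of the logic only, not the dbt-core implementation: `FakeSemanticModel`, `FakeSchemaFile`, the free function, and the plain dict standing in for `manifest.semantic_models` are all hypothetical, and the v2 SM is assumed to be named after its model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeSemanticModel:
    name: str
    model: str  # e.g. "ref('fct_revenue')"


@dataclass
class FakeSchemaFile:
    dict_from_yaml: Dict[str, Any]
    semantic_models: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)


def delete_v2_semantic_model_for_model(
    schema_file: FakeSchemaFile,
    manifest_sms: Dict[str, FakeSemanticModel],
    model_name: str,
) -> List[str]:
    # Step 1: the string stored in sm.model for SMs derived from this model.
    model_ref = f"ref('{model_name}')"
    # Step 2: names of v1 SMs declared in this file, which must be left alone.
    v1_sm_names = {sm["name"] for sm in schema_file.dict_from_yaml.get("semantic_models", [])}
    # Step 3: remove matching v2 SMs and their auto-generated metrics.
    removed = []
    for unique_id in list(schema_file.semantic_models):
        sm = manifest_sms.get(unique_id)
        if sm is None:
            continue
        if sm.model == model_ref and sm.name not in v1_sm_names:
            del manifest_sms[unique_id]
            schema_file.semantic_models.remove(unique_id)
            schema_file.metrics_from_measures.pop(sm.name, None)
            removed.append(unique_id)
    return removed


manifest_sms = {
    # v1 SM that happens to reference the same model
    "semantic_model.proj.revenue": FakeSemanticModel("revenue", "ref('fct_revenue')"),
    # v2 SM derived from the model entry (name assumed for illustration)
    "semantic_model.proj.fct_revenue": FakeSemanticModel("fct_revenue", "ref('fct_revenue')"),
}
schema_file = FakeSchemaFile(
    dict_from_yaml={"semantic_models": [{"name": "revenue"}], "models": [{"name": "fct_revenue"}]},
    semantic_models=list(manifest_sms),
    metrics_from_measures={"fct_revenue": ["metric.proj.total_revenue"]},
)
removed = delete_v2_semantic_model_for_model(schema_file, manifest_sms, "fct_revenue")
print(removed)
```

The example also shows why step 2 is needed: both SMs share the same `model` reference, so matching on `sm.model` alone would delete the v1 SM as well.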
+
+## Testing Patterns
+
+### Test locations
+
+| Test type | Location |
+|---|---|
+| v1 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_parsing.py` |
+| v1 partial parsing | `tests/functional/semantic_models/test_semantic_model_parsing.py` (`TestSemanticModelPartialParsing*`) |
+| v2 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` |
+| v2 partial parsing | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` (`TestV2SemanticModel*PartialParsing*`) |
+| v2 column-level parsing | `tests/unit/parser/test_v2_column_semantic_parsing.py` |
+| Partial parsing with metrics + SMs | `tests/functional/partial_parsing/test_pp_metrics.py` |
+
+### Functional test pattern for partial parsing
+
+```python
+class TestV2SemanticModelPartialParsingChanged:
+    @pytest.fixture(scope="class")
+    def models(self):
+        return {
+            "schema.yml": some_v2_fixture_yml,
+            "fct_revenue.sql": fct_revenue_sql,
+            "metricflow_time_spine.sql": metricflow_time_spine_sql,
+        }
+
+    def test_partial_parsing_does_not_duplicate(self, project):
+        from dbt.tests.util import write_file
+
+        runner = dbtTestRunner()
+        result = runner.invoke(["parse"])  # full parse
+        assert result.success
+        assert len(result.result.semantic_models) == 1
+
+        write_file(modified_yml, project.project_root, "models", "schema.yml")
+
+        result = runner.invoke(["parse"])  # partial parse
+        assert result.success
+        assert len(result.result.semantic_models) == 1  # not 2
+```
+
+Key: the second `runner.invoke(["parse"])` uses the saved `partial_parse.msgpack` from the first run. Changing the YAML file on disk triggers partial parsing of that file's changed elements.
+
+### Fixtures
+
+Shared YAML and SQL fixtures live in `tests/functional/semantic_models/fixtures.py`. v2 fixtures are named with the `_v2` suffix (e.g. `semantic_model_schema_yml_v2`, `base_schema_yml_v2`). 
The template fixture `semantic_model_schema_yml_v2_template_for_model_configs` uses a `{semantic_model_value}` placeholder for parameterizing the `semantic_model:` field value.
+
+## See Also
+
+- [Troubleshooting: Semantic Layer Parse Failures](../troubleshooting/semantic_layer_parse_failures.md): common causes of `dbt parse` errors for semantic models and metrics, and how to improve the error messages they produce.
diff --git a/docs/troubleshooting/semantic_layer_parse_failures.md b/docs/troubleshooting/semantic_layer_parse_failures.md
new file mode 100644
index 00000000000..8a8a3b853df
--- /dev/null
+++ b/docs/troubleshooting/semantic_layer_parse_failures.md
@@ -0,0 +1,85 @@
+# Troubleshooting: Semantic Layer Parse Failures
+
+This document covers common causes of `dbt parse` failures related to semantic
+models and metrics, and how to fix or improve the errors produced.
+
+## Extra fields on YAML config objects produce vague errors
+
+When a user adds an unrecognised field to a YAML config object (e.g. inside
+`semantic_model:`, a `dimension:`, or a `metric:`), dbt's JSON Schema validator
+rejects it but the default error message is unhelpful: it names the whole
+object rather than the offending key:
+
+```
+Invalid models config given in models/schema.yml @ models: {...} - at path
+['semantic_model']: {...} is not valid under any of the given schemas
+```
+
+**How to improve the error:** Add a `validate()` classmethod to the relevant
+`Unparsed*` dataclass in `core/dbt/contracts/graph/unparsed.py`. Compare
+`cls.__dataclass_fields__` against the incoming `data` dict before calling
+`super().validate(data)`, and raise a `ValidationError` that names the unknown
+field(s) and lists the valid ones. `UnparsedSemanticModelConfig.validate()` is
+the reference implementation.
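
A minimal sketch of the `validate()` override pattern described above, kept self-contained so it runs without dbt installed. `ValidationError`, `BaseContract`, and `UnparsedConfigSketch` are hypothetical stand-ins defined here for illustration; the real base class, exception type, and message wording in dbt-core may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


class ValidationError(Exception):
    """Stand-in for the validation error type; defined locally for this sketch."""


class BaseContract:
    @classmethod
    def validate(cls, data: Dict[str, Any]) -> None:
        """Placeholder for the JSON Schema validation done by the real base class."""


@dataclass
class UnparsedConfigSketch(BaseContract):
    name: Optional[str] = None
    enabled: bool = True
    description: Optional[str] = None

    @classmethod
    def validate(cls, data: Dict[str, Any]) -> None:
        # Check for unknown keys first so the specific error fires
        # before the generic schema failure can.
        valid = set(cls.__dataclass_fields__)
        unknown = sorted(set(data) - valid)
        if unknown:
            raise ValidationError(
                f"Unknown field(s) {unknown} in semantic_model config; "
                f"valid fields are {sorted(valid)}"
            )
        super().validate(data)


message = ""
try:
    UnparsedConfigSketch.validate({"enabled": True, "descripton": "typo"})
except ValidationError as exc:
    message = str(exc)
print(message)
```

Because the check runs before `super().validate(data)`, the user sees the misspelled key and the list of accepted keys rather than a failure reported against the whole object.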
+
+When adding a test for such a check, use `ContractTestCase.assert_fails_validation_with_message()`
+(in `tests/unit/utils/__init__.py`) to assert both that validation fails *and*
+that the error message is actionable.
+
+If you need a clear PR example, refer to PR #12766.
+
+## Union-typed fields produce even vaguer errors
+
+Several fields in `unparsed.py` use `Union[SomeConfig, bool, None]` (e.g.
+`UnparsedModelUpdate.semantic_model`). When validation fails on the `SomeConfig`
+branch, JSON Schema exhausts all branches of the `anyOf` and reports failure
+against the union as a whole, giving no indication of which branch failed or
+why:
+
+```
+at path ['semantic_model']: {'enabled': True, 'name': 'purchases', 'description':
+'...'} is not valid under any of the given schemas
+```
+
+**How to improve the error:** The same `validate()` override approach works here.
+By checking the sub-object's fields before `super().validate(data)` runs, the
+specific error fires first and the opaque union failure is never reached.
+
+## Simple metrics must be nested under the model entry
+
+Simple v2 metrics must be written under the model entry (`models[].metrics`),
+not as a top-level `metrics:` key. A top-level `metrics:` key is valid for
+derived, conversion, and cumulative metrics, but **not** for simple ones. Using
+it for a simple metric raises:
+
+```
+simple metrics in v2 YAML must be attached to semantic_model
+```
+
+To fix this, move metrics of type `simple` into a `metrics:` list indented under
+the model entry (at the same level as `columns:`):
+
+```yaml
+# Wrong: uses a top-level metrics: key
+models:
+  - name: fct_revenue
+    semantic_model: true
+    columns: ...
+
+metrics:
+  - name: total_revenue  # fails: simple metric cannot be standalone
+    type: simple
+    agg: sum
+    expr: revenue
+
+# Right: metrics nested under the model entry
+models:
+  - name: fct_revenue
+    semantic_model: true
+    columns: ...
+    metrics:
+      - name: total_revenue
+        type: simple
+        agg: sum
+        expr: revenue
+```
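
As a quick way to spot this mistake before running `dbt parse`, the check can be approximated on an already-loaded schema dict. This is a sketch under stated assumptions: `find_standalone_simple_metrics` is a hypothetical helper, not a dbt-core function, and it assumes the file is written in the v2 format described above.

```python
from typing import Any, Dict, List


def find_standalone_simple_metrics(dict_from_yaml: Dict[str, Any]) -> List[str]:
    """Names of top-level metrics whose type is 'simple' (assumes v2 YAML)."""
    return [
        metric["name"]
        for metric in dict_from_yaml.get("metrics", [])
        if metric.get("type") == "simple"
    ]


simple_metric = {"name": "total_revenue", "type": "simple", "agg": "sum", "expr": "revenue"}

# Wrong layout: the simple metric sits under a top-level `metrics:` key.
wrong = {
    "models": [{"name": "fct_revenue", "semantic_model": True}],
    "metrics": [simple_metric],
}
# Right layout: the same metric is nested under the model entry.
right = {
    "models": [{"name": "fct_revenue", "semantic_model": True, "metrics": [simple_metric]}],
}

print(find_standalone_simple_metrics(wrong))
print(find_standalone_simple_metrics(right))
```

Any name returned for a v2 file is a metric that would need to move under its model entry.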