Merged
Changes from 3 commits
13 changes: 13 additions & 0 deletions AGENTS.md
@@ -59,6 +59,19 @@ Import order should follow isort conventions:
3. Third-party
4. dbt-internal (`dbt`, `dbt_common`, `dbt_adapters`, `dbt_extractor`, `dbt_semantic_interfaces`)

## Architecture Documentation

Before investigating parsing bugs or adding new resource types, read the relevant doc in `docs/arch/`:

| Doc | Covers |
|---|---|
| `3_Parsing.md` | Full parse flow, `ManifestLoader`, `SchemaParser`, parser hierarchy |
| `3.1_Partial_Parsing.md` | Partial parse internals, `PartialParsing` class, file diff and change detection |
| `3.2_Deferral.md` | State-based deferral |
| `3.3_Semantic_Models.md` | Semantic model parsing (v1 standalone vs v2 inline), partial parsing edge cases, key files |

These docs describe where things live and how they connect — read them before doing exploratory code search.

## Key Architectural Conventions

### Artifact Resources: Import from `dbt.artifacts.resources`, Not Versioned Paths
185 changes: 185 additions & 0 deletions docs/arch/3.3_Semantic_Models.md
@@ -0,0 +1,185 @@
# Semantic Model Parsing

## Overview

Semantic models are first-class resources in dbt-core that expose model data to MetricFlow for metric computation. They define the *entities*, *dimensions*, and *measures* of a model in terms the Semantic Layer can query. Parsing produces `SemanticModel` nodes in the manifest, which are later validated by `dbt_semantic_interfaces`.
Contributor


which are later validated by dbt_semantic_interfaces

Soon to be out of date 😂 No change needed here yet, just found it entertaining


## Two Authoring Formats

dbt-core supports two YAML formats for defining semantic models. Understanding the distinction is essential when debugging parsing or partial parsing issues.

### v1: Standalone (top-level `semantic_models:` key)

Defined as an independent entry under a top-level `semantic_models:` key in any schema YAML file:

```yaml
semantic_models:
  - name: revenue
    model: ref('fct_revenue')
    entities:
      - name: transaction
        type: primary
    dimensions:
      - name: ds
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: revenue
        agg: sum
        expr: amount
```

Parsed by `SemanticModelParser.parse()` in `schema_yaml_readers.py`. The semantic model is a fully independent entry in the YAML; its `model: ref('...')` field links it to the referenced model node via `depends_on`.
Comment on lines +15 to +35
Contributor


Is v1 deprecated? I.e. do we want to no longer encourage the authoring of v1 metrics? If so we should probably note that in this file.

Contributor Author


I'll update it with a note. The answer is that V2 YAML should be the default in all things going forward, but there are several specific situations where v1 supports things v2 does not, and we are not able to deprecate v1 at this time.

Contributor Author


I added a line specifying this.


### v2: Inline (on the `models:` entry)

Defined directly on a model entry under the `models:` key, with column-level `dimension` and `entity` annotations:

```yaml
models:
  - name: fct_revenue
    semantic_model: true  # or a config dict: {name: custom_sm_name, enabled: true, ...}
    agg_time_dimension: ds
    columns:
      - name: transaction_id
        entity:
          name: transaction
          type: primary
      - name: ds
        granularity: day
        dimension:
          name: ds
          type: time
      - name: revenue
        # no dimension/entity: becomes a measure candidate
```

The semantic model is **not** a standalone YAML entry. It is created as a side effect of model patching during `SchemaParser.patch_node_properties()` in `schemas.py`, which calls `SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()`. The v2 SM has no entry under `dict_from_yaml["semantic_models"]`.

**Key difference:** v1 SMs are elements of the `semantic_models:` key diff; v2 SMs are a byproduct of the `models:` key diff. This distinction matters for partial parsing (see below).

## Key Files

| File | Role |
|---|---|
| `core/dbt/contracts/graph/unparsed.py` | `UnparsedSemanticModel` (v1 contract), `UnparsedSemanticModelConfig` / `UnparsedModelUpdate` (v2 contract) |
| `core/dbt/parser/schema_yaml_readers.py` | `SemanticModelParser` — `parse()` for v1, `parse_v2_semantic_model_from_dbt_model_patch()` for v2, shared `_parse_semantic_model_helper()` |
| `core/dbt/parser/schemas.py` | `SchemaParser.patch_node_properties()` — triggers v2 SM creation; `MetricParser.parse_v2_metrics_from_dbt_model_patch()` |
| `core/dbt/contracts/files.py` | `SchemaSourceFile` — tracks SM unique IDs and metrics per file |
| `core/dbt/parser/partial.py` | `PartialParsing` — handles SM lifecycle during incremental re-parse |
| `core/dbt/artifacts/resources/v1/semantic_layer_components.py` | `SemanticModel`, `Dimension`, `Entity`, `Measure` artifact definitions |

## `SchemaSourceFile` Tracking Fields

`SchemaSourceFile` (in `files.py`) maintains per-file lists of parsed resource IDs. For semantic models and metrics:

- **`semantic_models: List[str]`** — unique IDs of all SMs in this file, both v1 and v2. v2 SM unique IDs are appended here when `_parse_semantic_model_helper()` runs.
- **`node_patches: List[str]`** (alias `ndp`) — unique IDs of model/seed/snapshot nodes patched by this file. A model with `semantic_model: true` will have its model node ID here.
- **`metrics_from_measures: Dict[str, List[str]]`** — auto-generated metrics keyed by semantic model name. Populated when `create_metric: true` (v1) or v2 simple metrics are generated from measures.
- **`metrics: List[str]`** — unique IDs of explicitly declared metrics in this file.
- **`generated_metrics: List[str]`** — legacy field; use `fix_metrics_from_measures()` to migrate to `metrics_from_measures`.
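
To make the bookkeeping concrete, here is a minimal sketch of how these fields might look after parsing a schema file that contains one v2 model with `semantic_model: true`. `SchemaFileTracking` is a hypothetical stand-in, not the real `SchemaSourceFile`, and the project name `my_project` and the unique ID strings are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaFileTracking:
    """Hypothetical, simplified stand-in for the tracking fields on SchemaSourceFile."""

    semantic_models: List[str] = field(default_factory=list)
    node_patches: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)
    metrics: List[str] = field(default_factory=list)


# Assumed state after parsing one v2 model entry (all IDs are illustrative).
tracking = SchemaFileTracking()
tracking.node_patches.append("model.my_project.fct_revenue")
tracking.semantic_models.append("semantic_model.my_project.fct_revenue")
# Auto-generated metrics are keyed by semantic model *name*, not unique ID.
tracking.metrics_from_measures.setdefault("fct_revenue", []).append(
    "metric.my_project.total_revenue"
)
```

Note that the model node and its v2 SM are tracked in two different lists, which is why cleaning up one does not automatically clean up the other.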

## Parsing Flow

### v1 Standalone

```
SchemaParser.parse_yaml()
  └── SemanticModelParser.parse()
        ├── reads UnparsedSemanticModel from YAML
        ├── calls _parse_semantic_model_helper()
        │     └── adds SemanticModel to manifest.semantic_models
        │     └── appends unique_id to schema_file.semantic_models
        └── optionally: MetricParser for create_metric measures
              └── appends to schema_file.metrics_from_measures[sm_name]
```

### v2 Inline

```
SchemaParser.parse_yaml()
  └── ModelPatcher.parse_patch()
        └── patch_node_properties(node, patch)   [schemas.py]
              ├── sets node.access, node.version, etc.
              ├── if semantic_model_enabled:
              │     SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()
              │       ├── _parse_v2_column_dimensions(patch.columns)
              │       ├── _parse_v2_column_entities(patch.columns)
              │       └── _parse_semantic_model_helper(model=f"ref('{patch.name}')", ...)
              └── MetricParser.parse_v2_metrics_from_dbt_model_patch(patch)
```

The v2 SM's `model` field is always set to `f"ref('{model_name}')"` — this is the reliable way to identify which model a v2 SM was derived from.
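
As a small illustration of that convention, the sketch below builds the `ref('...')` string and recovers the model name from it. Both helper names are hypothetical and do not exist in dbt-core; only the string format comes from the description above.

```python
import re
from typing import Optional

# Matches exactly the shape described above: ref('<model_name>')
_REF_PATTERN = re.compile(r"^ref\('(?P<name>[^']+)'\)$")


def v2_model_ref(model_name: str) -> str:
    """Build the string a v2 SM stores in its `model` field."""
    return f"ref('{model_name}')"


def model_name_from_sm_ref(sm_model: str) -> Optional[str]:
    """Return the model name if sm_model looks like ref('<name>'), else None."""
    match = _REF_PATTERN.match(sm_model)
    return match.group("name") if match else None
```

A v1 SM can carry the same `ref('...')` string, so matching on `model` identifies the source model but does not by itself tell v1 and v2 apart.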

## Partial Parsing Considerations

### v1 SMs — handled correctly

v1 SMs are diffed via the `semantic_models:` key in `handle_schema_file_changes()`. Added/changed/deleted v1 SM entries invoke `delete_schema_semantic_model()`, which removes the SM from the manifest and from `schema_file.semantic_models`, and cleans up `metrics_from_measures`.

### v2 SMs — require special handling (DI-3697)

v2 SMs are **not** represented under `dict_from_yaml["semantic_models"]`, so the normal `semantic_models:` key diff never processes them. When a model entry is changed or deleted, `_delete_schema_mssa_links()` is called, which handles the model node and tests — but historically did not clean up the associated v2 SM.

**The fix (merged in DI-3697):** `_delete_schema_mssa_links()` now calls `_delete_v2_semantic_model_for_model()` for `dict_key == "models"`. This method:

1. Computes `model_ref = f"ref('{model_name}')"` — the string `_parse_semantic_model_helper` stores in `sm.model`
2. Collects names of v1 SMs from `schema_file.dict_from_yaml["semantic_models"]` to avoid touching them
3. Iterates `schema_file.semantic_models`, finds entries where `sm.model == model_ref and sm.name not in v1_sm_names`, removes them and cleans up `metrics_from_measures`
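
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from `partial.py`: `FakeSemanticModel`, `FakeSchemaFile`, and `delete_v2_sms_for_model` are hypothetical names, and the manifest is reduced to a plain dict keyed by unique ID.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeSemanticModel:
    """Hypothetical stand-in for a manifest SemanticModel."""

    unique_id: str
    name: str
    model: str


@dataclass
class FakeSchemaFile:
    """Hypothetical stand-in for SchemaSourceFile."""

    dict_from_yaml: dict
    semantic_models: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)


def delete_v2_sms_for_model(
    schema_file: FakeSchemaFile,
    manifest_sms: Dict[str, FakeSemanticModel],
    model_name: str,
) -> List[str]:
    # Step 1: the string stored in sm.model for a v2 SM derived from this model.
    model_ref = f"ref('{model_name}')"
    # Step 2: names of v1 SMs declared in this file, which must be left alone.
    v1_sm_names = {
        entry["name"]
        for entry in schema_file.dict_from_yaml.get("semantic_models", [])
    }
    # Step 3: remove matching v2 SMs and their auto-generated metrics.
    removed = []
    for unique_id in list(schema_file.semantic_models):  # copy: we mutate below
        sm = manifest_sms.get(unique_id)
        if sm is None or sm.model != model_ref or sm.name in v1_sm_names:
            continue
        schema_file.semantic_models.remove(unique_id)
        schema_file.metrics_from_measures.pop(sm.name, None)
        del manifest_sms[unique_id]
        removed.append(unique_id)
    return removed
```

The v1 name check in step 2 matters because a v1 SM in the same file may reference the same model and would otherwise be deleted along with the v2 SM.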

**Distinguishing v1 from v2 SMs in the manifest:** An SM in `schema_file.semantic_models` is v2 if its name does **not** match the `name` of any entry in `schema_file.dict_from_yaml.get("semantic_models", [])`. Its `sm.model` will also match `ref('<the_model_name>')`, but that check alone is not sufficient, because a v1 SM in the same file can reference the same model.

### `_schedule_for_parsing` limitation

`schedule_nodes_for_parsing()` can schedule SMs for re-parse when their dependencies change (via `child_map`). However, it uses `_schedule_for_parsing("semantic_models", ...)` which looks up the SM in `schema_file.dict_from_yaml["semantic_models"]` — a lookup that silently fails for v2 SMs. If a v2 SM's children (e.g. saved queries) change and trigger a re-parse of the SM, this path will not find the SM to re-merge. This is a known limitation as of dbt 1.12.
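
The silent miss can be illustrated with plain dictionaries. `find_sm_yaml_entry` is a hypothetical helper that mimics a name lookup under the top-level key; it is not the real `_schedule_for_parsing` implementation, and the YAML dicts are made up for the example.

```python
from typing import Optional


def find_sm_yaml_entry(dict_from_yaml: dict, sm_name: str) -> Optional[dict]:
    """Hypothetical lookup of an SM's YAML entry under the top-level key."""
    for entry in dict_from_yaml.get("semantic_models", []):
        if entry.get("name") == sm_name:
            return entry
    return None  # no error is raised, so the miss goes unnoticed


# A file with only a v2 (inline) semantic model has no top-level key at all.
v2_only_yaml = {"models": [{"name": "fct_revenue", "semantic_model": True}]}
v1_yaml = {"semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}]}

v2_entry = find_sm_yaml_entry(v2_only_yaml, "fct_revenue")  # nothing to re-merge
v1_entry = find_sm_yaml_entry(v1_yaml, "revenue")
```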

## Testing Patterns

### Test locations

| Test type | Location |
|---|---|
| v1 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_parsing.py` |
| v1 partial parsing | `tests/functional/semantic_models/test_semantic_model_parsing.py` — `TestSemanticModelPartialParsing*` |
| v2 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` |
| v2 partial parsing | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` — `TestV2SemanticModel*PartialParsing*` |
| v2 column-level parsing | `tests/unit/parser/test_v2_column_semantic_parsing.py` |
| Partial parsing with metrics + SMs | `tests/functional/partial_parsing/test_pp_metrics.py` |

### Functional test pattern for partial parsing

```python
import pytest

from dbt.tests.util import write_file

# dbtTestRunner and the *_yml / *_sql fixtures are assumed to be imported
# from the test suite's own helper and fixture modules.


class TestV2SemanticModelPartialParsingChanged:
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "schema.yml": some_v2_fixture_yml,
            "fct_revenue.sql": fct_revenue_sql,
            "metricflow_time_spine.sql": metricflow_time_spine_sql,
        }

    def test_partial_parsing_does_not_duplicate(self, project):
        runner = dbtTestRunner()
        result = runner.invoke(["parse"])  # full parse
        assert result.success
        assert len(result.result.semantic_models) == 1

        # Overwrite the schema file so the next parse sees a changed file.
        write_file(modified_yml, project.project_root, "models", "schema.yml")

        result = runner.invoke(["parse"])  # partial parse
        assert result.success
        assert len(result.result.semantic_models) == 1  # not 2
```

Key: the second `runner.invoke(["parse"])` uses the saved `partial_parse.msgpack` from the first run. Changing the YAML file on disk triggers partial parsing of that file's changed elements.

### Fixtures

Shared YAML and SQL fixtures live in `tests/functional/semantic_models/fixtures.py`. v2 fixtures are named with the `_v2` suffix (e.g. `semantic_model_schema_yml_v2`, `base_schema_yml_v2`). The template fixture `semantic_model_schema_yml_v2_template_for_model_configs` uses a `{semantic_model_value}` placeholder for parameterizing the `semantic_model:` field value.
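
A minimal sketch of how such a placeholder template can be parameterized with `str.format`. The template text and the `render_schema_yml` helper are hypothetical; the real fixture in `fixtures.py` may contain more fields.

```python
# Hypothetical template; only the placeholder name comes from the text above.
SCHEMA_YML_TEMPLATE = """\
models:
  - name: fct_revenue
    semantic_model: {semantic_model_value}
"""


def render_schema_yml(semantic_model_value: str) -> str:
    """Fill the semantic_model: field with a boolean or an inline config dict."""
    return SCHEMA_YML_TEMPLATE.format(semantic_model_value=semantic_model_value)


enabled_yml = render_schema_yml("true")
# Braces in the substituted value are safe: str.format does not re-parse them.
config_yml = render_schema_yml("{name: custom_sm_name, enabled: true}")
```

One caveat with this approach: any literal braces in the template itself would need to be doubled (`{{` and `}}`) so that `str.format` does not treat them as placeholders.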

## See Also

- [Troubleshooting: Semantic Layer Parse Failures](../troubleshooting/semantic_layer_parse_failures.md) — common causes of `dbt parse` errors for semantic models and metrics, and how to improve the error messages they produce.
85 changes: 85 additions & 0 deletions docs/troubleshooting/semantic_layer_parse_failures.md
@@ -0,0 +1,85 @@
# Troubleshooting: Semantic Layer Parse Failures

This document covers common causes of `dbt parse` failures related to semantic
models and metrics, and how to fix or improve the errors produced.

## Extra fields on YAML config objects produce vague errors

When a user adds an unrecognised field to a YAML config object (e.g. inside
`semantic_model:`, a `dimension:`, or a `metric:`), dbt's JSON Schema validator
rejects it but the default error message is unhelpful — it names the whole
object rather than the offending key:

```
Invalid models config given in models/schema.yml @ models: {...} - at path
['semantic_model']: {...} is not valid under any of the given schemas
```

**How to improve the error:** Add a `validate()` classmethod to the relevant
`Unparsed*` dataclass in `core/dbt/contracts/graph/unparsed.py`. Compare
`cls.__dataclass_fields__` against the incoming `data` dict before calling
`super().validate(data)`, and raise a `ValidationError` that names the unknown
field(s) and lists the valid ones. `UnparsedSemanticModelConfig.validate()` is
the reference implementation.
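
A minimal sketch of the pattern, assuming a base class whose `validate()` performs the JSON Schema check. `ValidationError`, `BaseContract`, and `ExampleSemanticModelConfig` are hypothetical stand-ins defined here so the example is self-contained; consult `UnparsedSemanticModelConfig.validate()` for the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


class ValidationError(Exception):
    """Stand-in for the validation error type (assumption)."""


class BaseContract:
    """Stand-in for the base class whose validate() runs JSON Schema checks."""

    @classmethod
    def validate(cls, data: Dict[str, Any]) -> None:
        pass  # the real implementation would validate against the schema


@dataclass
class ExampleSemanticModelConfig(BaseContract):
    """Hypothetical config object; the field names are illustrative."""

    name: Optional[str] = None
    enabled: bool = True

    @classmethod
    def validate(cls, data: Dict[str, Any]) -> None:
        # Compare incoming keys with the declared dataclass fields first,
        # so the error names the offending key instead of the whole object.
        valid = set(cls.__dataclass_fields__)
        unknown = sorted(set(data) - valid)
        if unknown:
            raise ValidationError(
                f"Unknown field(s) {unknown} in {cls.__name__}; "
                f"valid fields are {sorted(valid)}"
            )
        super().validate(data)
```

Because the field check runs before `super().validate(data)`, a typo such as `descripton` is reported by name rather than surfacing as a generic schema failure.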

When adding such an override, cover it with a unit test that uses
`ContractTestCase.assert_fails_validation_with_message()` (in
`tests/unit/utils/__init__.py`) to assert both that validation fails *and*
that the error message is actionable.

If you need a clear PR example, refer to PR12766.

## Union-typed fields produce even more vague errors

Several fields in `unparsed.py` use `Union[SomeConfig, bool, None]` (e.g.
`UnparsedModelUpdate.semantic_model`). When validation fails on the `SomeConfig`
branch, JSON Schema exhausts all branches of the `anyOf` and reports failure
against the union as a whole — giving no indication of which branch failed or
why:

```
at path ['semantic_model']: {'enabled': True, 'name': 'purchases', 'description':
'...'} is not valid under any of the given schemas
```

**How to improve the error:** The same `validate()` override approach works here.
By checking the sub-object's fields before `super().validate(data)` runs, the
specific error fires first and the opaque union failure is never reached.

## Standalone simple metrics must be nested under the model entry

Simple v2 metrics must be written under the model entry (`models[].metrics`),
not as a top-level `metrics:` key. A top-level `metrics:` key is valid for
derived, conversion, and cumulative metrics — but **not** for simple ones. Using
it for a simple metric raises:

```
simple metrics in v2 YAML must be attached to semantic_model
```

To fix this, move every metric with `type: simple` into a `metrics:` list
indented under the model entry (at the same level as `columns:`):

```yaml
# Wrong: top-level `metrics:` key
models:
  - name: fct_revenue
    semantic_model: true
    columns: ...

metrics:
  - name: total_revenue  # fails: simple metric cannot be standalone
    type: simple
    agg: sum
    expr: revenue

# Right: metrics nested under the model entry
models:
  - name: fct_revenue
    semantic_model: true
    columns: ...
    metrics:
      - name: total_revenue
        type: simple
        agg: sum
        expr: revenue
```