13 changes: 13 additions & 0 deletions AGENTS.md
@@ -1,5 +1,18 @@
# AGENTS.md — AI Coding Agent Guidelines for dbt-core

## Architecture Documentation

Before investigating parsing bugs or adding new resource types, read the relevant doc in `docs/arch/`:

| Doc | Covers |
|---|---|
| `3_Parsing.md` | Full parse flow, `ManifestLoader`, `SchemaParser`, parser hierarchy |
| `3.1_Partial_Parsing.md` | Partial parse internals, `PartialParsing` class, file diff and change detection |
| `3.2_Deferral.md` | State-based deferral |
| `3.3_Semantic_Models.md` | Semantic model parsing (v1 standalone vs v2 inline), partial parsing edge cases, key files |

These docs describe where things live and how they connect — read them before doing exploratory code search.

## Project Overview

dbt-core is the open-source core of [dbt](https://www.getdbt.com/) (data build tool). It transforms data in warehouses by running SQL and Python models, managing dependencies, and producing artifacts. The main Python package lives in `core/` and is built with Hatch/Hatchling.
181 changes: 181 additions & 0 deletions docs/arch/3.3_Semantic_Models.md
@@ -0,0 +1,181 @@
# Semantic Model Parsing

## Overview

Semantic models are first-class resources in dbt-core that expose model data to MetricFlow for metric computation. They define the *entities*, *dimensions*, and *measures* of a model in terms the Semantic Layer can query. Parsing produces `SemanticModel` nodes in the manifest, which are later validated by `dbt_semantic_interfaces`.
Contributor


> which are later validated by `dbt_semantic_interfaces`

Soon to be out of date 😂 No change needed here yet, just found it entertaining


## Two Authoring Formats

dbt-core supports two YAML formats for defining semantic models. Understanding the distinction is essential when debugging parsing or partial parsing issues.

### v1: Standalone (top-level `semantic_models:` key)

Defined as an independent entry under a top-level `semantic_models:` key in any schema YAML file:

```yaml
semantic_models:
  - name: revenue
    model: ref('fct_revenue')
    entities:
      - name: transaction
        type: primary
    dimensions:
      - name: ds
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: revenue
        agg: sum
        expr: amount
```

Parsed by `SemanticModelParser.parse()` in `schema_yaml_readers.py`. The semantic model is a fully independent entry in the YAML; its `model: ref('...')` field links it to the referenced model node via `depends_on`.
Comment on lines +13 to +33
Contributor


Is v1 deprecated? I.e. do we want to no longer encourage the authoring of v1 metrics? If so we should probably note that in this file.

Contributor Author


I'll update it with a note. The answer is that V2 YAML should be the default in all things going forward, but there are several specific situations where v1 supports things v2 does not, and we are not able to deprecate v1 at this time.


### v2: Inline (on the `models:` entry)

Defined directly on a model entry under the `models:` key, with column-level `dimension` and `entity` annotations:

```yaml
models:
  - name: fct_revenue
    semantic_model: true  # or a config dict: {name: custom_sm_name, enabled: true, ...}
    agg_time_dimension: ds
    columns:
      - name: transaction_id
        entity:
          name: transaction
          type: primary
      - name: ds
        granularity: day
        dimension:
          name: ds
          type: time
      - name: revenue
        # no dimension/entity: becomes a measure candidate
```

The semantic model is **not** a standalone YAML entry. It is created as a side effect of model patching during `SchemaParser.patch_node_properties()` in `schemas.py`, which calls `SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()`. The v2 SM has no entry under `dict_from_yaml["semantic_models"]`.

**Key difference:** v1 SMs are elements of the `semantic_models:` key diff; v2 SMs are a byproduct of the `models:` key diff. This distinction matters for partial parsing (see below).
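
As a rough illustration of that distinction, the sketch below classifies the semantic models declared in an already-loaded schema YAML dict. `classify_semantic_models` is a hypothetical helper, not part of dbt-core, and it assumes that a v2 SM defaults to the model's name and that a config dict is enabled unless it says otherwise.

```python
from typing import Any, Dict, List


def classify_semantic_models(dict_from_yaml: Dict[str, Any]) -> Dict[str, List[str]]:
    """Split a schema file's semantic models by authoring format (illustrative only)."""
    # v1: standalone entries under the top-level `semantic_models:` key.
    v1 = [entry["name"] for entry in dict_from_yaml.get("semantic_models", [])]

    # v2: model entries whose `semantic_model` is `true` or a config dict.
    v2 = []
    for model in dict_from_yaml.get("models", []):
        value = model.get("semantic_model")
        if value is True:
            # Assumption: the SM takes the model's name by default.
            v2.append(model["name"])
        elif isinstance(value, dict) and value.get("enabled", True):
            # Assumption: a config dict may rename the SM via `name`.
            v2.append(value.get("name", model["name"]))
    return {"v1": v1, "v2": v2}


schema = {
    "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
    "models": [
        {"name": "fct_revenue", "semantic_model": True},
        {"name": "dim_customer"},
    ],
}
result = classify_semantic_models(schema)
```

Here `result["v1"]` comes only from the `semantic_models:` key and `result["v2"]` only from the `models:` key, which is why a diff of one key never sees the other format.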

## Key Files

| File | Role |
|---|---|
| `core/dbt/contracts/graph/unparsed.py` | `UnparsedSemanticModel` (v1 contract), `UnparsedSemanticModelConfig` / `UnparsedModelUpdate` (v2 contract) |
| `core/dbt/parser/schema_yaml_readers.py` | `SemanticModelParser` — `parse()` for v1, `parse_v2_semantic_model_from_dbt_model_patch()` for v2, shared `_parse_semantic_model_helper()` |
| `core/dbt/parser/schemas.py` | `SchemaParser.patch_node_properties()` — triggers v2 SM creation; `MetricParser.parse_v2_metrics_from_dbt_model_patch()` |
| `core/dbt/contracts/files.py` | `SchemaSourceFile` — tracks SM unique IDs and metrics per file |
| `core/dbt/parser/partial.py` | `PartialParsing` — handles SM lifecycle during incremental re-parse |
| `core/dbt/artifacts/resources/v1/semantic_layer_components.py` | `SemanticModel`, `Dimension`, `Entity`, `Measure` artifact definitions |

## `SchemaSourceFile` Tracking Fields

`SchemaSourceFile` (in `files.py`) maintains per-file lists of parsed resource IDs. For semantic models and metrics:

- **`semantic_models: List[str]`** — unique IDs of all SMs in this file, both v1 and v2. v2 SM unique IDs are appended here when `_parse_semantic_model_helper()` runs.
- **`node_patches: List[str]`** (alias `ndp`) — unique IDs of model/seed/snapshot nodes patched by this file. A model with `semantic_model: true` will have its model node ID here.
- **`metrics_from_measures: Dict[str, List[str]]`** — auto-generated metrics keyed by semantic model name. Populated when `create_metric: true` (v1) or v2 simple metrics are generated from measures.
- **`metrics: List[str]`** — unique IDs of explicitly declared metrics in this file.
- **`generated_metrics: List[str]`** — legacy field; use `fix_metrics_from_measures()` to migrate to `metrics_from_measures`.
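
To make the shape of these fields concrete, here is a minimal sketch using a hypothetical stand-in dataclass. `SchemaFileTracking` is not the real `SchemaSourceFile`, and the unique IDs are made up for a project assumed to be called `my_project`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaFileTracking:
    """Hypothetical stand-in for the SM-related fields of SchemaSourceFile."""

    semantic_models: List[str] = field(default_factory=list)
    node_patches: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)
    metrics: List[str] = field(default_factory=list)


# State after parsing a file holding one model with `semantic_model: true`
# and one metric generated from its measures (illustrative IDs only).
tracking = SchemaFileTracking()
tracking.node_patches.append("model.my_project.fct_revenue")
tracking.semantic_models.append("semantic_model.my_project.fct_revenue")
tracking.metrics_from_measures.setdefault("fct_revenue", []).append(
    "metric.my_project.revenue"
)
```

Note that `metrics_from_measures` is keyed by semantic model name while the other lists hold unique IDs, so cleanup code has to look up the SM's name before it can drop the generated metrics.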

## Parsing Flow

### v1 Standalone

```
SchemaParser.parse_yaml()
  └── SemanticModelParser.parse()
        ├── reads UnparsedSemanticModel from YAML
        ├── calls _parse_semantic_model_helper()
        │     └── adds SemanticModel to manifest.semantic_models
        │     └── appends unique_id to schema_file.semantic_models
        └── optionally: MetricParser for create_metric measures
              └── appends to schema_file.metrics_from_measures[sm_name]
```

### v2 Inline

```
SchemaParser.parse_yaml()
  └── ModelPatcher.parse_patch()
        └── patch_node_properties(node, patch)  [schemas.py]
              ├── sets node.access, node.version, etc.
              ├── if semantic_model_enabled:
              │     SemanticModelParser.parse_v2_semantic_model_from_dbt_model_patch()
              │       ├── _parse_v2_column_dimensions(patch.columns)
              │       ├── _parse_v2_column_entities(patch.columns)
              │       └── _parse_semantic_model_helper(model=f"ref('{patch.name}')", ...)
              └── MetricParser.parse_v2_metrics_from_dbt_model_patch(patch)
```

The v2 SM's `model` field is always set to `f"ref('{model_name}')"` — this is the reliable way to identify which model a v2 SM was derived from.
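
A small sketch of going the other way, from the stored `model` string back to the model name. `model_name_from_ref` is a hypothetical helper, and it deliberately matches only the exact single-quoted form shown above; hand-written v1 values such as `ref("x")` would not match.

```python
import re
from typing import Optional

# Matches exactly the form produced for v2 SMs: ref('<model_name>')
_REF_PATTERN = re.compile(r"^ref\('([^']+)'\)$")


def model_name_from_ref(model_field: str) -> Optional[str]:
    """Return the model name from a ref('<name>') string, or None if it does not match."""
    match = _REF_PATTERN.match(model_field.strip())
    return match.group(1) if match else None


name = model_name_from_ref("ref('fct_revenue')")
```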

## Partial Parsing Considerations

### v1 SMs — handled correctly

v1 SMs are diffed via the `semantic_models:` key in `handle_schema_file_changes()`. Added/changed/deleted v1 SM entries invoke `delete_schema_semantic_model()`, which removes the SM from the manifest and from `schema_file.semantic_models`, and cleans up `metrics_from_measures`.

### v2 SMs — require special handling (DI-3697)

v2 SMs are **not** represented under `dict_from_yaml["semantic_models"]`, so the normal `semantic_models:` key diff never processes them. When a model entry is changed or deleted, `_delete_schema_mssa_links()` is called, which handles the model node and tests — but historically did not clean up the associated v2 SM.

**The fix (merged in DI-3697):** `_delete_schema_mssa_links()` now calls `_delete_v2_semantic_model_for_model()` for `dict_key == "models"`. This method:

1. Computes `model_ref = f"ref('{model_name}')"` — the string `_parse_semantic_model_helper` stores in `sm.model`
2. Collects names of v1 SMs from `schema_file.dict_from_yaml["semantic_models"]` to avoid touching them
3. Iterates `schema_file.semantic_models`, finds entries where `sm.model == model_ref and sm.name not in v1_sm_names`, removes them and cleans up `metrics_from_measures`
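
The three steps above can be sketched as follows. This is a simplified model of the logic, not the code in `partial.py`: `FakeSemanticModel`, `FakeSchemaFile`, and the plain dict standing in for `manifest.semantic_models` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeSemanticModel:
    name: str
    model: str


@dataclass
class FakeSchemaFile:
    dict_from_yaml: Dict[str, Any]
    semantic_models: List[str] = field(default_factory=list)
    metrics_from_measures: Dict[str, List[str]] = field(default_factory=dict)


def delete_v2_semantic_model_for_model(
    schema_file: FakeSchemaFile,
    manifest_sms: Dict[str, FakeSemanticModel],
    model_name: str,
) -> List[str]:
    # Step 1: the string stored in sm.model for a v2 SM.
    model_ref = f"ref('{model_name}')"
    # Step 2: names of v1 SMs declared in this file, which must be left alone.
    v1_sm_names = {
        entry["name"] for entry in schema_file.dict_from_yaml.get("semantic_models", [])
    }
    # Step 3: remove matching v2 SMs and their generated metrics.
    removed = []
    for unique_id in list(schema_file.semantic_models):  # copy: we mutate the list
        sm = manifest_sms.get(unique_id)
        if sm is None:
            continue
        if sm.model == model_ref and sm.name not in v1_sm_names:
            del manifest_sms[unique_id]
            schema_file.semantic_models.remove(unique_id)
            schema_file.metrics_from_measures.pop(sm.name, None)
            removed.append(unique_id)
    return removed


schema_file = FakeSchemaFile(
    dict_from_yaml={
        "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
        "models": [{"name": "fct_revenue", "semantic_model": True}],
    },
    semantic_models=["semantic_model.pkg.revenue", "semantic_model.pkg.fct_revenue"],
    metrics_from_measures={
        "revenue": ["metric.pkg.revenue"],
        "fct_revenue": ["metric.pkg.total"],
    },
)
manifest_sms = {
    "semantic_model.pkg.revenue": FakeSemanticModel("revenue", "ref('fct_revenue')"),
    "semantic_model.pkg.fct_revenue": FakeSemanticModel("fct_revenue", "ref('fct_revenue')"),
}
removed = delete_v2_semantic_model_for_model(schema_file, manifest_sms, "fct_revenue")
```

Both SMs in the example reference the same model, so the name check in step 2 is what keeps the v1 SM `revenue` from being deleted along with the v2 one.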

**Distinguishing v1 from v2 SMs in the manifest:** An SM in `schema_file.semantic_models` is v2 if its name does **not** match the `name` of any entry in `schema_file.dict_from_yaml.get("semantic_models", [])`. Its `sm.model` will also equal `ref('<the_model_name>')`, but that check alone is not sufficient, because a v1 SM can reference the same model.

### `_schedule_for_parsing` limitation

`schedule_nodes_for_parsing()` can schedule SMs for re-parse when their dependencies change (via `child_map`). However, it uses `_schedule_for_parsing("semantic_models", ...)` which looks up the SM in `schema_file.dict_from_yaml["semantic_models"]` — a lookup that silently fails for v2 SMs. If a v2 SM's children (e.g. saved queries) change and trigger a re-parse of the SM, this path will not find the SM to re-merge. This is a known limitation as of dbt 1.12.
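
A minimal sketch of why that lookup misses v2 SMs. `find_yaml_entry` is a hypothetical simplification of a name-based lookup under a single YAML key; it is not the actual `_schedule_for_parsing` implementation.

```python
from typing import Any, Dict, Optional


def find_yaml_entry(
    dict_from_yaml: Dict[str, Any], dict_key: str, name: str
) -> Optional[Dict[str, Any]]:
    """Return the entry called `name` under `dict_key`, or None if it is absent."""
    for entry in dict_from_yaml.get(dict_key, []):
        if entry.get("name") == name:
            return entry
    return None


dict_from_yaml = {
    "semantic_models": [{"name": "revenue", "model": "ref('fct_revenue')"}],
    "models": [{"name": "fct_revenue", "semantic_model": True}],
}

# The v1 SM is found under `semantic_models`.
v1_entry = find_yaml_entry(dict_from_yaml, "semantic_models", "revenue")
# The v2 SM has no entry there, so the lookup returns None without an error.
v2_entry = find_yaml_entry(dict_from_yaml, "semantic_models", "fct_revenue")
# Its YAML actually lives under `models`.
v2_source = find_yaml_entry(dict_from_yaml, "models", "fct_revenue")
```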

## Testing Patterns

### Test locations

| Test type | Location |
|---|---|
| v1 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_parsing.py` |
| v1 partial parsing | `tests/functional/semantic_models/test_semantic_model_parsing.py` — `TestSemanticModelPartialParsing*` |
| v2 parsing (full parse) | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` |
| v2 partial parsing | `tests/functional/semantic_models/test_semantic_model_v2_parsing.py` — `TestV2SemanticModel*PartialParsing*` |
| v2 column-level parsing | `tests/unit/parser/test_v2_column_semantic_parsing.py` |
| Partial parsing with metrics + SMs | `tests/functional/partial_parsing/test_pp_metrics.py` |

### Functional test pattern for partial parsing

```python
import pytest

from dbt.tests.util import write_file

# dbtTestRunner, the fixture strings, and modified_yml are assumed to be
# imported or defined elsewhere in the test module.


class TestV2SemanticModelPartialParsingChanged:
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "schema.yml": some_v2_fixture_yml,
            "fct_revenue.sql": fct_revenue_sql,
            "metricflow_time_spine.sql": metricflow_time_spine_sql,
        }

    def test_partial_parsing_does_not_duplicate(self, project):
        runner = dbtTestRunner()
        result = runner.invoke(["parse"])  # full parse
        assert result.success
        assert len(result.result.semantic_models) == 1

        # Rewrite the schema file on disk so the next parse sees a change.
        write_file(modified_yml, project.project_root, "models", "schema.yml")

        result = runner.invoke(["parse"])  # partial parse
        assert result.success
        assert len(result.result.semantic_models) == 1  # not 2
```

Key: the second `runner.invoke(["parse"])` uses the saved `partial_parse.msgpack` from the first run. Changing the YAML file on disk triggers partial parsing of that file's changed elements.

### Fixtures

Shared YAML and SQL fixtures live in `tests/functional/semantic_models/fixtures.py`. v2 fixtures are named with the `_v2` suffix (e.g. `semantic_model_schema_yml_v2`, `base_schema_yml_v2`). The template fixture `semantic_model_schema_yml_v2_template_for_model_configs` uses a `{semantic_model_value}` placeholder for parameterizing the `semantic_model:` field value.
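
As a rough sketch of how such a placeholder template can be expanded, the example below uses `str.format`. The template text and the list of values are hypothetical; the real fixture in `fixtures.py` may differ in content.

```python
# Hypothetical template with the same placeholder name as the shared fixture.
template = """
models:
  - name: fct_revenue
    semantic_model: {semantic_model_value}
"""

# Values to substitute. Braces inside a substituted value are not
# re-interpreted by str.format, so a YAML flow mapping can be passed as-is.
values = ["true", "false", "{name: custom_sm_name, enabled: true}"]

rendered = [template.format(semantic_model_value=value) for value in values]
```

Each rendered string can then be used as the `schema.yml` content for one test case, for example through `pytest.mark.parametrize`.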