Merged
Changes from 16 commits
23 commits
fa396d9
SAO doc improvements
luna-bianca Nov 28, 2025
e22a2b6
Edits
luna-bianca Nov 28, 2025
2e0a572
Update state-aware-setup.md
luna-bianca Nov 28, 2025
f358e31
More info
luna-bianca Dec 4, 2025
740f1b9
Merge branch 'current' into SAO-doc-improvements
luna-bianca Dec 4, 2025
53aeeaf
Add late-arriving data info
luna-bianca Dec 15, 2025
ef09f5b
Update state-aware-setup.md
luna-bianca Dec 15, 2025
60ea3b4
Merge branch 'current' into SAO-doc-improvements
luna-bianca Dec 17, 2025
7e356e0
Merge branch 'current' into SAO-doc-improvements
luna-bianca Dec 17, 2025
451e1fb
Merge branch 'current' into SAO-doc-improvements
luna-bianca Dec 18, 2025
cb640d8
Merge branch 'current' into SAO-doc-improvements
luna-bianca Jan 6, 2026
0c06e69
Apply comments from Eva and Reuben
luna-bianca Jan 9, 2026
aac82aa
Update state-aware-setup.md
luna-bianca Jan 9, 2026
11dfef3
Merge branch 'current' into SAO-doc-improvements
luna-bianca Jan 9, 2026
b20620f
Merge branch 'current' into SAO-doc-improvements
luna-bianca Jan 9, 2026
15defd7
Merge branch 'current' into SAO-doc-improvements
luna-bianca Jan 9, 2026
8f60d4d
Apply Reuben's comment
luna-bianca Jan 12, 2026
2bfc490
Merge branch 'current' into SAO-doc-improvements
mirnawong1 Jan 12, 2026
a1e1190
Update website/docs/docs/deploy/state-aware-about.md
luna-bianca Jan 12, 2026
4249b04
Update website/docs/docs/deploy/state-aware-about.md
luna-bianca Jan 12, 2026
153bcc1
Merge branch 'current' into SAO-doc-improvements
luna-bianca Jan 12, 2026
aca2786
Add multiline example
luna-bianca Jan 12, 2026
b6d035e
Address Katherine's comments
luna-bianca Jan 12, 2026
@@ -140,14 +140,35 @@ where

Fantastic! We’ve got a working incremental model. On our first run, when there is no corresponding table in the warehouse, `is_incremental` will evaluate to false and we’ll capture the entire table. On subsequent runs it will evaluate to true and we’ll apply our filter logic, capturing only the newer data.

### Late arriving facts
### Late-arriving facts

Our last concern specific to incremental models is what to do when data is inevitably loaded in a less-than-perfect way. Sometimes data loaders will, for a variety of reasons, load data late. Either an entire load comes in late, or some rows arrive in a later load than the one they belong to. The following is best practice for every incremental model to slow down the drift this can cause.

- 🕐 For example, if most of our records for `2022-01-30` land in the raw schema of our warehouse on the morning of `2022-01-31`, but a handful don’t get loaded until `2022-02-02`, how might we tackle that? There will already be `max(updated_at)` timestamps of `2022-01-31` in the warehouse, filtering out those late records. **They’ll never make it to our model.**
- 🪟 To mitigate this, we can add a **lookback window** to our **cutoff** point. By **subtracting a few days** from the `max(updated_at)`, we would capture any late data within the window of what we subtracted.
- 👯 As long as we have a **`unique_key` defined in our config**, we’ll simply update existing rows and avoid duplication. We process more data this way, but the extra amount is fixed by the window size, and it keeps our model hewing closer to the source data.


#### Using state-aware orchestration with incremental models

By default, [state-aware orchestration](/docs/deploy/state-aware-about) detects source freshness by checking warehouse metadata for any new rows. This may cause models to run more often than needed.

To avoid this issue, configure a `loaded_at_field` for a specific timestamp column or use a `loaded_at_query` with custom SQL to tell dbt which field to check for freshness. This helps state-aware orchestration to detect only genuinely new data. For information on how to configure `loaded_at_field` and `loaded_at_query`, refer to [Source freshness](/reference/resource-properties/freshness) and [Advanced configurations](/docs/deploy/state-aware-setup#advanced-configurations).
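
As a minimal sketch of the first option, a source table could point dbt at a load timestamp column as shown here. The source name `raw_orders`, table name `orders`, and column `ingested_at` are illustrative assumptions, and placing the property under `config` simply mirrors the `loaded_at_query` example in this section. The exact placement may vary by dbt version, so treat the Source freshness reference as the authority.

```yaml
sources:
  - name: raw_orders            # hypothetical source name
    tables:
      - name: orders            # hypothetical table name
        config:
          # Assumed column that records when each row was loaded,
          # as opposed to when the underlying event happened.
          loaded_at_field: ingested_at
```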

Even with a `loaded_at_field` or `loaded_at_query`, late-arriving records may have an earlier event timestamp (for example, `event_date`). In this case, state-aware orchestration may skip rebuilding the incremental model, even though your lookback window would normally pick up those records. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter. For example, if your incremental model uses a 3-day lookback window:

```yaml
sources:
  - name: raw_orders
    tables:
      - name: orders
        config:
          loaded_at_query: |
            select max(ingested_at)
            from {{ this }}
            where ingested_at >= current_timestamp - interval '3 days'
```

### Long-term considerations

Late-arriving facts point to the biggest tradeoff with incremental models:
21 changes: 19 additions & 2 deletions website/docs/docs/deploy/state-aware-about.md
@@ -36,10 +36,23 @@ State-aware orchestration does not depend on [static analysis](/docs/fusion/new-

State-aware orchestration uses shared state tracking to determine which models need to be built by detecting changes in code or data every time a job runs. It also supports custom refresh intervals and custom source freshness configurations, so <Constant name="cloud" /> only rebuilds models when they're actually needed.

For example, you can configure your project so that <Constant name="cloud" /> skips rebuilding the dim_wizards model (and its parents) if they’ve already been refreshed within the last 4 hours, even if the job itself runs more frequently.
For example, you can configure your project so that <Constant name="cloud" /> skips rebuilding the `dim_wizards` model (and its parents) if they’ve already been refreshed within the last 4 hours, even if the job itself runs more frequently.

Without configuring anything, <Constant name="cloud" />'s state-aware orchestration automatically knows to build your models either when the code has changed or if there’s any new data in a source (or upstream model in the case of [dbt Mesh](/docs/mesh/about-mesh)).

### Handling concurrent jobs

If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice (once per job) because each job detects a change that triggers a rebuild. However, if nothing has changed since the most recent build, neither job needs to rebuild `model_ab`. They will reuse the already-built `model_ab` instead of rebuilding it again.

Under state-aware orchestration, all jobs read from and write to the same shared state and build a model only when either the code or data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state.

What happens when jobs overlap:

- If both jobs reach the same model at exactly the same time, one job waits until the other finishes. This is to prevent collisions in the data warehouse when two jobs try to build the same model at the same time.
- After the first job finishes, the second job still checks whether a rebuild for the model is needed. The job may choose to reuse the existing result or perform another build, depending on changes detected.

I might not use the language "may choose" here, as it's really more like a rule.

How about something like:
After the first job finishes building the model, the second job still checks whether a rebuild for the model is needed. If there are new data or code changes to incorporate, the model will be built, while if there are no changes and building it will produce the same result, the model will be reused.


If you want to prevent a model from being built too frequently even when the code or data state has changed, use the `build_after` config to reduce build frequency. For information on how to use `build_after`, refer to [Model freshness](/reference/resource-configs/freshness) and [Advanced configurations](/docs/deploy/state-aware-setup#advanced-configurations).
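
For instance, the `dim_wizards` scenario described earlier (rebuild no more often than every 4 hours) might be expressed with a model-level config along these lines. This is a sketch rather than a definitive setup: the file path is an assumption, `updates_on: all` is just one possible choice, and the Model freshness reference remains the authority on where `freshness` can be set.

```yaml
# models/properties.yml (hypothetical file path)
models:
  - name: dim_wizards
    config:
      freshness:
        build_after:
          count: 4          # number of periods that must pass between builds
          period: hour      # unit for count: minute, hour, or day
          updates_on: all   # optional; wait until all upstreams have fresh data
```

With a config like this, a job that runs every hour would still reach `dim_wizards` on each run, but would skip rebuilding it until the 4-hour interval has elapsed.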

## Efficient testing in state-aware orchestration <Lifecycle status="private_beta" />

:::info Private beta feature
@@ -134,8 +147,12 @@ The following section lists some considerations when using Efficient testing in
store_failures: true | false
where: <string>
```

- **Efficient testing is available only in deploy jobs**. CI and merge jobs currently do not have the option to enable this feature.

## Related FAQs

- **Efficient testing is available only in deploy jobs**. CI and merge jobs currently do not have the option to enable this feature.
<FAQ path="Runs/sao-difference-core" />

## Related docs

58 changes: 44 additions & 14 deletions website/docs/docs/deploy/state-aware-setup.md
@@ -92,24 +92,23 @@ import DeleteJob from '/snippets/_delete-job.md';

By default, we use the warehouse metadata to check if sources (or upstream models in the case of Mesh) are fresh. For more advanced use cases, dbt provides other options that enable you to specify what gets run by state-aware orchestration.

You can customize with:
- `loaded_at_field`: Specify a specific column to use from the data.
You can use the following optional parameters to customize your state-aware orchestration:

- `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data.
|Parameter | Description | Allowed values | Supports Jinja |
|----------|-------------| -------------- | -------------- |
| `loaded_at_field` | Specifies a specific column to use from the data. | Name of timestamp column. For example, `created_at`, `"CAST(created_at AS TIMESTAMP)"`. | ✅ |
| `loaded_at_query` | Defines a custom freshness condition in SQL to account for partial loading or streaming data. | SQL string. For example, `"select {{ current_timestamp() }}"`. | ✅ |
| `build_after.count` | Determines how many units of time must pass before a model can be rebuilt to help reduce build frequency. | A positive integer or a Jinja expression. For example, `4` or `"{{ var('build_after_count', 4) }}"`. | ✅ |
| `build_after.period` | The time unit for the count to define the build interval. | `minute`, `hour`, `day`, or a Jinja expression (for example, `"{{ var('build_after_period', 'day') }}"`). | ✅ |
| `build_after.updates_on` | Determines whether a model rebuild is triggered when any upstream dependency has fresh data or only when all upstream dependencies are fresh. | <li>`any` (default) &mdash; Use this value when you want a downstream model to rebuild if _any_ of its upstream dependencies receives fresh data, even if others haven’t.</li> <li>`all` &mdash; Use this value when you want to trigger a rebuild only when _all_ upstream dependencies are fresh &mdash; minimizing unnecessary builds and reducing compute cost. Recommended to use in state-aware orchestration.</li> | ❌ |

Contributor Author

Converted the parameter descriptions to a table format

Contributor

for loaded_at_query - does the sql string need to be wrapped in quotes? how does it support multilines?

Contributor Author

Added multi-line example here: aca2786

If a source is a view in the data warehouse, dbt can’t track updates from the warehouse metadata when the view changes. Without a `loaded_at_field` or `loaded_at_query`, dbt treats the source as "always fresh” and emits a warning during freshness checks. To check freshness for sources that are views, add a `loaded_at_field` or `loaded_at_query` to your configuration.
Keep the following in mind when using `loaded_at_field` or `loaded_at_query`:
- You can define either `loaded_at_field` or `loaded_at_query`, but not both.
- If a source is a view in the data warehouse, dbt can’t track updates from the warehouse metadata when the view changes. Without a `loaded_at_field` or `loaded_at_query`, dbt treats the source as "always fresh" and emits a warning during freshness checks. To check freshness for sources that are views, add a `loaded_at_field` or `loaded_at_query` to your configuration.

:::note
You can either define `loaded_at_field` or `loaded_at_query` but not both.
:::
You can also customize with:
- `updates_on`: Change the default from `any` to `all` so it doesn’t build unless all upstreams have fresh data reducing compute even more.
- `build_after`: Don’t build a model more often than every x period to reduce build frequency when you need data less often than sources are fresh.


To learn more about model freshness and build after, refer to [model `freshness` config](/reference/resource-configs/freshness). To learn more about source and upstream model freshness configs, refer to [resource `freshness` config](/reference/resource-properties/freshness).
To learn more about model freshness and `build_after`, refer to [model `freshness` config](/reference/resource-configs/freshness). To learn more about source and upstream model freshness configs, refer to [resource `freshness` config](/reference/resource-properties/freshness).
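
Because the `build_after.count` and `build_after.period` parameters accept Jinja, the build interval can be driven by project variables. The sketch below reuses the variable names from the parameter table (`build_after_count`, `build_after_period`). The model name `dim_wizards` and the choice of `updates_on: all` are assumptions for illustration.

```yaml
models:
  - name: dim_wizards           # hypothetical model name
    config:
      freshness:
        build_after:
          # Falls back to 4 when the variable is not set.
          count: "{{ var('build_after_count', 4) }}"
          # Falls back to day when the variable is not set.
          period: "{{ var('build_after_period', 'day') }}"
          # Literal value; per the table, updates_on does not support Jinja.
          updates_on: all
```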

## Customizing behavior
### Customizing behavior

You can optionally configure state-aware orchestration when you want to fine-tune orchestration behavior for these reasons:

@@ -142,6 +141,37 @@ You can optionally configure state-aware orchestration when you want to fine-tun
- `model/properties.yml` at the model level in YAML
- `model/model.sql` at the model level in SQL
These configurations are powerful because you can define a sensible default at the project level or for specific model folders, and override it for individual models or model groups that require more frequent updates.
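
As a rough illustration of that layering, a project-level default could be combined with a tighter interval for one folder. The project name `my_project`, the `marts` folder, and the specific intervals are assumptions made up for this sketch.

```yaml
# dbt_project.yml
models:
  my_project:                   # hypothetical project name
    +freshness:
      build_after:              # default for every model in the project
        count: 1
        period: day
    marts:                      # hypothetical folder that needs fresher data
      +freshness:
        build_after:
          count: 4
          period: hour
```

An individual model could then override either interval in its own properties file or in its SQL `config` block.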

### Handling late-arriving data

If your incremental models use a lookback window to capture late-arriving data, make sure your freshness logic aligns with that window.

When you use a `loaded_at_field` or `loaded_at_query`, state-aware orchestration uses that value to determine whether new data has arrived. When the `loaded_at` value reflects an event timestamp (for example, `event_date`), late-arriving records may not update this value if the event occurred in the past. In these cases, state-aware orchestration may not trigger a rebuild, even though your incremental model’s lookback window would normally include those rows.

To ensure late-arriving data is detected by state-aware orchestration, your `loaded_at_field` or `loaded_at_query` should align with the same lookback window used in your incremental filter. See the following sample values for `loaded_at_field` and `loaded_at_query`:
Contributor

should we show what the corresponding incremental filter would look like?


<Tabs>
<TabItem value="loaded_at_field" label="loaded_at_field">

```yaml
loaded_at_field: ingested_at
```
</TabItem>

<TabItem value="loaded_at_query" label="loaded_at_query">

```yaml
loaded_at_query: |
  select max(ingested_at)
  from source_table
  where ingested_at >= current_timestamp - interval '3 days'
```

</TabItem>
</Tabs>

Contributor

this doesnt account for the lookback period? i think the only really valid way to handle this is with the loaded_at_query, no?

Contributor Author

Oh, sorry to have missed that! Modified the docs to instruct users to use loaded_at_query.

Contributor

can we keep the same example from the incremental guide page? should it be {{this}} instead of source_table?

        select max(ingested_at)
        from {{ this }}
        where ingested_at >= current_timestamp - interval '3 days'


## Example

Let's use an example to illustrate how to customize our project so a model and its parent model are rebuilt only if they haven't been refreshed in the past 4 hours &mdash; even if a job runs more frequently than that.
20 changes: 20 additions & 0 deletions website/docs/faqs/Runs/sao-difference-core.md
@@ -0,0 +1,20 @@
---
title: How is state-aware orchestration different from using selectors in dbt Core?
description: "Compare how state-aware orchestration differs from using selectors in dbt Core"
sidebar_label: 'State-aware orchestration vs selectors in dbt Core'
id: sao-difference-core

---

In <Constant name="core" />, running with the selectors `state:modified+` and `source_status:fresher+` builds models that either:

- Have changed since the prior run (`state:modified+`)
- Have upstream sources that are fresher than in the prior run (`source_status:fresher+`)

Instead of relying only on these selectors and prior-run artifacts, state-aware orchestration decides whether to rebuild a model based on:

- Compiled SQL diffs that ignore non-meaningful changes like whitespace and comments
- Upstream data changes at runtime and model-level freshness settings
- Shared state across jobs

While <Constant name="core" /> uses selectors like `state:modified+` and `source_status:fresher+` to decide what to build _only for a single run in a single job_, state-aware orchestration with <Constant name="fusion" /> maintains a _shared, real-time model state across every job in the environment_. It uses that state to determine whether a model’s code or upstream data has actually changed before rebuilding, so dbt only rebuilds models when something has changed, no matter which job runs them.