
Conversation

@luna-bianca luna-bianca commented Nov 28, 2025


vercel bot commented Nov 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
|---------|------------|--------|---------------|
| docs-getdbt-com | Ready | Ready (Preview) | Jan 12, 2026 3:36pm |

@github-actions github-actions bot added the `content` (Improvements or additions to content) label Nov 28, 2025
You can use the following optional parameters to customize your state-aware orchestration:

- `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data.
|Parameter | Description | Allowed values | Supports Jinja |
@luna-bianca luna-bianca (author) commented:

Converted the parameter descriptions to a table format

@luna-bianca luna-bianca marked this pull request as ready for review December 4, 2025 17:10
@luna-bianca luna-bianca requested a review from a team as a code owner December 4, 2025 17:10
@luna-bianca luna-bianca requested a review from reubenmc December 4, 2025 17:11

@reubenmc reubenmc left a comment


Thanks @luna-bianca! @evabgood and I just added some feedback. Things are starting to look great!

- 🕐 For example, if most of our records for `2022-01-30` come into the raw schema of our warehouse on the morning of `2022-01-31`, but a handful don’t get loaded until `2022-02-02`, how might we tackle that? There will already be `max(updated_at)` timestamps of `2022-01-31` in the warehouse, filtering out those late records. **They’ll never make it to our model.**
- 🪟 To mitigate this, we can add a **lookback window** to our **cutoff** point. By **subtracting a few days** from the `max(updated_at)`, we would capture any late data within the window of what we subtracted (see the sketch after this list).
- 👯 As long as we have a **`unique_key` defined in our config**, we’ll simply update existing rows and avoid duplication. We process more data this way, but in a fixed way, and it keeps our model hewing closer to the source data.
- If you're using state-aware orchestration, make sure its freshness detection logic accounts for late-arriving data. By default, dbt uses warehouse metadata, which is updated whenever new rows arrive, even if their event timestamps are in the past. However, if you configure a `loaded_at_field` or `loaded_at_query` that uses an event timestamp (for example, `event_date`), late-arriving data may not increase the `loaded_at` value. In this case, state-aware orchestration may skip rebuilding the incremental model, even though your lookback window would normally pick up those records. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter.
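For illustration, a minimal incremental model with a three-day lookback might look like the sketch below (model, table, and column names are hypothetical):

```sql
-- models/fct_orders.sql (hypothetical model)
{{ config(materialized='incremental', unique_key='order_id') }}

select * from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- Subtract 3 days from the newest timestamp already in this model so that
-- late-arriving rows inside that window are reprocessed; the unique_key
-- config above updates existing rows instead of duplicating them.
where updated_at > (select max(updated_at) - interval '3 days' from {{ this }})
{% endif %}
```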
A reviewer commented:

I feel like this needs to be split into three cases as it's currently confusing. These are suggestions so please edit!

Using State-aware orchestration with Incremental Models

  1. By default, SAO uses warehouse metadata to determine source freshness. This means that dbt will consider a source to have new data whenever a new row arrives. This could lead to running your models more often than ideal.
  2. To avoid this issue, you can instead tell dbt exactly which field to look at for freshness by configuring a `loaded_at_field` for a specific column or a `loaded_at_query` with custom SQL (LINK TO DOCS ON LOADED AT OPTIONS); see the sketch after this list.
  3. Even with a `loaded_at_field` or `loaded_at_query`, late-arriving records may have an earlier event timestamp. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter.
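For instance, a minimal sketch of option 2, assuming `loaded_at_field` is configured at the source level (source, table, and column names hypothetical):

```yaml
sources:
  - name: raw_events                      # hypothetical source
    freshness:
      warn_after: {count: 24, period: hour}
    # dbt checks this column instead of warehouse metadata:
    loaded_at_field: ingested_at
    tables:
      - name: orders
```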

@luna-bianca luna-bianca (author) replied:

Awesome!

- Every macro, variable, or templated logic is resolved before state-aware orchestration checks for changes.
- If you use dynamic content (for example, `{{ run_started_at }}`), state-aware orchestration may detect that as a change even if the “static” SQL template hasn’t changed. This may result in more frequent model rebuilds.
- Any change to a macro definition or templated logic will be treated as a code change, even if the underlying data or SQL structure remains the same.
- If you want to leave comments in your source code without triggering rebuilds, use regular SQL comments (for example, `-- This is a single-line comment in SQL`) in your query. State-aware orchestration ignores comment-only changes, so such annotations will not force model rebuilds across the DAG (see the sketch after this list).
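As a quick illustration (hypothetical model): editing the comments below is ignored, while the templated `{{ run_started_at }}` value resolves differently on every run and therefore registers as a code change:

```sql
-- models/daily_summary.sql (hypothetical)
-- Editing this comment is a comment-only change: no rebuild is triggered.

select
    order_date,
    count(*) as order_count,
    -- This templated value changes on every invocation, so the resolved SQL
    -- differs each run and state-aware orchestration treats it as a change:
    '{{ run_started_at }}' as run_started
from {{ ref('stg_orders') }}
group by order_date
```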
A reviewer commented:

This is currently true; however, this should change in a couple of weeks, so it's probably not worth updating right now. Instead, once the change goes out, this should be updated to reflect the new behavior.

https://www.notion.so/dbtlabs/Code-changes-for-non-deterministic-SQL-2a4bb38ebda7807386f6ee38e5b0f892?source=copy_link

Detecting code changes

  1. We first look for changes in the pre-rendered SQL (like Mantle/Core does).
  2. If and only if there is a change, we look at the post-compiled SQL (with whitespace and comments stripped out, as we currently do for Fusion).

@luna-bianca luna-bianca (author) replied:

Removed Detecting code changes section for now

A reviewer replied:

Sounds good


### Handling concurrent jobs

If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice — once per job.
A reviewer commented:

Clarify: only if something has changed, though. If nothing has changed, then the second job will simply reuse `model_ab`.

@luna-bianca luna-bianca (author) replied:

If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice — once per job.

Under state-aware orchestration, each job independently evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state. It does not enforce a single build per model across different jobs.
A reviewer commented:

I don't like this. This is really more like:

Under state-aware orchestration, all jobs read and write from the same shared state and build a model only when either the code or data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state.

Could also add: If you want to prevent a model from being built too frequently even when the code or data state has changed, you can slow down any model by using the `build_after` config (LINK TO DOCS ON HOW TO DO THIS); a sketch follows.
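A minimal sketch of that config, assuming the `build_after` freshness syntax (model name hypothetical):

```yaml
models:
  - name: model_ab                # hypothetical model
    config:
      freshness:
        build_after:
          count: 4                # wait at least 4 hours between rebuilds...
          period: hour
          updates_on: any         # ...and rebuild when any upstream has updates
```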

@luna-bianca luna-bianca (author) replied:

- Upstream data changes at runtime and model-level freshness settings
- Shared state across jobs

This helps avoid unnecessary rebuilds when underlying source files change without changing the compiled logic, while still rebuilding when upstream data changes require it.
A reviewer commented:

To add: While Core did these for a single run in a single job, SAO with Fusion does this in real time across every job in the environment to manage state and ensure you're not building any models when things haven't changed, no matter which job a model is built in.


@reubenmc reubenmc left a comment


This looks great! Thanks @luna-bianca! Good to merge this anytime. cc: @evabgood for vis.

What happens when jobs overlap:

- If both jobs reach the same model at exactly the same time, one job waits until the other finishes. This is to prevent collisions in the data warehouse when two jobs try to build the same model at the same time.
- After the first job finishes, the second job still checks whether a rebuild for the model is needed. The job may choose to reuse the existing result or perform another build, depending on changes detected.

A reviewer commented:

I might not use the language "may choose" here, as it's really more like a rule.

How about something like:
After the first job finishes building the model, the second job still checks whether a rebuild for the model is needed. If there are new data or code changes to incorporate, the model will be built; if there are no changes and building it would produce the same result, the model will be reused.

| Parameter | Description | Allowed values | Supports Jinja |
|-----------|-------------|----------------|----------------|
| `loaded_at_field` | Specifies the column to use from the data. | Name of a timestamp column. For example, `created_at`, `"CAST(created_at AS TIMESTAMP)"`. | |
| `loaded_at_query` | Defines a custom freshness condition in SQL to account for partial loading or streaming data. | SQL string. For example, `"select {{ current_timestamp() }}"`. | |
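For example, the single-line form is quoted so YAML treats the Jinja-templated SQL as a plain string. A minimal sketch, assuming `loaded_at_query` is set at the source level like `loaded_at_field` (source and table names hypothetical); a multi-line block-scalar form appears later in this thread:

```yaml
sources:
  - name: raw_events   # hypothetical source
    loaded_at_query: "select {{ current_timestamp() }}"
    tables:
      - name: orders
```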
A contributor commented:

For `loaded_at_query`: does the SQL string need to be wrapped in quotes? How does it support multiple lines?

@luna-bianca luna-bianca (author) replied:

Added multi-line example here: aca2786


### Handling late-arriving data

If your incremental models use a lookback window to capture late-arriving data, make sure your freshness logic aligns with that window.

@luna-bianca luna-bianca (author) replied:


When you use a `loaded_at_field` or `loaded_at_query`, state-aware orchestration uses that value to determine whether new data has arrived. When the `loaded_at` value reflects an event timestamp (for example, `event_date`), late-arriving records may not update this value if the event occurred in the past. In these cases, state-aware orchestration may not trigger a rebuild, even though your incremental model’s lookback window would normally include those rows.

To ensure late-arriving data is detected by state-aware orchestration, your `loaded_at_field` or `loaded_at_query` should align with the same lookback window used in your incremental filter. See the following sample values for `loaded_at_field` and `loaded_at_query`:
A contributor commented:

should we show what the corresponding incremental filter would look like?
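For reference, a matching filter might look like the sketch below, assuming an `ingested_at` column and the same three-day lookback as the `loaded_at_query` shown later in this thread:

```sql
{% if is_incremental() %}
-- Same 3-day lookback as the freshness query, so the freshness check and
-- the incremental filter cover the same late-arriving rows.
where ingested_at > (select max(ingested_at) - interval '3 days' from {{ this }})
{% endif %}
```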

@luna-bianca luna-bianca (author) replied:

<TabItem value="loaded_at_query" label="loaded_at_query">

```yaml
loaded_at_query: |
  select max(ingested_at)
  from source_table
  where ingested_at >= current_timestamp - interval '3 days'
```

</TabItem>
A contributor commented:

can we keep the same example from the incremental guide page? should it be {{this}} instead of source_table?

        select max(ingested_at)
        from {{ this }}
        where ingested_at >= current_timestamp - interval '3 days'

@luna-bianca luna-bianca (author) replied:

<TabItem value="loaded_at_field" label="loaded_at_field">

```yaml
loaded_at_field: ingested_at
```

</TabItem>
A contributor commented:

This doesn't account for the lookback period? I think the only really valid way to handle this is with `loaded_at_query`, no?

@luna-bianca luna-bianca (author) replied:

Oh, sorry to have missed that! Modified the docs to instruct users to use loaded_at_query.

@luna-bianca luna-bianca merged commit 5234489 into current Jan 13, 2026
9 checks passed
@luna-bianca luna-bianca deleted the SAO-doc-improvements branch January 13, 2026 10:46