SAO doc improvements #8234
Conversation
> You can use the following optional parameters to customize your state-aware orchestration:
>
> - `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data.
>
> |Parameter | Description | Allowed values | Supports Jinja |
Converted the parameter descriptions to a table format
reubenmc left a comment
Thanks @luna-bianca! @evabgood and I just added some feedback. Things are starting to look great!
> - 🕐 For example, if most of our records for `2022-01-30` come in the raw schema of our warehouse on the morning of `2022-01-31`, but a handful don’t get loaded until `2022-02-02`, how might we tackle that? There will already be `max(updated_at)` timestamps of `2022-01-31` in the warehouse, filtering out those late records. **They’ll never make it to our model.**
> - 🪟 To mitigate this, we can add a **lookback window** to our **cutoff** point. By **subtracting a few days** from the `max(updated_at)`, we would capture any late data within the window of what we subtracted.
> - 👯 As long as we have a **`unique_key` defined in our config**, we’ll simply update existing rows and avoid duplication. We process more data this way, but in a fixed way, and it keeps our model hewing closer to the source data.
> - If you're using state-aware orchestration, make sure its freshness detection logic accounts for late-arriving data. By default, dbt uses warehouse metadata, which is updated whenever new rows arrive, even if their event timestamps are in the past. However, if you configure a `loaded_at_field` or `loaded_at_query` that uses an event timestamp (for example, `event_date`), late-arriving data may not increase the `loaded_at` value. In this case, state-aware orchestration may skip rebuilding the incremental model, even though your lookback window would normally pick up those records. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter.
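To make that alignment concrete, here is a sketch of a `loaded_at_query` that applies the same lookback idea. A three-day window and the `ingested_at` column are assumptions for illustration, not from the diff above:

```yaml
freshness:
  # Only consider data "fresh" based on rows inside the same
  # three-day lookback window the incremental filter uses
  loaded_at_query: |
    select max(ingested_at)
    from {{ this }}
    where ingested_at >= current_timestamp - interval '3 days'
```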
I feel like this needs to be split into three cases as it's currently confusing. These are suggestions so please edit!
Using State-aware orchestration with Incremental Models
- By default, SAO uses dbt warehouse metadata to determine source freshness. This means that dbt will consider a source to have new data whenever a new row arrives. This could lead to running your models more often than ideal.
- To avoid this issue, you can instead tell dbt exactly which field to look at for freshness by configuring a `loaded_at_field` for a specific column or a `loaded_at_query` with custom SQL (LINK TO DOCS ON LOADED AT OPTIONS).
- Even with a `loaded_at_field` or `loaded_at_query`, late-arriving records may have an earlier event timestamp. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter.
Awesome!
> - Every macro, variable, or templated logic is resolved before state-aware orchestration checks for changes.
> - If you use dynamic content (for example, `{{ run_started_at }}`), state-aware orchestration may detect that as a change even if the “static” SQL template hasn’t changed. This may result in more frequent model rebuilds.
> - Any change to a macro definition or templated logic will be treated as a code change, even if the underlying data or SQL structure remains the same.
> - If you want to leave comments in your source code but don’t want to trigger rebuilds, it is recommended to use regular SQL comments (for example, `-- This is a single-line comment in SQL`) in your query. State-aware orchestration ignores comment-only changes; such annotations will not force model rebuilds across the DAG.
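As an illustration of the distinction above (the model and column names here are hypothetical):

```sql
-- Editing this comment alone will not trigger a rebuild:
-- state-aware orchestration ignores comment-only changes.
select
    order_id,
    amount,
    -- By contrast, a templated value such as {{ run_started_at }} used in
    -- live SQL resolves to a new value on every run, so it can register as
    -- a code change and force more frequent rebuilds.
    current_timestamp as processed_at
from {{ ref('stg_orders') }}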
This is currently true, however this should change in a couple of weeks, so it's probably not worth updating right now. Instead, this should (once it goes out) be added to reflect the new behavior.
Detecting code changes
- We first look for changes in the pre-rendered SQL (like Mantle/Core does)
- iff there is a change, we look at the post-compiled SQL (with whitespace and comments stripped out, like we do for Fusion currently)
Removed Detecting code changes section for now
Sounds good
> ### Handling concurrent jobs
>
> If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice — once per job.
Clarify: only if something has changed, though. If nothing has changed, then the second job will simply reuse `model_ab`.
> Under state-aware orchestration, each job independently evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state. It does not enforce a single build per model across different jobs.
I don't like this. This is really more like:
Under state-aware orchestration, all jobs read from and write to the same shared state, and a model is built only when either the code or the data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state.
Could also add: If you want to prevent a model from being rebuilt too frequently even when the code or data state has changed, you can slow down any model by using the `build_after` config (LINK TO DOCS ON HOW TO DO THIS).
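A sketch of what that `build_after` configuration might look like in a model's YAML. The model name and values are hypothetical, and the exact schema should be checked against the dbt docs:

```yaml
models:
  - name: model_ab
    config:
      freshness:
        build_after:
          count: 4          # wait at least 4 of the periods below between rebuilds
          period: hour
          updates_on: any   # rebuild when any upstream input has changed
```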
And added build_after paragraph here: https://github.com/dbt-labs/docs.getdbt.com/pull/8234/changes#diff-ad798a159c003c98c28f29456ba1d0e295b58d33c976f5ed18c07c567f822080R54
> - Upstream data changes at runtime and model-level freshness settings
> - Shared state across jobs
>
> This helps avoid unnecessary rebuilds when underlying source files change without changing the compiled logic, while still rebuilding when upstream data changes require it.
To add: While Core did these for a single run in a single job, SAO with Fusion does this in real time across every job in the environment to manage state and ensure you're not building any models when things haven't changed, no matter which job a model is built in.
reubenmc left a comment
This looks great! Thanks @luna-bianca! Good to merge this anytime. cc: @evabgood for vis.
> What happens when jobs overlap:
>
> - If both jobs reach the same model at exactly the same time, one job waits until the other finishes. This is to prevent collisions in the data warehouse when two jobs try to build the same model at the same time.
> - After the first job finishes, the second job still checks whether a rebuild for the model is needed. The job may choose to reuse the existing result or perform another build, depending on changes detected.
I might not use the language "may choose" here, as it's really more like a rule.
How about something like:
After the first job finishes building the model, the second job still checks whether a rebuild for the model is needed. If there are new data or code changes to incorporate, the model will be rebuilt; if there are no changes and building it would produce the same result, the model is reused.
> |Parameter | Description | Allowed values | Supports Jinja |
> |----------|-------------| -------------- | -------------- |
> | `loaded_at_field` | Specifies the column to use from the source data. | Name of a timestamp column. For example, `created_at`, `"CAST(created_at AS TIMESTAMP)"`. | ✅ |
> | `loaded_at_query` | Defines a custom freshness condition in SQL to account for partial loading or streaming data. | SQL string. For example, `"select {{ current_timestamp() }}"`. | ✅ |
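On the multi-line question: a multi-line query can be written as a YAML block scalar (`|`), which sidesteps quoting entirely. A sketch, with a hypothetical source name and column, and the exact config placement to be confirmed against the dbt freshness docs:

```yaml
sources:
  - name: raw_app_data        # hypothetical source name
    freshness:
      # Block scalar: no surrounding quotes needed for the SQL
      loaded_at_query: |
        select max(ingested_at)
        from {{ this }}
```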
for loaded_at_query - does the sql string need to be wrapped in quotes? how does it support multilines?
Added multi-line example here: aca2786
Co-authored-by: Mirna Wong <[email protected]>
> ### Handling late-arriving data
>
> If your incremental models use a lookback window to capture late-arriving data, make sure your freshness logic aligns with that window.
should we link to here to connect the two concepts? https://docs-getdbt-com-git-sao-doc-improvements-dbt-labs.vercel.app/best-practices/materializations/4-incremental-models#late-arriving-facts
> When you use a `loaded_at_field` or `loaded_at_query`, state-aware orchestration uses that value to determine whether new data has arrived. When the `loaded_at` value reflects an event timestamp (for example, `event_date`), late-arriving records may not update this value if the event occurred in the past. In these cases, state-aware orchestration may not trigger a rebuild, even though your incremental model’s lookback window would normally include those rows.
>
> To ensure late-arriving data is detected by state-aware orchestration, your `loaded_at_field` or `loaded_at_query` should align with the same lookback window used in your incremental filter. See the following sample values for `loaded_at_field` and `loaded_at_query`:
should we show what the corresponding incremental filter would look like?
> <TabItem value="loaded_at_query" label="loaded_at_query">
>
> ```yaml
> loaded_at_query: |
can we keep the same example from the incremental guide page? should it be `{{ this }}` instead of `source_table`?

```sql
select max(ingested_at)
from {{ this }}
where ingested_at >= current_timestamp - interval '3 days'
```
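For context, an incremental model filter using the same three-day lookback might look like the following sketch. The `stg_events` model and `ingested_at` column are hypothetical stand-ins:

```sql
select * from {{ ref('stg_events') }}

{% if is_incremental() %}
-- Lookback window: reprocess the trailing 3 days so late-arriving
-- rows are picked up and deduplicated via the model's unique_key
where ingested_at >= (
    select max(ingested_at) - interval '3 days'
    from {{ this }}
)
{% endif %}
```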
> <TabItem value="loaded_at_field" label="loaded_at_field">
>
> ```yaml
> loaded_at_field: ingested_at
this doesn't account for the lookback period? i think the only really valid way to handle this is with `loaded_at_query`, no?
Oh, sorry to have missed that! Modified the docs to instruct users to use loaded_at_query.