SAO doc improvements #8234
Changes from 16 commits
fa396d9
e22a2b6
2e0a572
f358e31
740f1b9
53aeeaf
ef09f5b
60ea3b4
7e356e0
451e1fb
cb640d8
0c06e69
aac82aa
11dfef3
b20620f
15defd7
8f60d4d
2bfc490
a1e1190
4249b04
153bcc1
aca2786
b6d035e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,10 +36,23 @@ State-aware orchestration does not depend on [static analysis](/docs/fusion/new- | |
|
|
||
| State-aware orchestration uses shared state tracking to determine which models need to be built by detecting changes in code or data every time a job runs. It also supports custom refresh intervals and custom source freshness configurations, so <Constant name="cloud" /> only rebuilds models when they're actually needed. | ||
|
|
||
| For example, you can configure your project so that <Constant name="cloud" /> skips rebuilding the dim_wizards model (and its parents) if they’ve already been refreshed within the last 4 hours, even if the job itself runs more frequently. | ||
| For example, you can configure your project so that <Constant name="cloud" /> skips rebuilding the `dim_wizards` model (and its parents) if they’ve already been refreshed within the last 4 hours, even if the job itself runs more frequently. | ||
|
|
||
| Without configuring anything, <Constant name="cloud" />'s state-aware orchestration automatically knows to build your models either when the code has changed or if there’s any new data in a source (or upstream model in the case of [dbt Mesh](/docs/mesh/about-mesh)). | ||
|
|
||
| ### Handling concurrent jobs | ||
|
|
||
| If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice (once per job) because each job detects a change that triggers a rebuild. However, if nothing has changed since the most recent build, neither job needs to rebuild `model_ab`. They will reuse the already built `model_ab` instead of rebuilding it again. | ||
|
|
||
| Under state-aware orchestration, all jobs read from and write to the same shared state and build a model only when either the code or data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state. | ||
|
|
||
| What happens when jobs overlap: | ||
|
|
||
| - If both jobs reach the same model at exactly the same time, one job waits until the other finishes. This is to prevent collisions in the data warehouse when two jobs try to build the same model at the same time. | ||
| - After the first job finishes, the second job still checks whether a rebuild for the model is needed. The job may choose to reuse the existing result or perform another build, depending on changes detected. | ||
|
||
|
|
||
| If you want to prevent a model from being rebuilt too frequently even when the code or data state has changed, you can reduce build frequency by using the `build_after` config. For information on how to use `build_after`, refer to [Model freshness](/reference/resource-configs/freshness) and [Advanced configurations](/docs/deploy/state-aware-setup#advanced-configurations). | ||
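
As a rough sketch of what a `build_after` config could look like for the `model_ab` example in this section: the file path and the 4-hour interval are illustrative assumptions, not values taken from this PR.

```yaml
# models/properties.yml (hypothetical path)
models:
  - name: model_ab
    config:
      freshness:
        build_after:
          count: 4          # assumed interval, for illustration only
          period: hour
          updates_on: any   # the default: rebuild when any upstream has fresh data
```

With a setting along these lines, a job that reaches `model_ab` less than 4 hours after its last build would be expected to reuse the existing result rather than rebuild it.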
|
||
|
|
||
| ## Efficient testing in state-aware orchestration <Lifecycle status="private_beta" /> | ||
|
|
||
| :::info Private beta feature | ||
|
|
@@ -134,8 +147,12 @@ The following section lists some considerations when using Efficient testing in | |
| store_failures: true | false | ||
| where: <string> | ||
| ``` | ||
|
|
||
| - **Efficient testing is available only in deploy jobs**. CI and merge jobs currently do not have the option to enable this feature. | ||
|
|
||
| ## Related FAQs | ||
|
|
||
| - **Efficient testing is available only in deploy jobs**. CI and merge jobs currently do not have the option to enable this feature. | ||
| <FAQ path="Runs/sao-difference-core" /> | ||
|
|
||
| ## Related docs | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -92,24 +92,23 @@ import DeleteJob from '/snippets/_delete-job.md'; | |
|
|
||
| By default, we use the warehouse metadata to check if sources (or upstream models in the case of Mesh) are fresh. For more advanced use cases, dbt provides other options that enable you to specify what gets run by state-aware orchestration. | ||
|
|
||
| You can customize with: | ||
| - `loaded_at_field`: Specify a specific column to use from the data. | ||
| You can use the following optional parameters to customize your state-aware orchestration: | ||
|
|
||
| - `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data. | ||
| |Parameter | Description | Allowed values | Supports Jinja | | ||
|
Review comment (author): Converted the parameter descriptions to a table format.
|
||
| |----------|-------------| -------------- | -------------- | | ||
| | `loaded_at_field` | Specifies the column to use from the data. | Name of timestamp column. For example, `created_at`, `"CAST(created_at AS TIMESTAMP)"`. | ✅ | | ||
| | `loaded_at_query` | Defines a custom freshness condition in SQL to account for partial loading or streaming data. | SQL string. For example, `"select {{ current_timestamp() }}"`. | ✅ | | ||
|
||
| | `build_after.count` | Determines how many units of time must pass before a model can be rebuilt to help reduce build frequency. | A positive integer or a Jinja expression. For example, `4` or `"{{ var('build_after_count', 4) }}"`. | ✅ | | ||
| | `build_after.period` | The time unit for the count to define the build interval. | `minute`, `hour`, `day`, or a Jinja expression (for example, `"{{ var('build_after_period', 'day') }}"`). | ✅ | | ||
| | `build_after.updates_on` | Determines whether a model rebuild is triggered when any upstream dependency has fresh data or only when all upstream dependencies are fresh. | <li>`any` (default) — Use this value when you want a downstream model to rebuild if _any_ of its upstream dependencies receives fresh data, even if others haven’t.</li> <li>`all` — Use this value when you want to trigger a rebuild only when _all_ upstream dependencies are fresh — minimizing unnecessary builds and reducing compute cost. Recommended to use in state-aware orchestration.</li> | ❌ | | ||
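
A sketch of how the `build_after` parameters in the table might be combined at the project level, reusing the Jinja `var()` expressions from the "Allowed values" column. The project name `my_project` is a placeholder, and `updates_on` is written as a literal because the table marks it as not supporting Jinja.

```yaml
# dbt_project.yml (sketch; `my_project` is a hypothetical project name)
models:
  my_project:
    +freshness:
      build_after:
        # Falls back to 4 when the variable is not supplied
        count: "{{ var('build_after_count', 4) }}"
        # Falls back to 'day' when the variable is not supplied
        period: "{{ var('build_after_period', 'day') }}"
        # Literal value only; Jinja is not supported for this parameter
        updates_on: all
```

Driving `count` and `period` from variables is one way to use different intervals per environment without editing the config itself.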
|
|
||
| If a source is a view in the data warehouse, dbt can’t track updates from the warehouse metadata when the view changes. Without a `loaded_at_field` or `loaded_at_query`, dbt treats the source as "always fresh” and emits a warning during freshness checks. To check freshness for sources that are views, add a `loaded_at_field` or `loaded_at_query` to your configuration. | ||
| Some notes when using `loaded_at_field` or `loaded_at_query`: | ||
| - You can define either `loaded_at_field` or `loaded_at_query`, but not both. | ||
| - If a source is a view in the data warehouse, dbt can’t track updates from the warehouse metadata when the view changes. Without a `loaded_at_field` or `loaded_at_query`, dbt treats the source as "always fresh" and emits a warning during freshness checks. To check freshness for sources that are views, add a `loaded_at_field` or `loaded_at_query` to your configuration. | ||
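
For a source that is a view, a minimal sketch of adding `loaded_at_field` might look like the following. The source, table, and column names are hypothetical, and depending on your dbt version `loaded_at_field` may need to be set as a top-level property instead of under `config`.

```yaml
# models/sources.yml (hypothetical path and names)
sources:
  - name: raw_app
    tables:
      - name: orders_view               # assumed to be a view in the warehouse
        config:
          loaded_at_field: updated_at   # assumed timestamp column on the view
```

Only one of `loaded_at_field` or `loaded_at_query` should appear for a given source or table.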
|
|
||
| :::note | ||
| You can either define `loaded_at_field` or `loaded_at_query` but not both. | ||
| ::: | ||
| You can also customize with: | ||
| - `updates_on`: Change the default from `any` to `all` so it doesn’t build unless all upstreams have fresh data reducing compute even more. | ||
| - `build_after`: Don’t build a model more often than every x period to reduce build frequency when you need data less often than sources are fresh. | ||
|
|
||
|
|
||
| To learn more about model freshness and build after, refer to [model `freshness` config](/reference/resource-configs/freshness). To learn more about source and upstream model freshness configs, refer to [resource `freshness` config](/reference/resource-properties/freshness). | ||
| To learn more about model freshness and `build_after`, refer to [model `freshness` config](/reference/resource-configs/freshness). To learn more about source and upstream model freshness configs, refer to [resource `freshness` config](/reference/resource-properties/freshness). | ||
|
|
||
| ## Customizing behavior | ||
| ### Customizing behavior | ||
|
|
||
| You can optionally configure state-aware orchestration when you want to fine-tune orchestration behavior for these reasons: | ||
|
|
||
|
|
@@ -142,6 +141,37 @@ You can optionally configure state-aware orchestration when you want to fine-tun | |
| - `model/properties.yml` at the model level in YAML | ||
| - `model/model.sql` at the model level in SQL | ||
| These configurations are powerful because you can define a sensible default at the project level or for specific model folders, and override it for individual models or model groups that require more frequent updates. | ||
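
To illustrate the default-plus-override pattern described above, here is a sketch of a project-level default with a more frequent interval for one folder. The project name, folder names, and intervals are assumptions chosen for illustration.

```yaml
# dbt_project.yml (sketch; names and intervals are hypothetical)
models:
  my_project:
    # Default for every model in the project: at most one build per day
    +freshness:
      build_after:
        count: 1
        period: day
        updates_on: all
    marts:
      finance:
        # Override for one folder that needs fresher data
        +freshness:
          build_after:
            count: 1
            period: hour
            updates_on: all
```

The override repeats the full `build_after` block so that it does not rely on any particular merge behavior between the two levels.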
|
|
||
| ### Handling late-arriving data | ||
|
|
||
| If your incremental models use a lookback window to capture late-arriving data, make sure your freshness logic aligns with that window. | ||
|
||
|
|
||
| When you use a `loaded_at_field` or `loaded_at_query`, state-aware orchestration uses that value to determine whether new data has arrived. When the `loaded_at` value reflects an event timestamp (for example, `event_date`), late-arriving records may not update this value if the event occurred in the past. In these cases, state-aware orchestration may not trigger a rebuild, even though your incremental model’s lookback window would normally include those rows. | ||
|
|
||
| To ensure late-arriving data is detected by state-aware orchestration, your `loaded_at_field` or `loaded_at_query` should align with the same lookback window used in your incremental filter. See the following sample values for `loaded_at_field` and `loaded_at_query`: | ||
|
||
|
|
||
|
|
||
| <Tabs> | ||
| <TabItem value="loaded_at_field" label="loaded_at_field"> | ||
|
|
||
| ```yaml | ||
| loaded_at_field: ingested_at | ||
|
||
| ``` | ||
| </TabItem> | ||
|
|
||
| <TabItem value="loaded_at_query" label="loaded_at_query"> | ||
|
|
||
| ```yaml | ||
| loaded_at_query: | | ||
|
Review comment (contributor): Can we keep the same example from the incremental guide page? Should it be `{{this}}` instead of `source_table`?
|
||
| select max(ingested_at) | ||
| from source_table | ||
| where ingested_at >= current_timestamp - interval '3 days' | ||
| ``` | ||
|
|
||
| </TabItem> | ||
| </Tabs> | ||
|
|
||
|
|
||
| ## Example | ||
|
|
||
| Let's use an example to illustrate how to customize our project so a model and its parent model are rebuilt only if they haven't been refreshed in the past 4 hours — even if a job runs more frequently than that. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| --- | ||
| title: How is state-aware orchestration different from using selectors in dbt Core? | ||
| description: "Compare how state-aware orchestration differs from using selectors in dbt Core" | ||
| sidebar_label: 'State-aware orchestration vs selectors in dbt Core' | ||
| id: sao-difference-core | ||
|
|
||
| --- | ||
|
|
||
| In <Constant name="core" />, running with the selectors `state:modified+` and `source_status:fresher+` builds models that either: | ||
|
|
||
| - Have changed since the prior run (`state:modified+`) | ||
| - Have upstream sources that are fresher than in the prior run (`source_status:fresher+`) | ||
|
|
||
| Instead of relying only on these selectors and prior-run artifacts, state-aware orchestration decides whether to rebuild a model based on: | ||
|
|
||
| - Compiled SQL diffs that ignore non-meaningful changes like whitespace and comments | ||
| - Upstream data changes at runtime and model-level freshness settings | ||
| - Shared state across jobs | ||
|
|
||
| While <Constant name="core" /> uses selectors like `state:modified+` and `source_status:fresher+` to decide what to build _only for a single run in a single job_, state-aware orchestration with <Constant name="fusion" /> maintains a _shared, real-time model state across every job in the environment_ and uses that state to determine whether a model’s code or upstream data have actually changed before rebuilding. This ensures dbt only rebuilds models when something has changed, no matter which job runs them. |