[OSO-921] add documentation on dagster about SQLMesh and Seed data (#5069)
* docs(dagster): add documentation on dagster about SQLMesh and Seed data
* docs(dagster): fix PR comments
* docs(dagster): add doc about build time in dagster
apps/docs/docs/contribute-data/setup/index.md
Lines changed: 38 additions & 5 deletions
@@ -130,10 +130,11 @@ For more details on contributing to OSO, check out

 ### Verify deployment

-Our Dagster deployment should automatically recognize the asset
-after merging your pull request to the main branch.
-You should be able to find your new asset
-in the [global asset list](https://dagster.opensource.observer/assets).
+After your pull request is merged into the `main` branch, a new Dagster deployment is automatically triggered. This process typically takes 10-15 minutes.
+
+You can monitor the deployment status and check the last update time for each code location in the [Deployment tab](https://dagster.opensource.observer/locations).
+
+Once the deployment is complete, Dagster will automatically recognize your new asset, and it will appear in the [Global Asset List](https://dagster.opensource.observer/assets).

This guide outlines common operations for working with [Dagster](https://dagster.io/) in the OSO project.

## SQLMesh Integration

Our main data models are materialized using SQLMesh. In most cases, you can trigger the `sqlmesh_all_assets` job with its default configuration to update the models.

### Restatements

:::warning
**DO NOT RESTATE SQLMesh models without approval!**
:::

If you need to run a restatement, edit the configuration of the `sqlmesh_all_assets` job: select the dropdown menu next to the **Materialize all** button and click **Open launchpad**.

There are two ways to specify which models to restate:

- **By Entity Category**: Set `restate_by_entity_category: true` and specify a list of categories to restate. You can assign categories to models using the `entity_category=category_name` tag.
- **By Model Name**: Provide a list of model names under the `restate_models` configuration. Remember to prefix each model name with `oso.`, for example: `oso.int_events__superchain_internal_transactions`. When using this method, all SQLMesh [model selection features](https://sqlmesh.readthedocs.io/en/stable/guides/model_selection/) can be used.

Dagster jobs retry up to three times by default, and every retry reuses the configuration of the original run. If a job fails mid-process, cancel the retry and trigger a new run with the correct configuration to avoid restating models multiple times.

An example configuration:

```yaml
ops:
  sqlmesh_project:
    config:
      restate_by_entity_category: false
      restate_models:
        - oso.stg_github__XYZ
      skip_tests: false
      use_dev_environment: false
```

This restates the `stg_github__XYZ` staging model and all of its downstream SQLMesh models in the warehouse.
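
The same run configuration can also be assembled in code, which makes it harder to forget the `oso.` prefix. The following is a minimal sketch and not part of the OSO codebase: `build_restate_config` is a hypothetical helper, and the dictionary it returns simply mirrors the keys of the YAML example. The check is deliberately narrow: it only rejects bare names such as `stg_github__XYZ` and lets other SQLMesh selectors through untouched.

```python
import re
from typing import Any

# A bare identifier with no schema, e.g. "stg_github__XYZ".
BARE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_restate_config(selectors: list[str]) -> dict[str, Any]:
    """Build a run config for restating by model name (hypothetical helper)."""
    if not selectors:
        raise ValueError("at least one model selector is required")
    for selector in selectors:
        # Plain model names must be written as `oso.<model_name>`.
        if BARE_NAME.match(selector):
            raise ValueError(f"{selector!r} is missing the 'oso.' prefix")
    return {
        "ops": {
            "sqlmesh_project": {
                "config": {
                    "restate_by_entity_category": False,
                    "restate_models": list(selectors),
                    "skip_tests": False,
                    "use_dev_environment": False,
                }
            }
        }
    }


config = build_restate_config(["oso.stg_github__XYZ"])
print(config["ops"]["sqlmesh_project"]["config"]["restate_models"])
```

The resulting dictionary has the same structure as the YAML you would paste into the launchpad, so it can be reviewed before a restatement is approved.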

### Branching with Tags

We use Nessie's branching feature to ensure data consumers always have access to stable data. We maintain a `consumer` tag that points to a stable version of the data for public API consumers, while the `main` branch is actively updated.

Our producer, Trino, has two catalogs:

- `iceberg`: Points to the `main` branch.
- `iceberg_consumer`: Points to the `consumer` tag.

After a successful run of the `sqlmesh_all_assets` job and data verification, run the `nessie_consumer_tag_job` to update the `consumer` tag to the latest `main` commit. You can also specify a particular hash in the `to_hash` configuration to move the tag to a specific commit.
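
One possible way to approach the data verification step is to compare the same table in both catalogs before moving the tag. The sketch below only builds the SQL text and leaves execution to whichever Trino client you use. The schema and table names (`oso`, `example_table`) and the `row_count_queries` helper are assumptions for illustration, not names taken from the OSO warehouse.

```python
# Catalog names come from the list above; the keys are just labels.
CATALOGS = {"main": "iceberg", "consumer": "iceberg_consumer"}


def row_count_queries(schema: str, table: str) -> dict[str, str]:
    """Return one COUNT(*) query per catalog for the same schema and table."""
    for identifier in (schema, table):
        # Keep the sketch safe: only accept simple identifiers.
        if not identifier.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {identifier!r}")
    return {
        label: f'SELECT COUNT(*) AS row_count FROM "{catalog}"."{schema}"."{table}"'
        for label, catalog in CATALOGS.items()
    }


# Hypothetical schema and table, used only to show the generated SQL.
for label, sql in row_count_queries("oso", "example_table").items():
    print(f"{label}: {sql}")
```

A large gap between the two counts is expected while `main` is ahead of `consumer`. The comparison is most useful as a sanity check that `main` has not unexpectedly lost data before the tag is moved.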

## Asset Development Workflow

When creating a new Dagster asset, also write a seed file before integrating the asset into SQLMesh.

The workflow is as follows:

1. **Write the Asset**:
   - Follow the cursor rules for creating new assets.
   - Keep column names consistent with the original source.
   - Perform minimal normalization and unnesting.

2. **Run Dagster Locally**:
   - Confirm that you can materialize the source correctly.

3. **Submit and Merge a PR**:
   - Submit a pull request with your changes and merge it into production.

4. **Materialize in Production**:
   - Materialize the asset in the production Dagster environment.

5. **Verify Data**:
   - Sample the data in BigQuery to confirm it's correct.

6. **Create Seed File and Staging Model**:
   - Follow the cursor rules for creating seed files and staging models.
   - Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
   - If there are date fields, set them to `datetime.now()`.
   - Test locally with SQLMesh until there are no errors.

7. **Submit and Merge a PR**:
   - Submit a pull request with the seed file and staging model and merge it into production.
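
The seed data described in step 6 might look roughly like the sketch below. This is an illustration only: the actual seed file format is defined by the project's cursor rules, and the `ExampleRow` schema and its values are hypothetical placeholders, not real rows sampled from BigQuery. It shows the intended shape: a handful of rows that cover different cases, with date fields set to `datetime.now()`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExampleRow:
    """Hypothetical row schema; real column names should match the source."""

    id: str
    name: Optional[str]
    star_count: int
    updated_at: datetime


# Date fields are set to the current time, as the workflow recommends.
now = datetime.now()

# Placeholder values only; a real seed uses 5-10 rows sampled from BigQuery.
seed_rows = [
    ExampleRow(id="1", name="typical", star_count=42, updated_at=now),
    ExampleRow(id="2", name=None, star_count=7, updated_at=now),  # missing name
    ExampleRow(id="3", name="zero-count", star_count=0, updated_at=now),
    ExampleRow(id="4", name="large-count", star_count=1_000_000, updated_at=now),
    ExampleRow(id="5", name="non-ascii-名前", star_count=3, updated_at=now),
]

# Keep the seed small: the guideline is 5-10 rows.
assert 5 <= len(seed_rows) <= 10, "seed files should stay small"
```

Each row exists to exercise one case (a null, a zero, a large value, non-ASCII text), which is what covering multiple cases means in practice.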