Commit b95d5ae

[OSO-921] add documentation on dagster about SQLMesh and Seed data (#5069)
* docs(dagster): add documentation on dagster about SQLMesh and Seed data
* docs(dagster): fix PR comments
* docs(dagster): add doc about build time in dagster
1 parent 18aca58 commit b95d5ae

File tree

4 files changed

+127
-27
lines changed

apps/docs/docs/contribute-data/setup/index.md

Lines changed: 38 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -130,10 +130,11 @@ For more details on contributing to OSO, check out
130130

131131
### Verify deployment
132132

133-
Our Dagster deployment should automatically recognize the asset
134-
after merging your pull request to the main branch.
135-
You should be able to find your new asset
136-
in the [global asset list](https://dagster.opensource.observer/assets).
133+
After your pull request is merged into the `main` branch, a new Dagster deployment is automatically triggered. This process typically takes 10-15 minutes.
134+
135+
You can monitor the deployment status and check the last update time for each code location in the [Deployment tab](https://dagster.opensource.observer/locations).
136+
137+
Once the deployment is complete, Dagster will automatically recognize your new asset, and it will appear in the [Global Asset List](https://dagster.opensource.observer/assets).
137138

138139
![Dagster assets](./dagster_assets.png)
139140

@@ -197,7 +198,7 @@ issue](https://github.com/opensource-observer/oso/issues/4840)
197198
#### Available Code Locations
198199

199200
At this moment, the available code locations can be found in the repository at
200-
`warehouse/oso_dagster/definitions/`):
201+
`warehouse/oso_dagster/definitions/`:
201202

202203
- `sqlmesh`: This is the code location for _any_ assets
203204
related to sqlmesh. This is essentially anything that depends on the
@@ -257,3 +258,35 @@ Now it should be possible run sqlmesh and dagster locally. When materializing
257258
sqlmesh assets, it might complain about some out of date dependencies. Since we
258259
ran the local test setup, the data it's depending on should have been added by
259260
the oso local seed setup.
261+
262+
### Seed and Staging data
263+
264+
When creating new Dagster assets, it's important to also write a seed file before integrating the asset into a SQLMesh staging model.
265+
266+
The workflow is as follows:
267+
268+
1. **Write the Asset**:
269+
- Follow the cursor rules for creating new assets.
270+
- Keep column names consistent with the original source.
271+
- Perform minimal normalization and unnesting.
272+
273+
2. **Run Dagster Locally**:
274+
- Confirm that you can materialize the source correctly.
275+
276+
3. **Submit and Merge a PR**:
277+
- Submit a pull request with your changes and merge it into production.
278+
279+
4. **Materialize in Production**:
280+
- Materialize the asset in the production Dagster environment.
281+
282+
5. **Verify Data**:
283+
- Sample the data in BigQuery to confirm it's correct.
284+
285+
6. **Create Seed File and Staging Model**:
286+
- Follow the cursor rules for creating seed files and staging models.
287+
- Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
288+
- If there are date fields, set them to `datetime.now()`.
289+
- Test locally with SQLMesh until there are no errors.
290+
291+
7. **Submit and Merge a PR**:
292+
- Submit a pull request with the seed file and staging model and merge it into production.

apps/docs/docs/guides/dagster.md

Lines changed: 3 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -92,27 +92,8 @@ into our production keystore.
9292

9393
### Restating SQLMesh models
9494

95-
:::warning
96-
**DO NOT RESTATE SQLMesh models without approval!**
97-
:::
98-
99-
If you need to restate a SQLMesh model, you can do so via the Dagster UI.
100-
101-
Select the job, e.g., `sqlmesh_all_assets`.
95+
To learn how to restate SQLMesh models, check the [SQLMesh ops guide](./ops/dagster.md).
10296

103-
Then select the dropdown menu next to the **Materialize all** button and click **Open launchpad**.
104-
105-
Update the config to include the model you want to restate, for example:
106-
107-
```yaml
108-
ops:
109-
sqlmesh_project:
110-
config:
111-
restate_by_entity_category: false
112-
restate_models:
113-
- oso.stg_github__XYZ
114-
- oso.stg_github__XYZ_2
115-
skip_tests: false
116-
```
97+
### Seed Data
11798

118-
This will restate the `stg_github__XYZ` and `stg_github__XYZ_2` staging models and all downstream SQLMesh models in the warehouse.
99+
To test the integration between Dagster assets and SQLMesh models, check the [Seed Data section](../contribute-data/index.md).
Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
1+
---
2+
title: Dagster Playbook
3+
sidebar_position: 2
4+
---
5+
6+
# Dagster Playbook
7+
8+
This guide outlines common operations for working with [Dagster](https://dagster.io/) in the OSO project.
9+
10+
## SQLMesh Integration
11+
12+
Our main data models are materialized using SQLMesh. In most cases, you can trigger the `sqlmesh_all_assets` job with its default configuration to update the models.
13+
14+
### Restatements
15+
16+
:::warning
17+
**DO NOT RESTATE SQLMesh models without approval!**
18+
:::
19+
20+
If you need to run a restatement, you will need to edit the configuration of the `sqlmesh_all_assets` job. Select the dropdown menu next to the **Materialize all** button and click **Open launchpad**.
21+
22+
There are two ways to specify which models to restate:
23+
24+
- **By Entity Category**: Set `restate_by_entity_category: true` and specify a list of categories to restate. You can assign categories to models using the `entity_category=category_name` tag.
25+
- **By Model Name**: Provide a list of model names under the `restate_models` configuration. Remember to prefix the model name with `oso.`, for example: `oso.int_events__superchain_internal_transactions`. When using this method, all SQLMesh [model selection features](https://sqlmesh.readthedocs.io/en/stable/guides/model_selection/) can be used.
26+
27+
Dagster jobs have a default of three retry attempts. However, retries use the same configuration. If a job fails mid-process, cancel the retry and trigger a new run with the correct configuration to avoid restating models multiple times.
28+
29+
An example configuration:
30+
31+
```yaml
32+
ops:
33+
sqlmesh_project:
34+
config:
35+
restate_by_entity_category: false
36+
restate_models:
37+
- oso.stg_github__XYZ
38+
skip_tests: false
39+
use_dev_environment: false
40+
```
41+
42+
This will restate the `stg_github__XYZ` staging model and all downstream SQLMesh models in the warehouse.
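
As a rough illustration of the by-model-name method, the sketch below builds the same configuration as a Python dictionary and adds the `oso.` prefix when it is missing. The helper name `build_restate_config` is hypothetical and not part of the OSO codebase; the keys are copied from the YAML example, and the helper assumes plain model names rather than SQLMesh selector expressions.

```python
from typing import Any


def build_restate_config(
    models: list[str], use_dev_environment: bool = False
) -> dict[str, Any]:
    """Build a launchpad-style config that restates models by name.

    Hypothetical helper: it only handles plain model names, not
    SQLMesh selector expressions such as wildcards.
    """
    # Model names must carry the `oso.` prefix; add it when missing.
    prefixed = [m if m.startswith("oso.") else f"oso.{m}" for m in models]
    return {
        "ops": {
            "sqlmesh_project": {
                "config": {
                    "restate_by_entity_category": False,
                    "restate_models": prefixed,
                    "skip_tests": False,
                    "use_dev_environment": use_dev_environment,
                }
            }
        }
    }


config = build_restate_config(["stg_github__XYZ"])
```

Because retries reuse the same configuration, it may help to review the generated `restate_models` list before launching the run.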
43+
44+
### Branching with Tags
45+
46+
We use Nessie's branching feature to ensure data consumers always have access to stable data. We maintain a `consumer` tag that points to a stable version of the data for public API consumers, while the `main` branch is actively updated.
47+
48+
Our producer, Trino, has two catalogs:
49+
50+
- `iceberg`: Points to the `main` branch.
51+
- `iceberg_consumer`: Points to the `consumer` tag.
52+
53+
After a successful run of the `sqlmesh_all_assets` job and data verification, run the `nessie_consumer_tag_job` to update the `consumer` tag to the latest `main` commit. You can also specify a particular hash in the `to_hash` configuration to move the tag to a specific commit.
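
The tag update can be pictured with a toy in-memory model. This is only a sketch of the idea, not the Nessie API or the implementation of `nessie_consumer_tag_job`: the function `move_consumer_tag`, the `refs` dictionary, and the commit hashes are all made up for illustration.

```python
from typing import Optional


def move_consumer_tag(
    refs: dict[str, str], to_hash: Optional[str] = None
) -> dict[str, str]:
    """Return a copy of refs with `consumer` moved to to_hash.

    When to_hash is omitted, the tag moves to the current `main` commit.
    """
    updated = dict(refs)  # leave the caller's mapping untouched
    updated["consumer"] = to_hash if to_hash is not None else refs["main"]
    return updated


# Hypothetical state: `main` has advanced past the stable `consumer` tag.
refs = {"main": "c3", "consumer": "c1"}
latest = move_consumer_tag(refs)                 # consumer follows main
pinned = move_consumer_tag(refs, to_hash="c2")   # consumer pinned to a chosen commit
```

In this picture, the `iceberg_consumer` catalog keeps serving whatever commit `consumer` points to until the tag is moved.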
54+
55+
## Asset Development Workflow
56+
57+
When creating new Dagster assets, it's important to also write a seed file before integrating the asset into SQLMesh.
58+
59+
The workflow is as follows:
60+
61+
1. **Write the Asset**:
62+
- Follow the cursor rules for creating new assets.
63+
- Keep column names consistent with the original source.
64+
- Perform minimal normalization and unnesting.
65+
66+
2. **Run Dagster Locally**:
67+
- Confirm that you can materialize the source correctly.
68+
69+
3. **Submit and Merge a PR**:
70+
- Submit a pull request with your changes and merge it into production.
71+
72+
4. **Materialize in Production**:
73+
- Materialize the asset in the production Dagster environment.
74+
75+
5. **Verify Data**:
76+
- Sample the data in BigQuery to confirm it's correct.
77+
78+
6. **Create Seed File and Staging Model**:
79+
- Follow the cursor rules for creating seed files and staging models.
80+
- Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
81+
- If there are date fields, set them to `datetime.now()`.
82+
- Test locally with SQLMesh until there are no errors.
83+
84+
7. **Submit and Merge a PR**:
85+
- Submit a pull request with the seed file and staging model and merge it into production.
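
Step 6 can be sketched as follows, assuming the sampled rows are available as plain dictionaries. The column names, values, and the helper `make_seed_rows` are hypothetical; the actual seed file format is defined by the cursor rules, which this sketch does not reproduce.

```python
from datetime import datetime


def make_seed_rows(sampled_rows, date_fields):
    """Copy sampled rows, setting every date field to the current time."""
    now = datetime.now()
    return [
        {key: (now if key in date_fields else value) for key, value in row.items()}
        for row in sampled_rows
    ]


# Hypothetical sample of 5 rows covering several cases (nulls, zero, large values).
sampled = [
    {"id": 1, "name": "repo-a", "stars": 10, "created_at": "2021-01-01"},
    {"id": 2, "name": "repo-b", "stars": 0, "created_at": "2022-06-15"},
    {"id": 3, "name": None, "stars": 5, "created_at": "2020-03-09"},
    {"id": 4, "name": "repo-d", "stars": None, "created_at": "2019-11-30"},
    {"id": 5, "name": "repo-e", "stars": 12345, "created_at": "2023-08-21"},
]

seed_rows = make_seed_rows(sampled, date_fields={"created_at"})
```

Only the date fields are overwritten; every other column keeps the value sampled from BigQuery.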

apps/docs/docs/guides/ops/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,4 +14,5 @@ The OSO architecture runs on the following platforms:
1414
- [Hasura](./hasura): GraphQL API service
1515
- [Supabase](./supabase): user authentication and user database
1616
- [Ops Video Guides](./video-guides.md): Ops video guides for managing the infrastructure
17+
- [Dagster Playbook](./dagster): Common tasks executed in Dagster
1718
- [Archive Node Guide](./archive-nodes.md): Guide on creating a new archive node
