[OSO-921] add documentation on dagster about SQLMesh and Seed data (#5069)

IcaroG · web-flow · commit b95d5aeda354 · 2025-09-30T15:39:53.000Z
* docs(dagster): add documentation on dagster about SQLMesh and Seed data

* docs(dagster): fix PR comments

* docs(dagster): add doc about build time in dagster
diff --git a/apps/docs/docs/contribute-data/setup/index.md b/apps/docs/docs/contribute-data/setup/index.md
@@ -130,10 +130,11 @@ For more details on contributing to OSO, check out
 
 ### Verify deployment
 
-Our Dagster deployment should automatically recognize the asset
-after merging your pull request to the main branch.
-You should be able to find your new asset
-in the [global asset list](https://dagster.opensource.observer/assets).
+After your pull request is merged into the `main` branch, a new Dagster deployment is automatically triggered. This process typically takes 10-15 minutes.
+
+You can monitor the deployment status and check the last update time for each code location in the [Deployment tab](https://dagster.opensource.observer/locations).
+
+Once the deployment is complete, Dagster will automatically recognize your new asset, and it will appear in the [Global Asset List](https://dagster.opensource.observer/assets).
 
 ![Dagster assets](./dagster_assets.png)
 
@@ -197,7 +198,7 @@ issue](https://github.com/opensource-observer/oso/issues/4840)
 #### Available Code Locations
 
 At this moment, the available code locations can be found in the repository at
-`warehouse/oso_dagster/definitions/`):
+`warehouse/oso_dagster/definitions/`:
 
 - `sqlmesh`: This is the code location for _any_ assets
   related to sqlmesh. This is essentially anything that depends on the
@@ -257,3 +258,35 @@ Now it should be possible run sqlmesh and dagster locally. When materializing
 sqlmesh assets, it might complain about some out of date dependencies. Since we
 ran the local test setup, the data it's depending on should have been added by
 the oso local seed setup.
+
+### Seed and Staging data
+
+When creating a new Dagster asset, it's important to also write a seed file before integrating the asset into a SQLMesh staging model.
+
+The workflow is as follows:
+
+1.  **Write the Asset**:
+    - Follow the cursor rules for creating new assets.
+    - Keep column names consistent with the original source.
+    - Perform minimal normalization and unnesting.
+
+2.  **Run Dagster Locally**:
+    - Confirm that you can materialize the source correctly.
+
+3.  **Submit and Merge a PR**:
+    - Submit a pull request with your changes and merge it into production.
+
+4.  **Materialize in Production**:
+    - Materialize the asset in the production Dagster environment.
+
+5.  **Verify Data**:
+    - Sample the data in BigQuery to confirm it's correct.
+
+6.  **Create Seed File and Staging Model**:
+    - Follow the cursor rules for creating seed files and staging models.
+    - Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
+    - If there are date fields, set them to `datetime.now()`.
+    - Test locally with SQLMesh until there are no errors.
+
+7.  **Submit and Merge a PR**:
+    - Submit a pull request with the seed file and staging model and merge it into production.
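The date-handling advice in step 6 above can be illustrated with a short Python sketch. It is not the OSO seed file format, which the cursor rules define and this diff does not show. The row contents, the column names, and the `refresh_date_fields` helper are all hypothetical; the only idea taken from the text is that a small sample of real rows has its date fields set to `datetime.now()`.

```python
from datetime import datetime

# Hypothetical sample of rows copied from BigQuery. The columns are
# illustrative only; a real seed keeps the source's own column names.
SAMPLED_ROWS = [
    {"id": 1, "name": "alpha", "created_at": "2024-01-05T10:00:00"},
    {"id": 2, "name": "beta", "created_at": "2024-02-11T08:30:00"},
    {"id": 3, "name": None, "created_at": "2024-03-02T23:59:59"},  # null case
    {"id": 4, "name": "delta", "created_at": "2023-12-31T00:00:00"},
    {"id": 5, "name": "", "created_at": "2024-04-20T12:00:00"},  # empty string case
]

# Assumed list of date columns in the sample.
DATE_FIELDS = ("created_at",)


def refresh_date_fields(rows, date_fields, now=None):
    """Return copies of the rows with each listed date field set to one timestamp."""
    now = now or datetime.now()
    return [
        {**row, **{field: now for field in date_fields if field in row}}
        for row in rows
    ]


seed_rows = refresh_date_fields(SAMPLED_ROWS, DATE_FIELDS)
```

Using one shared timestamp for every row keeps the seed internally consistent, and the optional `now` argument makes the helper deterministic in tests.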
diff --git a/apps/docs/docs/guides/dagster.md b/apps/docs/docs/guides/dagster.md
@@ -92,27 +92,8 @@ into our production keystore.
 
 ### Restating SQLMesh models
 
-:::warning
-**DO NOT RESTATE SQLMesh models without approval!**
-:::
-
-If you need to restate a SQLMesh model, you can do so via the Dagster UI.
-
-Select the job, e.g., `sqlmesh_all_assets`.
+To learn how to restate SQLMesh models, check the [Dagster Playbook](./ops/dagster.md).
 
-Then select the dropdown menu next to the **Materialize all** button and click **Open launchpad**.
-
-Update the config to include the model you want to restate, for example:
-
-```yaml
-ops:
-  sqlmesh_project:
-    config:
-      restate_by_entity_category: false
-      restate_models:
-        - oso.stg_github__XYZ
-        - oso.stg_github__XYZ_2
-      skip_tests: false
-```
+### Seed Data
 
-This will restate the `stg_github__XYZ` and `stg_github__XYZ_2` staging models and all downstream SQLMesh models in the warehouse.
+To test the integration between Dagster assets and SQLMesh models, check the [Seed Data section](../contribute-data/index.md).
diff --git a/apps/docs/docs/guides/ops/dagster.md b/apps/docs/docs/guides/ops/dagster.md
@@ -0,0 +1,85 @@
+---
+title: Dagster Playbook
+sidebar_position: 2
+---
+
+# Dagster Playbook
+
+This guide outlines common operations for working with [Dagster](https://dagster.io/) in the OSO project.
+
+## SQLMesh Integration
+
+Our main data models are materialized using SQLMesh. In most cases, you can trigger the `sqlmesh_all_assets` job with its default configuration to update the models.
+
+### Restatements
+
+:::warning
+**DO NOT RESTATE SQLMesh models without approval!**
+:::
+
+If you need to run a restatement, you will need to edit the configuration of the `sqlmesh_all_assets` job. Select the dropdown menu next to the **Materialize all** button and click **Open launchpad**.
+
+There are two ways to specify which models to restate:
+
+- **By Entity Category**: Set `restate_by_entity_category: true` and specify a list of categories to restate. You can assign categories to models using the `entity_category=category_name` tag.
+- **By Model Name**: Provide a list of model names under the `restate_models` configuration. Remember to prefix the model name with `oso.`, for example: `oso.int_events__superchain_internal_transactions`. When using this method, all SQLMesh [model selection features](https://sqlmesh.readthedocs.io/en/stable/guides/model_selection/) can be used.
+
+Dagster jobs retry up to three times by default, and each retry reuses the configuration of the original run. If a restatement job fails mid-process, cancel the retry and trigger a new run with the correct configuration to avoid restating models multiple times.
+
+An example configuration:
+
+```yaml
+ops:
+  sqlmesh_project:
+    config:
+      restate_by_entity_category: false
+      restate_models:
+        - oso.stg_github__XYZ
+      skip_tests: false
+      use_dev_environment: false
+```
+
+This will restate the `stg_github__XYZ` staging model and all downstream SQLMesh models in the warehouse.
+
+### Branching with Tags
+
+We use Nessie's branching feature to ensure data consumers always have access to stable data. We maintain a `consumer` tag that points to a stable version of the data for public API consumers, while the `main` branch is actively updated.
+
+Our producer, Trino, has two catalogs:
+
+- `iceberg`: Points to the `main` branch.
+- `iceberg_consumer`: Points to the `consumer` tag.
+
+After a successful run of the `sqlmesh_all_assets` job and data verification, run the `nessie_consumer_tag_job` to update the `consumer` tag to the latest `main` commit. You can also specify a particular hash in the `to_hash` configuration to move the tag to a specific commit.
+
+## Asset Development Workflow
+
+When creating a new Dagster asset, it's important to also write a seed file before integrating the asset into SQLMesh.
+
+The workflow is as follows:
+
+1.  **Write the Asset**:
+    - Follow the cursor rules for creating new assets.
+    - Keep column names consistent with the original source.
+    - Perform minimal normalization and unnesting.
+
+2.  **Run Dagster Locally**:
+    - Confirm that you can materialize the source correctly.
+
+3.  **Submit and Merge a PR**:
+    - Submit a pull request with your changes and merge it into production.
+
+4.  **Materialize in Production**:
+    - Materialize the asset in the production Dagster environment.
+
+5.  **Verify Data**:
+    - Sample the data in BigQuery to confirm it's correct.
+
+6.  **Create Seed File and Staging Model**:
+    - Follow the cursor rules for creating seed files and staging models.
+    - Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
+    - If there are date fields, set them to `datetime.now()`.
+    - Test locally with SQLMesh until there are no errors.
+
+7.  **Submit and Merge a PR**:
+    - Submit a pull request with the seed file and staging model and merge it into production.
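Returning to the restatement configuration in the Dagster Playbook above: the launchpad config for restating by model name can also be assembled in Python before it is pasted into the UI. This is a minimal sketch. The `build_restate_config` helper is hypothetical and not part of the OSO codebase; the keys mirror the YAML example in the playbook. The `oso.` prefix check is a deliberate simplification and would reject SQLMesh selector expressions that do not start with the prefix.

```python
def build_restate_config(models, skip_tests=False, use_dev_environment=False):
    """Build a run config dict mirroring the playbook's YAML example.

    Hypothetical helper: it only assembles the dictionary and checks the
    `oso.` prefix that the playbook asks for on plain model names.
    """
    unprefixed = [m for m in models if not m.startswith("oso.")]
    if unprefixed:
        raise ValueError(f"model names must start with 'oso.': {unprefixed}")
    return {
        "ops": {
            "sqlmesh_project": {
                "config": {
                    "restate_by_entity_category": False,
                    "restate_models": list(models),
                    "skip_tests": skip_tests,
                    "use_dev_environment": use_dev_environment,
                }
            }
        }
    }


config = build_restate_config(["oso.stg_github__XYZ"])
```

Failing fast on a missing prefix catches a common typo before a run is launched, which matters because retries reuse the same configuration.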
diff --git a/apps/docs/docs/guides/ops/index.md b/apps/docs/docs/guides/ops/index.md
@@ -14,4 +14,5 @@ The OSO architecture runs on the following platforms:
 - [Hasura](./hasura): GraphQL API service
 - [Supabase](./supabase): user authentication and user database
 - [Ops Video Guides](./video-guides.md): Ops video guides for managing the infrastructure
+- [Dagster Playbook](./dagster): Common tasks executed in Dagster
 - [Archive Node Guide](./archive-nodes.md): Guide on creating a new archive node