Commit 9650a1b

Governance docs (#371)
* Draft governance and task design docs
* Fixing docs TODOs
* Review changes

1 parent: 771dda4

File tree: 6 files changed, +226 −4 lines

README.md

Lines changed: 0 additions & 2 deletions

@@ -48,8 +48,6 @@ To retrieve the variables for a stage that has been previously deployed, the sec
 > [!IMPORTANT]
 > Be careful not to check in `.env` (or whatever you called your env file) when committing work.
 
-Currently, the client id and domain of an existing Cognito user pool programmatic client must be supplied in [configuration](ingest_api/infrastructure/config.py) as `VEDA_CLIENT_ID` and `VEDA_COGNITO_DOMAIN` (the [veda-auth project](https://github.com/NASA-IMPACT/veda-auth) can be used to deploy a Cognito user pool and client). To dispense auth tokens via the workflows API swagger docs, an administrator must add the ingest API lambda URL to the allowed callbacks of the Cognito client.
-
 
 ### Setup a local SM2A development environment
dags/veda_data_pipeline/groups/discover_group.py

Lines changed: 5 additions & 0 deletions

@@ -24,6 +24,11 @@ def discover_from_s3_task(event: dict={}, ti=None, payload: dict={}, prev_start_
         **event,
         **payload,
     }
+
+    # TODO: verify that scheduled DAGs that include a discovery step will work without this config mutation
+    if not ti.dag_run.conf:
+        ti.dag_run.conf = config
+
     if event.get("schedule") and prev_start_date_success:
         config["last_successful_execution"] = prev_start_date_success.isoformat()
         # (event, chunk_size=2800, role_arn=None, bucket_output=None):
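The TODO in this hunk can be checked without a live scheduler. Below is a minimal sketch (not the project's test suite) of how the guard might be unit tested with stubbed Airflow objects; `apply_conf_guard` is a hypothetical helper mirroring the added lines:

```python
from types import SimpleNamespace

def apply_conf_guard(ti, config):
    """Mirror the added guard: only seed dag_run.conf when it is empty."""
    if not ti.dag_run.conf:
        ti.dag_run.conf = config
    return ti.dag_run.conf

# Manually triggered run: conf is already populated, so it must not be overwritten.
manual_ti = SimpleNamespace(dag_run=SimpleNamespace(conf={"bucket": "manual"}))
assert apply_conf_guard(manual_ti, {"bucket": "default"}) == {"bucket": "manual"}

# Scheduled run: conf starts empty, so it is seeded from the task's config.
scheduled_ti = SimpleNamespace(dag_run=SimpleNamespace(conf={}))
assert apply_conf_guard(scheduled_ti, {"bucket": "default"}) == {"bucket": "default"}
```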
Lines changed: 24 additions & 0 deletions

@@ -33,6 +33,30 @@ def example_dag():
     foo >> bar
 ```
 
+Additionally, we prefer to use bitshift operators (`>>`) to define task dependencies where there are no parameterized dependencies. This maintains readability and consistency across DAGs. The following examples illustrate the preferred way to define task dependencies:

Preferred:

```python
output_a = task_a()
output_b = task_b(input=output_a)
```

Not preferred:

```python
def task_b(ti=None):
    output_a = ti.xcom_pull(task_ids="task_a")  # hidden dependency on task_a

task_a >> task_b
```

But if the dependency is orchestration-only, it's fine to do something like:

```python
output_c = task_a() >> task_b() >> task_c()
output_d = task_d(input=output_c)
```

## Naming Conventions
Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@
# SM2A Project Governance

## 1. Introduction

This is an evolving document that outlines the governance structure, development workflow, and best practices for the SM2A project. It serves as a guide for contributors to understand how to participate effectively in the project.

## 2. Code Ownership

* **Module Ownership**
  * `self-managed-apache-airflow` (Terraform module for underlying infrastructure patterns): governed by **@amarouane-ABDELHAK**
* **Contact**
  * Open an issue in GitHub.
  * Reach out on Slack in **#veda-data-services** (VEDA data services team).

## 3. Development Workflow

### 3.1 Branching Strategy

* `dev` is the primary branch used for active development, deployment, and release tagging.
* Releases are tagged directly off `dev`.
* New branches should be prefixed with descriptors such as `fix/`, `feature/`, `demo/`, etc.
* Delete feature branches immediately after they are merged.
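As a sketch, the branch lifecycle above might look like the following (run here in a throwaway repository; branch and commit names are illustrative):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci
git checkout -q -b dev                         # dev is the primary branch
git commit -q --allow-empty -m "chore: initial commit"
git checkout -q -b feature/add-ingest-dag      # prefix new branches (fix/, feature/, demo/)
git commit -q --allow-empty -m "feat(dags): add ingest DAG"
git checkout -q dev
git merge -q --no-edit feature/add-ingest-dag  # normally done via a reviewed PR
git branch -d feature/add-ingest-dag           # delete the branch immediately after merge
git branch --list
```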
### 3.2 Code Review Guidelines

* Every pull request must receive at least one approving review from a maintainer who is not the author.
* If multiple developers contributed to a PR's commits, at least one review should come from an uninvolved maintainer.
* Reviewers focus on correctness, readability, security, and alignment with project conventions and docs.
* Authors should aim to keep PRs small and focused; larger changes should be split when possible.
* All tests must pass before a PR is merged.
* Review comments should be constructive and reference specific code lines or documentation.
* Use GitHub suggestions or follow‑up commits for requested changes.

### 3.3 Code Style & Conventions

* Follow the best practices for authoring and maintaining Airflow DAGs outlined in the project `docs/` directory.

## 4. Testing Standards

### 4.1 Unit Testing

* PRs which add new DAGs or tasks must include at least one unit test written with **pytest**.
* New DAGs must pass the existing tests, which validate the DAG structure and task dependencies.
* Infrastructure changes must run `terraform validate` and succeed before merge.
* The project currently has no formal coverage target, but contributors are encouraged to expand coverage wherever practical.
### 4.2 Integration Testing

* DAGs are automatically tested for structural correctness and import validity.
* The `dev` or `sit` environments should be used for manual integration tests against the `veda-backend` dev environment.

## 5. Environments & Deployment

### 5.1 Environments

* The `dev` environment is used for development and continuous deployment.
* The `sit` environment is used for integration testing of changes that might interfere with continuous changes being made to the `dev` environment.
* Additional "production" environments are deployed using [veda-deploy](https://github.com/NASA-IMPACT/veda-deploy) and are not managed by this project.

### 5.2 Deployment Procedures

* The `dev` environment is continuously deployed using GitHub Actions. Manual deployments are possible but not recommended.
* The `sit` environment is manually deployed using `make sm2a-deploy`. This requires updating the Makefile with a secret name corresponding to the live environment (currently `veda-sm2a-sit-deployment-secrets`).

### 5.3 Releases

* Releases are tagged directly off `dev` and are automatically deployed to the `dev` environment.

### 5.4 Secrets Handling

* Updates to secrets should be made in AWS Secrets Manager, which will automatically update the environment for subsequent deployments.

## 6. Architecture Decision Records

* Architecture decisions are documented in the veda-architecture [repository](https://github.com/NASA-IMPACT/veda-architecture).

## 7. Change Log & Versioning

- A changelog does not exist yet, but should be automated. To support this, we should use [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) to generate a changelog.
- The changelog should be generated automatically using semantic-release.
- To help maintain a clean history, PRs should be rebased and squashed to consolidate unneeded commits before merging.
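A sketch of the commit-message shape this would rely on; the regex below covers only the common `type(scope): subject` form, and the type list is illustrative rather than a project decision:

```python
import re

# Hypothetical validator for Conventional Commits-style messages.
CC_PATTERN = re.compile(r"^(feat|fix|docs|chore|refactor|test|ci)(\([\w-]+\))?(!)?: .+")

def is_conventional(message: str) -> bool:
    """Return True if the message matches the type(scope): subject form."""
    return bool(CC_PATTERN.match(message))

assert is_conventional("feat(dags): add vector discovery DAG")
assert is_conventional("fix: handle empty dag_run.conf")
assert not is_conventional("updated stuff")
```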

docs/contributing/task_design.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Airflow Task Design Guide

*A concise reference for contributors designing, implementing, and reviewing tasks in SM2A projects.*

## Airflow Task Design Goals

| # | Goal | What it means | Why it matters |
| - | - | - | - |
| 1 | **Explicit parameters** | Declare every input (e.g., `bucket: str`, `run_date: datetime`) as a named argument in the TaskFlow function signature. | Readers (and IDEs) know exactly what values the task needs; type hints support linting & autocompletion. |
| 2 | **Direct parameter access** | Pass scalar or simple objects directly; avoid wrapping them in catch‑all dicts or `**kwargs`. | Prevents "mystery meat" payloads and accidental hidden dependencies. |
| 3 | **Multiple named outputs** | Return a `dict` of discrete outputs via `return {"records": df, "count": len(df)}`, leveraging TaskFlow's multiple-outputs feature (implicit when returning a dict, or explicit with `@task(multiple_outputs=True)`). | Downstream tasks can pull *only* what they need and don't need to parse larger objects. |
| 4 | **TaskFlow‑first** | Define tasks with `@task` (TaskFlow) rather than classic operators when writing Python tasks. | Makes tasks testable with `pytest` and keeps DAGs readable. |
| 5 | **Separation of concerns** | Task functions orchestrate **data flow and execution**; computation and logic live in `util` functions/modules imported by the task. | Logic can be unit‑tested in isolation and reused in other tasks. |
| 6 | **Idempotency** | Tasks should safely re‑run without corrupting state; leverage run‑date‑based keys, checksums, or existence checks. | Supports retries & backfills. |
## Recommended Patterns

### Minimal TaskFlow Example

```python
from airflow.decorators import dag, task
from pendulum import datetime

from utils.stac import generate_collection  # util function (external)
from utils.ingest import ingest_collection  # util function (module path assumed)

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={  # DAG-level parameters, rendered into task arguments at runtime
        "collection_id": "sample-collection",
        "description": "Sample STAC collection generated via Airflow",
    },
    tags=["stac", "example"],
)
def stac_collection_dag():

    @task(multiple_outputs=True)
    def build_collection(collection_id: str, description: str) -> dict[str, str]:
        """Generate a STAC collection body and return both body and ID."""
        collection_body = generate_collection(
            collection_id=collection_id,
            description=description,
        )  # heavy lifting happens in utils
        return {"collection_body": collection_body, "collection_id": collection_id}

    # Task invocation: explicit params pulled from dag.params via templating
    outputs = build_collection(
        collection_id="{{ params.collection_id }}",
        description="{{ params.description }}",
    )

    @task()
    def publish_collection(collection_body: dict):
        """Pass the collection to the ingestion API."""
        ingest_collection(collection_body)

    publish_collection(collection_body=outputs["collection_body"])

stac_collection_dag()
```

*Key takeaways:* explicit arg names, `multiple_outputs`, and compute delegated to util functions (`generate_collection`, `ingest_collection`).
### Multiple outputs with `@task(multiple_outputs=True)`

Use when returning more than one value so Airflow stores each key as a separate XCom value:

```python
@task(multiple_outputs=True)
def split_dataset(path: str) -> dict[str, str]:
    train, test = make_splits(path)
    return {"train_path": train, "test_path": test}
```

Downstream tasks access exactly what they need:

```python
outputs = split_dataset(path="s3://bucket/data.csv")
model = train_model(train_path=outputs["train_path"])    # only needs the training data
test_model(model=model, test_path=outputs["test_path"])  # only needs the model and test data
```
### Delegating compute to utils

```python
from veda.utils.stac import generate_collection  # example util function

@task(multiple_outputs=True)
def build_collection(collection_id: str, description: str) -> dict[str, str]:
    """Generate a STAC collection body and return both body and ID."""
    collection_body = generate_collection(
        collection_id=collection_id,
        description=description,
    )  # logic is contained in the util function
    return {"collection_body": collection_body, "collection_id": collection_id}
```
## Anti‑Patterns to Avoid

| Anti‑Pattern | Why to avoid |
| - | - |
| **Monolithic payloads**: outputting multiple values into a dict or JSON and passing them to the next task as a single XCom | Downstream tasks must deserialize the payload and know its key names; incidental tight coupling between tasks. |
| **Hidden parameters**: accessing fields on `kwargs["ti"].xcom_pull()` (or similar) inside tasks | Hides dependencies; makes signatures lie; breaks static analysis & tests. |
| **Heavy logic in the DAG file**: performing data transformations directly in the DAG definition | Complicates refactors; hampers testability; increases DAG parse time. |
| **Non‑idempotent side effects**: tasks that cannot be safely retried or reversed independently | Retries/backfills can cause duplicated data or data loss. |
111+
## Further Reading
112+
* [Airflow 2 TaskFlow API docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/taskflow.html)
113+
* "DAG writing best practices in Apache Airflow" – [Astronomer article](https://www.astronomer.io/docs/learn/dag-best-practices/)
114+
* [Adding a DAG](docs/contributing/add_a_general_dag.md)

docs/operating/ingestion_configuration.md

Lines changed: 0 additions & 2 deletions

@@ -58,5 +58,3 @@ This pipeline is designed to handle the ingestion of both vector and raster data

## Pipeline Behaviour

Since this pipeline can ingest both raster and vector data, the configuration can be modified accordingly. Setting `"vector": true` triggers the `generic_ingest_vector` DAG. If a `collection` is provided, it is used as the table name for ingestion (the `append` extra_flag is recommended when a collection is provided). When no `collection` is provided, a table name is generated from the `id_template` by appending the actual ingested filename (the `overwrite` extra flag is recommended).

Setting `"vector_eis": true` will trigger the EIS Fire specific `ingest_vector` DAG. If neither of these flags is set, raster ingestion will be triggered, with the configuration typically looking like the raster ingestion example above.
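As a hedged illustration of the options above, vector ingestion configurations might look like the following Python dicts; field names other than `vector`, `collection`, and `id_template` are assumptions, and the bucket/prefix values are placeholders:

```python
# Collection provided: the collection name becomes the table name.
vector_config = {
    "collection": "my-vector-collection",  # used as the table name
    "bucket": "my-bucket",                 # placeholder
    "prefix": "vector-data/",              # placeholder
    "vector": True,                        # triggers generic_ingest_vector
    "extra_flags": ["append"],             # recommended when a collection is set
}

# No collection: the table name comes from id_template plus the ingested filename.
id_template_config = {
    "id_template": "my_table_{}",          # filename is appended to form the table name
    "bucket": "my-bucket",
    "prefix": "vector-data/",
    "vector": True,
    "extra_flags": ["overwrite"],          # recommended with id_template
}
```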
