# Airflow Task Design Guide

A concise reference for contributors designing, implementing, and reviewing tasks in SM2A projects.


## Airflow Task Design Goals

| # | Goal | What it means | Why it matters |
| - | - | - | - |
| 1 | **Explicit parameters** | Declare every input (e.g., `bucket: str`, `run_date: datetime`) as a named argument in the TaskFlow function signature. | Readers (and IDEs) know exactly what values the task needs; type hints support linting & autocompletion. |
| 2 | **Direct parameter access** | Pass scalar / simple objects directly; avoid wrapping them in catch‑all dicts or `**kwargs`. | Prevents "mystery meat" payloads and accidental hidden dependencies. |
| 3 | **Multiple named outputs** | Return a `dict` of discrete outputs via `return {"records": df, "count": len(df)}`, leveraging TaskFlow's multiple-return feature (implicit when returning a dict, or explicit with `@task(multiple_outputs=True)`). | Downstream tasks can pull *only* what they need and don't need to parse larger objects. |
| 4 | **TaskFlow‑first** | Define tasks with `@task` (TaskFlow) rather than classic operators when writing Python tasks. | Makes tasks testable with `pytest` and keeps DAGs readable. |
| 5 | **Separation of concerns** | Task functions orchestrate **data flow and execution**; computation and logic live in `util` functions/modules imported by the task. | Logic can be unit‑tested in isolation and reused in other tasks. |
| 6 | **Idempotency** | Tasks should safely re‑run without corrupting state; leverage run‑date‑based keys, checksums, or existence checks. | Supports retries & backfills. |

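Goal 6 can be sketched without any Airflow machinery. A minimal illustration (function names and the dict-backed store are hypothetical stand-ins for real object storage) of a run‑date‑based key combined with an existence check:

```python
def object_key(collection_id: str, run_date: str) -> str:
    """Deterministic key: the same logical run always targets the same object."""
    return f"{collection_id}/{run_date}/items.json"

def write_if_absent(store: dict, key: str, payload: str) -> bool:
    """Skip the write when the key already exists, so re-runs are safe."""
    if key in store:
        return False  # a retry or backfill hits this branch instead of duplicating data
    store[key] = payload
    return True
```

Running the task twice for the same `run_date` produces exactly one object; retries and backfills simply short-circuit at the existence check.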

## Recommended Patterns

### Minimal TaskFlow Example

```python
from airflow.decorators import dag, task
from pendulum import datetime
from utils.stac import generate_collection  # util function (external)
from utils.ingest import ingest_collection  # util function (external; path illustrative)

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={  # DAG‑level parameters, resolved at runtime via templating
        "collection_id": "sample-collection",
        "description": "Sample STAC collection generated via Airflow",
    },
    tags=["stac", "example"],
)
def stac_collection_dag():

    @task(multiple_outputs=True)
    def build_collection(collection_id: str, description: str) -> dict:
        """Generate a STAC collection body and return both body and ID."""
        collection_body = generate_collection(
            collection_id=collection_id,
            description=description,
        )  # heavy lifting happens in utils
        return {"collection_body": collection_body, "collection_id": collection_id}

    # Task invocation – explicit, templated args resolved from dag.params at runtime
    outputs = build_collection(
        collection_id="{{ params.collection_id }}",
        description="{{ params.description }}",
    )

    @task()
    def publish_collection(collection_body: dict):
        """Pass collection to ingestion API."""
        ingest_collection(collection_body)

    publish_collection(collection_body=outputs["collection_body"])

stac_collection_dag()
```

*Key takeaways:* explicit argument names, `multiple_outputs`, and util functions (`generate_collection`, `ingest_collection`) doing the heavy lifting.
| 65 | + |
| 66 | +### Multiple outputs with `@task(multiple_outputs=True)` |
| 67 | + |
| 68 | +Use when returning more than one value so Airflow stores each key as a separate XCom value: |
| 69 | + |
| 70 | +```python |
| 71 | +@task(multiple_outputs=True) |
| 72 | +def split_dataset(path: str) -> dict[str, str]: |
| 73 | + train, test = make_splits(path) |
| 74 | + return {"train_path": train, "test_path": test} |
| 75 | +``` |
| 76 | + |
| 77 | +Down‑stream tasks access exactly what they need: |
| 78 | + |
| 79 | +```python |
| 80 | +training_data, test_data = split_dataset(path="s3://bucket/data.csv") |
| 81 | +train_model(train_path=training_data) # only need training data from the first task |
| 82 | +test_model(model=train_model, test_path=test_data) # only need test data from the first task |
| 83 | +``` |

### Delegating compute to utils

```python
from airflow.decorators import task
from veda.utils.stac import generate_collection  # example util function

@task(multiple_outputs=True)
def build_collection(collection_id: str, description: str) -> dict:
    """Generate a STAC collection body and return both body and ID."""
    collection_body = generate_collection(
        collection_id=collection_id,
        description=description,
    )  # logic is contained in the util function
    return {"collection_body": collection_body, "collection_id": collection_id}
```
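Because the logic lives in a plain function, it can be unit-tested with `pytest` and no Airflow context. A sketch with an inline stand-in for `generate_collection` (the real util is external; the fields shown are illustrative):

```python
# Stand-in for utils.stac.generate_collection, inlined so the test is self-contained
def generate_collection(collection_id: str, description: str) -> dict:
    return {
        "type": "Collection",
        "id": collection_id,
        "description": description,
    }

# Plain pytest test: no DAG run, no scheduler, no task context required
def test_generate_collection():
    body = generate_collection("sample-collection", "demo")
    assert body["id"] == "sample-collection"
    assert body["type"] == "Collection"
```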


## Anti‑Patterns to Avoid

| Anti‑Pattern | Why to avoid |
| - | - |
| **Monolithic payloads**: packing multiple values into a single dict or JSON blob and passing it to the next task as one XCom | Downstream tasks must deserialize it and know the key names; creates incidental tight coupling between tasks. |
| **Hidden parameters**: accessing fields via `kwargs["ti"].xcom_pull()` (or similar) inside tasks | Hides dependencies; makes signatures lie; breaks static analysis & tests. |
| **Heavy logic in the DAG file**: performing data transformations directly in the DAG definition | Complicates refactors; hampers testability; increases DAG parse time. |
| **Non‑idempotent side effects**: writes or mutations that are not safe to repeat (e.g., blind appends with no run‑keyed paths or existence checks) | Retries/backfills can cause duplicated data or data loss. |


## Further Reading
* [Airflow 2 TaskFlow API docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/taskflow.html)
* "DAG writing best practices in Apache Airflow" – [Astronomer article](https://www.astronomer.io/docs/learn/dag-best-practices/)
* [Adding a DAG](docs/contributing/add_a_general_dag.md)