This document provides guidance for AI agents working on the basedosdados/pipelines repository. It covers project structure, workflows, conventions, and important rules to follow.
This repository contains:

- Prefect flows (`pipelines/datasets/`) — data capture and ingestion pipelines for frequently updated datasets.
- dbt models (`models/`) — ELT/ETL transformations that materialize data in BigQuery for less frequently updated datasets.

The primary datalake target is Google BigQuery, accessed via the `basedosdados` Python SDK.
- Python version: `>=3.10,<3.11`
- Package manager: `uv`
- Install dependencies: `uv sync`
- Install pre-commit hooks: `uv run pre-commit install --install-hooks`
- Install dbt packages: `uv run dbt deps`
- dbt version: `dbt-core==1.5.6` with `dbt-bigquery==1.5.9`
- Dockerfile: the production runtime is defined in `Dockerfile`. It installs system dependencies. When debugging missing system libraries, check the Dockerfile first — native dependencies must be added there, not in `pyproject.toml`.
```
pipelines/
├── pipelines/
│   └── datasets/            # One directory per dataset (Prefect flows)
│       └── <dataset_id>/
│           ├── __init__.py
│           ├── constants.py # Dataset-level constants
│           ├── flows.py     # Prefect flow definitions
│           ├── schedules.py # Prefect schedule definitions
│           ├── tasks.py     # Prefect task definitions
│           └── utils.py     # Helper functions
├── models/                  # dbt models (one dir per dataset)
│   └── <dataset_id>/
│       ├── <dataset_id>__<table_id>.sql
│       └── schema.yml
├── macros/                  # dbt macros
├── tests-dbt/generic/       # Custom generic dbt tests
├── dbt_project.yml
├── manage.py                # CLI for creating/listing pipelines
├── pyproject.toml
└── profiles.yml
```
**Important:** This project uses Prefect v0.15.9, a very old Prefect 1.x release. The API is completely different from Prefect 2.x/3.x. Do not use Prefect 2/3 patterns or documentation. Always refer to the Prefect 0.15.x docs and the existing code in `pipelines/datasets/` as reference.
Use `manage.py` to scaffold a new pipeline from the template:

```sh
uv run manage.py add-pipeline <dataset_id>
```

- `<dataset_id>` must be in snake_case and must be unique.
- To list existing pipelines: `uv run manage.py list-pipelines`
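The snake_case rule can be checked mechanically. A minimal sketch — the helper name and regex are illustrative, and `manage.py` performs its own validation:

```python
import re

# Illustrative validator mirroring the naming rule: lowercase snake_case,
# starting with a letter (e.g. "br_ibge_pib"). Not part of the repo.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if dataset_id is valid snake_case."""
    return bool(SNAKE_CASE.match(dataset_id))
```

For example, `is_valid_dataset_id("br_ibge_pib")` passes, while `"BrIbgePib"` and `"_bad"` fail.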
- `flows.py`: Define Prefect `Flow` objects. Each flow must be imported in the parent `__init__.py`.
- `tasks.py`: Define Prefect `Task` objects.
- `schedules.py`: Define `Schedule` objects linked to flows.
- `constants.py`: Use a `Constants` enum or plain constants — no hardcoded values elsewhere.
- `utils.py`: Pure helper functions with no Prefect decorators.
Create a `test.py` at the repo root:

```python
from pipelines.datasets.<dataset_id>.flows import flow
from pipelines.utils.utils import run_local

run_local(flow, parameters={"param": "val"})
```

Run with: `uv run test.py`
- Copy `.env.example` to `.env` and fill in `GOOGLE_APPLICATION_CREDENTIALS` and `VAULT_TOKEN`.
- Load variables: `source .env`
- Ensure `~/.prefect/auth.toml` exists with `api_key` and `tenant_id`.
- Create `test.py` using `run_cloud` and run with `uv run test.py`.
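A cloud run fails in confusing ways when these variables are missing, so it can help to fail fast. A minimal pre-flight sketch, assuming only the two variables named above — the helper itself is illustrative, not part of the repo's tooling:

```python
import os

# Illustrative pre-flight check for the cloud-run variables from .env.
REQUIRED_VARS = ("GOOGLE_APPLICATION_CREDENTIALS", "VAULT_TOKEN")


def missing_env_vars(required: tuple[str, ...] = REQUIRED_VARS) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

Calling `missing_env_vars()` at the top of `test.py` and raising on a non-empty result turns a cryptic auth error into a clear message.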
SQL files follow the pattern `<dataset_id>__<table_id>.sql` (double-underscore separator).

All models must use this macro to reference staging data:

```sql
select col_name
from {{ set_datalake_project("<dataset_id>_staging.<table_id>") }}
```

- Do not use `set_datalake_project` for joins. Joins must reference production tables directly: `basedosdados.<dataset_id>.<table_id>`.
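The double-underscore naming convention above can be parsed mechanically, which is handy in scripts that map model files back to BigQuery tables. A sketch — the helper is hypothetical, not a repo utility:

```python
def split_model_filename(filename: str) -> tuple[str, str]:
    """Split '<dataset_id>__<table_id>.sql' into (dataset_id, table_id).

    Illustrative helper; raises ValueError when the name does not follow
    the double-underscore convention.
    """
    stem, dot, ext = filename.rpartition(".")
    if ext != "sql" or not dot:
        raise ValueError(f"not a .sql file: {filename}")
    dataset_id, sep, table_id = stem.partition("__")
    if not sep or not dataset_id or not table_id:
        raise ValueError(f"missing '__' separator: {filename}")
    return dataset_id, table_id
```

For example, `split_model_filename("br_ibge_pib__municipio.sql")` yields `("br_ibge_pib", "municipio")`.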
```sh
# Single model by name
dbt run --select <dataset_id>__<table_id>

# All models in a dataset directory
dbt run --select models/<dataset_id>
```

The default `--target` is `dev`, which reads from `basedosdados-dev` and writes to `basedosdados-dev`. Do not specify `--target` during local development.
Every model must have a `schema.yml` entry inside its dataset directory. This file declares model metadata and all data quality tests. Custom generic tests live in `tests-dbt/generic/`.
Use `relationships` to validate foreign keys against directory tables:

```yaml
models:
  - name: dataset_id__table_id
    columns:
      - name: id_municipio
        tests:
          - relationships:
              to: ref('br_bd_diretorios_brasil__municipio')
              field: id_municipio
```

Use `dbt_utils.unique_combination_of_columns` to validate composite-key uniqueness:

```yaml
models:
  - name: dataset_id__table_id
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [col_a, col_b]
```

`not_null_proportion_multiple_columns` validates that at least a given proportion of rows are non-null across multiple columns:
```yaml
models:
  - name: dataset_id__table_id
    tests:
      - not_null_proportion_multiple_columns:
          at_least: 0.95
```

`custom_relationships` allows ignoring specific values and tolerating a proportion of unmatched rows. Always document the exceptions in the model description.
```yaml
models:
  - name: dataset_id__table_id
    description: "Table description. Exception: value '5410' is ignored in id_sh4 relationship test because ..."
    tests:
      - custom_relationships:
          to: ref('br_bd_diretorios_mundo__sistema_harmonizado')
          field: id_sh4
          ignore_values: ['5410']
          proportion_allowed_failures: 0
```

`custom_unique_combinations_of_columns` allows a proportion of duplicate key combinations. Use it sparingly — it can mask duplicate rows. Always document the exceptions in the model description.
```yaml
models:
  - name: dataset_id__table_id
    description: "Table description. Exception: up to 5% duplicate combinations allowed because ..."
    tests:
      - custom_unique_combinations_of_columns:
          combination_of_columns: [col_a, col_b]
          proportion_allowed_failures: 0.05
```

For large tables, scope tests to only the most recent rows using the `where` config. The macro `custom_get_where_subquery` (`macros/custom_get_where_subquery.sql`) detects the keywords below and replaces them with the actual most-recent values at runtime.
| Keyword | Columns used |
|---|---|
| `__most_recent_year_month__` | `ano`, `mes` |
| `__most_recent_date__` | `data` |
| `__most_recent_year__` | `ano` |
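In spirit, the macro performs a keyword-to-filter substitution. A Python sketch of the idea — the real implementation is a Jinja macro and resolves the most-recent values by querying the model, so the names and the correlated-subquery form below are simplifications:

```python
# Sketch of the substitution custom_get_where_subquery performs.
# Column names follow the keyword table above; anything else is passed
# through unchanged as a literal SQL expression.
KEYWORD_COLUMNS = {
    "__most_recent_year_month__": ["ano", "mes"],
    "__most_recent_date__": ["data"],
    "__most_recent_year__": ["ano"],
}


def resolve_where(where: str, relation: str) -> str:
    """Replace a known keyword with a max-value filter on `relation`."""
    columns = KEYWORD_COLUMNS.get(where)
    if columns is None:
        return where  # literal SQL expression, used as-is
    return " and ".join(
        f"{col} = (select max({col}) from {relation})" for col in columns
    )
```

So `__most_recent_year__` on table `t` becomes `ano = (select max(ano) from t)`, while a literal expression like `x = 1` is left untouched.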
Example usage in `schema.yml`:

```yaml
models:
  - name: dataset_id__table_id
    tests:
      - custom_unique_combinations_of_columns:
          combination_of_columns: [col_a]
          proportion_allowed_failures: 0.05
          config:
            where: __most_recent_year_month__
```

You can also pass a literal SQL expression instead of a keyword:

```yaml
config:
  where: "date_column = '2024-01-01'"
```

Run tests with:

```sh
# Single model
dbt test --select <dataset_id>__<table_id>

# All models in a dataset directory
dbt test --select models/<dataset_id>
```

- Linter: Ruff (`uv run ruff check .`) — line length 79, Python 3.10 target.
- SQL formatter: sqlfmt (`uv run sqlfmt .`) — excludes `target/`, `dbt_packages/`, `.venv/`.
- YAML formatter: yamlfix.
- Pre-commit hooks enforce all of the above automatically on commit.
- Never bypass hooks with `--no-verify`.
- Add type hints and docstrings for Python functions following Google Style.
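A short example of the expected type-hint and Google-style docstring conventions — the function itself is illustrative, not part of the repo:

```python
def to_partition_value(ano: int, mes: int) -> str:
    """Build a year-month partition value.

    Args:
        ano: Four-digit year.
        mes: Month number (1-12).

    Returns:
        The partition value formatted as 'YYYY-MM'.

    Raises:
        ValueError: If mes is outside 1-12.
    """
    if not 1 <= mes <= 12:
        raise ValueError(f"invalid month: {mes}")
    return f"{ano:04d}-{mes:02d}"
```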
To onboard a new dataset (raw data → BigQuery → metadata), spawn the onboarding agent:

```
Onboard dataset <slug>. Raw files at <path>. Drive folder: BD/Dados/Conjuntos/<slug>/.
```

The agent will run the full 10-step sequence (context → architecture → clean → upload → dbt → tests → discover → metadata → prod → PR), pausing for human approval before promoting to production.
- Never hardcode credentials or secrets. Use environment variables or Vault.
- Always use the `set_datalake_project` macro in model SQL files, except for joins, which must use production project references.
- Follow snake_case for all dataset/pipeline names.
- Run `uv run pre-commit run --all-files` after making changes to verify formatting and linting before committing.
- Do not modify `dbt_packages/` or `target/` — these are generated directories.
- Do not create a `test.py` file with real credentials — it is gitignored and for local use only.
- Document exceptions in `schema.yml` model descriptions when using `custom_relationships` or `custom_unique_combinations_of_columns` with a non-zero `proportion_allowed_failures`.
- When adding a new dataset pipeline, always run `uv run manage.py add-pipeline <name>` rather than creating files manually.
- The `dbt` CLI must be run inside the activated virtual environment: `source .venv/bin/activate`, or via `uv run dbt ...`.