| 1 | +# dbt Style Guide |
| 2 | + |
| 3 | +## Model Naming |
| 4 | +Our models (typically) fit into three main categories: staging, warehouse, and marts. For more detail about this structure, check out [the dbt best practices](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). |
| 5 | + |
| 6 | +An example of the file and naming structure: |
| 7 | +``` |
| 8 | +ssp_analytics |
| 9 | +├── .github |
| 10 | +│   ├── workflows |
| 11 | +│   │   ├── ci.yml |
| 12 | +│   │   ├── daily_refresh.yml |
| 13 | +│   │   └── post_merge_deploy.yml |
| 14 | +│   └── pull_request_template.md |
| 15 | +├── _project_docs |
| 16 | +│   ├── automation |
| 17 | +│   │   └── profiles.yml |
| 18 | +│   └── style_guide.md |
| 19 | +├── analyses |
| 20 | +├── seeds |
| 21 | +│   └── some_data.csv |
| 22 | +├── snapshots |
| 23 | +├── tests |
| 24 | +│   └── assert_some_test_scenario.sql |
| 25 | +├── macros |
| 26 | +│   ├── _macros__definitions.yml |
| 27 | +│   ├── _macros__docs.md |
| 28 | +│   └── generate_schema_name.sql |
| 29 | +├── models |
| 30 | +│   ├── marts |
| 31 | +│   │   ├── _marts__docs.md |
| 32 | +│   │   ├── _marts__models.yml |
| 33 | +│   │   └── nba_games_detail.sql |
| 34 | +│   ├── staging |
| 35 | +│   │   ├── nba |
| 36 | +│   │   │   ├── _nba__docs.md |
| 37 | +│   │   │   ├── _nba__models.yml |
| 38 | +│   │   │   ├── _nba__sources.yml |
| 39 | +│   │   │   ├── stg_nba__games.sql |
| 40 | +│   │   │   └── stg_nba__teams.sql |
| 41 | +│   │   └── gsheets |
| 42 | +│   │       ├── _gsheets__models.yml |
| 43 | +│   │       ├── _gsheets__sources.yml |
| 44 | +│   │       ├── stg_gsheets__franchise_actives.sql |
| 45 | +│   │       ├── stg_gsheets__franchise_general_managers.sql |
| 46 | +│   │       └── stg_gsheets__franchise_head_coaches.sql |
| 47 | +│   └── warehouse |
| 48 | +│       ├── dimensions |
| 49 | +│       │   ├── _dimensions__docs.md |
| 50 | +│       │   ├── _dimensions__models.yml |
| 51 | +│       │   ├── dim_calendar_dates.sql |
| 52 | +│       │   ├── dim_games.sql |
| 53 | +│       │   └── dim_teams.sql |
| 54 | +│       └── facts |
| 55 | +│           ├── _facts__docs.md |
| 56 | +│           ├── _facts__models.yml |
| 57 | +│           └── fct_games_played.sql |
| 58 | +├── README.md |
| 59 | +├── dbt_project.yml |
| 60 | +├── packages.yml |
| 61 | +└── requirements.txt |
| 62 | +``` |
| 63 | +- All objects should be plural, such as: `stg_nba__teams` |
| 64 | +- Staging models are 1:1 with each source table and named with the following convention: `stg_<source>__<table_name>.sql` |
| 65 | + - [Additional context on Staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) |
| 66 | +- Marts contain all of the useful data about a _particular entity_ at a granular level and should lean towards being wide and denormalized. |
| 67 | + - [Additional context on Marts models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) |
| 68 | +- Intermediate tables (if needed) should break apart complex or lengthy logic and use the following convention: `int_[entity]s_[verb]s.sql` |
| 69 | + - [Additional context on Intermediate models](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) |
| 70 | + |
| 71 | + |
| 72 | +## Model configuration |
| 73 | + |
| 74 | +- Model-specific attributes (like sort/dist keys) should be specified in the model. |
| 75 | +- If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file. |
| 76 | +- In-model configurations should be specified like this: |
| 77 | + |
| 78 | +```python |
| 79 | +{{ |
| 80 | +    config( |
| 81 | +        materialized = 'table', |
| 82 | +        sort = 'id', |
| 83 | +        dist = 'id' |
| 84 | +    ) |
| 85 | +}} |
| 86 | +``` |
| 87 | +- Marts should always be configured as tables |
| 88 | + |
| 89 | +## dbt conventions |
| 90 | +* Only `stg_` models (or `base_` models if your project requires them) should select from `source`s. |
| 91 | +* All other models should only select from other models. |
| 92 | + |
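
As a minimal sketch of the two rules above, assume the `nba` source declared in `_nba__sources.yml` exposes a `teams` table. The raw column names (`id`, `full_name`) are hypothetical and only illustrate the pattern: the staging model is the one place that calls `source`.

```sql
-- models/staging/nba/stg_nba__teams.sql
-- Staging model: the only layer that selects from a source.
with

source as (

    -- 'nba' / 'teams' are assumed to be declared in _nba__sources.yml
    select * from {{ source('nba', 'teams') }}

),

final as (

    select
        -- hypothetical raw columns, renamed to business terms
        id as team_id,
        full_name as team_name

    from source

)

select * from final
```

A downstream model such as `dim_teams.sql` would then select from `{{ ref('stg_nba__teams') }}` rather than from the source directly.
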
| 93 | +## Testing |
| 94 | +- Every subdirectory should contain a `.yml` file, in which each model in the subdirectory is tested. For staging folders, there will be both `_sourcename__sources.yml` and `_sourcename__models.yml`. For other folders, the structure should be `_foldername__models.yml` (for example, `_finance__models.yml`). |
| 95 | +- At a minimum, `unique` and `not_null` tests should be applied to the primary key of each model. |
| 96 | + |
| 97 | +## Naming and field conventions |
| 98 | + |
| 99 | +* Schema, table and column names should be in `snake_case`. |
| 100 | +* Use names based on the _business_ terminology, rather than the source terminology. |
| 101 | +* Each model should have a primary key. |
| 102 | +* The primary key of a model should be named `<object>_id`, e.g. `account_id` – this makes it easier to know what `id` is being referenced in downstream joined models. |
| 103 | +* For base/staging models, fields should be ordered in categories, where identifiers are first and timestamps are at the end. |
| 104 | +* Timestamp columns should be named `<event>_at`, e.g. `created_at`, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.g. `created_at_pt`. |
| 105 | +* Booleans should be prefixed with `is_` or `has_`. |
| 106 | +* Price/revenue fields should be in decimal currency (e.g. `19.99` for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with a suffix, e.g. `price_in_cents`. |
| 107 | +* Avoid reserved words as column names |
| 108 | +* Consistency is key! Use the same field names across models where possible, e.g. a key to the `customers` table should be named `customer_id` rather than `user_id`. |
| 109 | + |
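
The conventions above can be sketched in a single staging model. Everything named here is an assumption made for illustration: a hypothetical `app` source with an `orders` table whose raw columns are `id`, `user_id`, `gift`, `price_cents`, and `created`, with `created` already stored in UTC.

```sql
with

source as (

    -- hypothetical source; not part of the example project tree
    select * from {{ source('app', 'orders') }}

),

final as (

    select
        -- identifiers first; primary key named <object>_id
        id as order_id,
        user_id as customer_id,

        -- booleans prefixed with is_ or has_
        gift as is_gift,

        -- prices in decimal currency rather than integer cents
        price_cents / 100.0 as price,

        -- timestamps last, named <event>_at (assumed to be UTC)
        created as created_at

    from source

)

select * from final
```

Renaming `user_id` to `customer_id` here keeps the key consistent with the `customers` table, so downstream joins do not need to translate between names.
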
| 110 | +## CTEs |
| 111 | + |
| 112 | +For more information about why we use so many CTEs, check out [this discourse post](https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091). |
| 113 | + |
| 114 | +- All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file |
| 115 | +- Where performance permits, CTEs should perform a single, logical unit of work. |
| 116 | +- CTE names should be as verbose as needed to convey what they do |
| 117 | +- CTEs with confusing or notable logic should be commented |
| 118 | +- CTEs that are duplicated across models should be pulled out into their own models |
| 119 | +- Create a `final` or similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!) |
| 120 | +- CTEs should be formatted like this: |
| 121 | + |
| 122 | +```sql |
| 123 | +with |
| 124 | + |
| 125 | +events as ( |
| 126 | + |
| 127 | +    ... |
| 128 | + |
| 129 | +), |
| 130 | + |
| 131 | +-- CTE comments go here |
| 132 | +filtered_events as ( |
| 133 | + |
| 134 | +    ... |
| 135 | + |
| 136 | +) |
| 137 | + |
| 138 | +select * from filtered_events |
| 139 | +``` |
| 140 | + |
| 141 | +## SQL style guide |
| 142 | + |
| 143 | +- Use trailing commas |
| 144 | +- Indents should be four spaces (except for predicates, which should line up with the `where` keyword) |
| 145 | +- Lines of SQL should be no longer than [80 characters](https://stackoverflow.com/questions/29968499/vertical-rulers-in-visual-studio-code) |
| 146 | +- Field names and function names should all be lowercase |
| 147 | +- The `as` keyword should be used when aliasing a field or table |
| 148 | +- Fields should be stated before aggregates / window functions |
| 149 | +- Aggregations should be executed as early as possible before joining to another table. |
| 150 | +- Ordering and grouping by a number (e.g. `group by 1, 2`) is preferred over listing the column names (see [this rant](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/) for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design. |
| 151 | +- Prefer `union all` to `union` [*](http://docs.aws.amazon.com/redshift/latest/dg/c_example_unionall_query.html) |
| 152 | +- Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers". |
| 153 | +- If joining two or more tables, _always_ prefix your column names with the table alias. If only selecting from one table, prefixes are not needed. |
| 154 | +- Be explicit about your join (i.e. write `inner join` instead of `join`). A `left join` is normally the most useful; a `right join` often indicates that you should change which table you select `from` and which one you `join` to. |
| 155 | + |
| 156 | +- *DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE* |
| 157 | + |
| 158 | +### Example SQL |
| 159 | +```sql |
| 160 | +with |
| 161 | + |
| 162 | +my_data as ( |
| 163 | + |
| 164 | +    select * from {{ ref('my_data') }} |
| 165 | + |
| 166 | +), |
| 167 | + |
| 168 | +some_cte as ( |
| 169 | + |
| 170 | +    select * from {{ ref('some_cte') }} |
| 171 | + |
| 172 | +), |
| 173 | + |
| 174 | +some_cte_agg as ( |
| 175 | + |
| 176 | +    select |
| 177 | +        id, |
| 178 | +        sum(field_4) as total_field_4, |
| 179 | +        max(field_5) as max_field_5 |
| 180 | + |
| 181 | +    from some_cte |
| 182 | +    group by 1 |
| 183 | + |
| 184 | +), |
| 185 | + |
| 186 | +final as ( |
| 187 | + |
| 188 | +    select |
| 189 | +        my_data.field_1, |
| 190 | +        my_data.field_2, |
| 191 | +        my_data.field_3, |
| 192 | + |
| 193 | +        -- use line breaks to visually separate calculations into blocks |
| 194 | +        case |
| 195 | +            when my_data.cancellation_date is null |
| 196 | +                and my_data.expiration_date is not null |
| 197 | +                then my_data.expiration_date |
| 198 | +            when my_data.cancellation_date is null |
| 199 | +                then my_data.start_date + 7 |
| 200 | +            else my_data.cancellation_date |
| 201 | +        end as cancellation_date, |
| 202 | + |
| 203 | +        some_cte_agg.total_field_4, |
| 204 | +        some_cte_agg.max_field_5 |
| 205 | + |
| 206 | +    from my_data |
| 207 | +    left join some_cte_agg |
| 208 | +        on my_data.id = some_cte_agg.id |
| 209 | +    where my_data.field_1 = 'abc' |
| 210 | +        and ( |
| 211 | +            my_data.field_2 = 'def' or |
| 212 | +            my_data.field_2 = 'ghi' |
| 213 | +        ) |
| 215 | + |
| 216 | +) |
| 217 | + |
| 218 | +select * from final |
| 219 | + |
| 220 | +``` |
| 221 | + |
| 222 | +- Your join should list the "left" table first (i.e. the table you are selecting `from`): |
| 223 | +```sql |
| 224 | +select |
| 225 | +    trips.*, |
| 226 | +    drivers.rating as driver_rating, |
| 227 | +    riders.rating as rider_rating |
| 228 | + |
| 229 | +from trips |
| 230 | +left join users as drivers |
| 231 | +    on trips.driver_id = drivers.user_id |
| 232 | +left join users as riders |
| 233 | +    on trips.rider_id = riders.user_id |
| 234 | + |
| 235 | +``` |
| 236 | + |
| 237 | +## YAML style guide |
| 238 | + |
| 239 | +* Indents should be two spaces |
| 240 | +* List items should be indented |
| 241 | +* Use a new line to separate list items that are dictionaries where appropriate |
| 242 | +* Lines of YAML should be no longer than 80 characters. |
| 243 | + |
| 244 | +### Example YAML |
| 245 | +```yaml |
| 246 | +version: 2 |
| 247 | + |
| 248 | +models: |
| 249 | +  - name: events |
| 250 | +    columns: |
| 251 | +      - name: event_id |
| 252 | +        description: This is a unique identifier for the event |
| 253 | +        tests: |
| 254 | +          - unique |
| 255 | +          - not_null |
| 256 | + |
| 257 | +      - name: event_time |
| 258 | +        description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)" |
| 259 | +        tests: |
| 260 | +          - not_null |
| 261 | + |
| 262 | +      - name: user_id |
| 263 | +        description: The ID of the user who recorded the event |
| 264 | +        tests: |
| 265 | +          - not_null |
| 266 | +          - relationships: |
| 267 | +              to: ref('users') |
| 268 | +              field: id |
| 269 | +``` |
| 270 | + |
| 271 | + |
| 272 | +## Jinja style guide |
| 273 | + |
| 274 | +* When using Jinja delimiters, use spaces on the inside of your delimiter, like `{{ this }}` instead of `{{this}}` |
| 275 | +* Use newlines to visually indicate logical blocks of Jinja |
| 276 | + |
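
A short sketch of both Jinja rules, combined with the CTE layout used elsewhere in this guide. The model name `stg_app__payments` and its columns (`order_id`, `payment_method`, `amount`) are hypothetical; the pattern of looping over a list to generate one column per value follows the common dbt pivot idiom.

```sql
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

with

payments as (

    -- hypothetical staging model
    select * from {{ ref('stg_app__payments') }}

),

final as (

    select
        order_id,

        {% for payment_method in payment_methods %}
        sum(
            case
                when payment_method = '{{ payment_method }}' then amount
            end
        ) as {{ payment_method }}_amount,
        {% endfor %}

        sum(amount) as total_amount

    from payments
    group by 1

)

select * from final
```

Each delimiter has a space on the inside, and the `set` and `for` blocks are separated from the surrounding SQL by blank lines so the generated columns read as one logical unit. Ending the select list with `total_amount` avoids having to suppress the trailing comma on the last loop iteration.
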
| 277 | + |
| 278 | +## Helpful Reference Links |
| 279 | +* https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview |
| 280 | +* https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091 |
| 281 | +* https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/ |
| 282 | +* https://docs.getdbt.com/docs/about/viewpoint |
| 283 | +* https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md |