Commit 8d6a232

author
Student
committed
add style guide
1 parent 1984798 commit 8d6a232

1 file changed

+283
-0
lines changed

_project_docs/style_guide.md

Lines changed: 283 additions & 0 deletions
@@ -0,0 +1,283 @@
# dbt Style Guide

## Model Naming
Our models (typically) fit into three main categories: staging, warehouse, marts. For more detail about aspects of this structure, check out [the dbt best practices](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview).

The file and naming structures are as follows (example):
```
ssp_analytics
├── .github
│   ├── workflows
│   │   ├── ci.yml
│   │   ├── daily_refresh.yml
│   │   └── post_merge_deploy.yml
│   └── pull_request_template.md
├── _project_docs
│   ├── automation
│   │   └── profiles.yml
│   └── style_guide.md
├── analyses
├── seeds
│   └── some_data.csv
├── snapshots
├── tests
│   └── assert_some_test_scenario.sql
├── macros
│   ├── _macros__definitions.yml
│   ├── _macros__docs.md
│   └── generate_schema_name.sql
├── models
│   ├── marts
│   │   ├── _marts__docs.md
│   │   ├── _marts__models.yml
│   │   └── nba_games_detail.sql
│   ├── staging
│   │   ├── nba
│   │   │   ├── _nba__docs.md
│   │   │   ├── _nba__models.yml
│   │   │   ├── _nba__sources.yml
│   │   │   ├── stg_nba__games.sql
│   │   │   └── stg_nba__teams.sql
│   │   └── gsheets
│   │       ├── _gsheets__models.yml
│   │       ├── _gsheets__sources.yml
│   │       ├── stg_gsheets__franchise_actives.sql
│   │       ├── stg_gsheets__franchise_general_managers.sql
│   │       └── stg_gsheets__franchise_head_coaches.sql
│   ├── warehouse
│   │   ├── dimensions
│   │   │   ├── _dimensions__docs.md
│   │   │   ├── _dimensions__models.yml
│   │   │   ├── dim_calendar_dates.sql
│   │   │   ├── dim_games.sql
│   │   │   └── dim_teams.sql
│   │   └── facts
│   │       ├── _facts__docs.md
│   │       ├── _facts__models.yml
│   │       └── fct_games_played.sql
├── README.md
├── dbt_project.yml
├── packages.yml
└── requirements.txt
```
- All objects should be plural, such as: `stg_nba__teams`
- Staging models are 1:1 with each source table and named with the following convention: `stg_<source>__<table_name>.sql`
    - [Additional context on Staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging)
- Marts contain all of the useful data about a _particular entity_ at a granular level and should lean towards being wide and denormalized.
    - [Additional context on Marts models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts)
- Intermediate tables (if needed) should help break apart complex or lengthy logic and use the following convention: `int_[entity]s_[verb]s.sql`
    - [Additional context on Intermediate models](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate)

## Model configuration

- Model-specific attributes (like sort/dist keys) should be specified in the model.
- If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file.
- In-model configurations should be specified like this:

```python
{{
  config(
    materialized = 'table',
    sort = 'id',
    dist = 'id'
  )
}}
```
- Marts should always be configured as tables

## dbt conventions
* Only `stg_` models (or `base_` models if your project requires them) should select from `source`s.
* All other models should only select from other models.
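
As a minimal sketch of the rule above, a downstream model such as `dim_teams.sql` would read from the staging model through `ref`, while only `stg_nba__teams` itself would read from something like `{{ source('nba', 'teams') }}`. The source name `nba`, the table name `teams`, and the column names below are assumptions made for illustration; they are not taken from the project.

```sql
-- Sketch of models/warehouse/dimensions/dim_teams.sql.
-- Column names are hypothetical.
with

teams as (

    -- non-staging models select from other models via ref(), not source()
    select * from {{ ref('stg_nba__teams') }}

),

final as (

    select
        team_id,
        team_name

    from teams

)

select * from final
```

Keeping `source` calls inside staging models means a change to a raw table name should only need to be made in one place.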

## Testing
- Every subdirectory should contain a `.yml` file, in which each model in the subdirectory is tested. For staging folders, there will be both `_sourcename__sources.yml` and `_sourcename__models.yml`. For other folders, the structure should be `_foldername__models.yml` (for example, `_finance__models.yml`).
- At a minimum, `unique` and `not_null` tests should be applied to the primary key of each model.

## Naming and field conventions

* Schema, table and column names should be in `snake_case`.
* Use names based on the _business_ terminology, rather than the source terminology.
* Each model should have a primary key.
* The primary key of a model should be named `<object>_id`, e.g. `account_id` – this makes it easier to know what `id` is being referenced in downstream joined models.
* For base/staging models, fields should be ordered in categories, where identifiers are first and timestamps are at the end.
* Timestamp columns should be named `<event>_at`, e.g. `created_at`, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.g. `created_at_pt`.
* Booleans should be prefixed with `is_` or `has_`.
* Price/revenue fields should be in decimal currency (e.g. `19.99` for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with a suffix, e.g. `price_in_cents`.
* Avoid reserved words as column names.
* Consistency is key! Use the same field names across models where possible, e.g. a key to the `customers` table should be named `customer_id` rather than `user_id`.
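
The conventions above could be applied in a staging model roughly as follows. This is only a sketch: the `app` source, the `accounts` table, and every raw column name are hypothetical, and the raw `created` timestamp is assumed to already be stored in UTC.

```sql
-- Hypothetical staging model, e.g. stg_app__accounts.sql.
-- The source, table, and raw column names are illustrative only.
with

source as (

    select * from {{ source('app', 'accounts') }}

),

renamed as (

    select
        -- identifiers first; primary key named <object>_id
        id as account_id,
        user_id as customer_id,

        -- booleans prefixed with is_ or has_
        active as is_active,

        -- integer cents converted to decimal currency
        price_in_cents / 100.0 as price,

        -- timestamps last, named <event>_at (assumed UTC)
        created as created_at

    from source

)

select * from renamed
```

Doing the renames once in staging lets downstream models use the business-facing names without repeating the mapping.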

## CTEs

For more information about why we use so many CTEs, check out [this discourse post](https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091).

- All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file
- Where performance permits, CTEs should perform a single, logical unit of work.
- CTE names should be as verbose as needed to convey what they do
- CTEs with confusing or notable logic should be commented
- CTEs that are duplicated across models should be pulled out into their own models
- Create a `final` or similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!)
- CTEs should be formatted like this:

```sql
with

events as (

    ...

),

-- CTE comments go here
filtered_events as (

    ...

)

select * from filtered_events
```

## SQL style guide

- Use trailing commas
- Indents should be four spaces (except for predicates, which should line up with the `where` keyword)
- Lines of SQL should be no longer than [80 characters](https://stackoverflow.com/questions/29968499/vertical-rulers-in-visual-studio-code)
- Field names and function names should all be lowercase
- The `as` keyword should be used when aliasing a field or table
- Fields should be stated before aggregates / window functions
- Aggregations should be executed as early as possible before joining to another table.
- Ordering and grouping by a number (e.g. `group by 1, 2`) is preferred over listing the column names (see [this rant](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/) for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design.
- Prefer `union all` to `union` [*](http://docs.aws.amazon.com/redshift/latest/dg/c_example_unionall_query.html)
- Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
- If joining two or more tables, _always_ prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
- Be explicit about your join (i.e. write `inner join` instead of `join`). A `left join` is normally the most useful; a `right join` often indicates that you should change which table you select `from` and which one you `join` to.

- *DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE*

### Example SQL
```sql
with

my_data as (

    select * from {{ ref('my_data') }}

),

some_cte as (

    select * from {{ ref('some_cte') }}

),

some_cte_agg as (

    select
        id,
        sum(field_4) as total_field_4,
        max(field_5) as max_field_5

    from some_cte
    group by 1

),

final as (

    select [distinct]
        my_data.field_1,
        my_data.field_2,
        my_data.field_3,

        -- use line breaks to visually separate calculations into blocks
        case
            when my_data.cancellation_date is null
                and my_data.expiration_date is not null
                then my_data.expiration_date
            when my_data.cancellation_date is null
                then my_data.start_date + 7
            else my_data.cancellation_date
        end as cancellation_date,

        some_cte_agg.total_field_4,
        some_cte_agg.max_field_5

    from my_data
    left join some_cte_agg
        on my_data.id = some_cte_agg.id
    where my_data.field_1 = 'abc'
        and (
            my_data.field_2 = 'def' or
            my_data.field_2 = 'ghi'
        )
    having count(*) > 1

)

select * from final

```

- Your join should list the "left" table first (i.e. the table you are selecting `from`):
```sql
select
    trips.*,
    drivers.rating as driver_rating,
    riders.rating as rider_rating

from trips
left join users as drivers
    on trips.driver_id = drivers.user_id
left join users as riders
    on trips.rider_id = riders.user_id

```

## YAML style guide

* Indents should be two spaces
* List items should be indented
* Use a new line to separate list items that are dictionaries where appropriate
* Lines of YAML should be no longer than 80 characters.

### Example YAML
```yaml
version: 2

models:
  - name: events
    columns:
      - name: event_id
        description: This is a unique identifier for the event
        tests:
          - unique
          - not_null

      - name: event_time
        description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)"
        tests:
          - not_null

      - name: user_id
        description: The ID of the user who recorded the event
        tests:
          - not_null
          - relationships:
              to: ref('users')
              field: id
```


## Jinja style guide

* When using Jinja delimiters, use spaces on the inside of your delimiter, like `{{ this }}` instead of `{{this}}`
* Use newlines to visually indicate logical blocks of Jinja
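
As a rough illustration of both points, the sketch below pivots payment amounts with a Jinja loop. The model name `stg_stripe__payments`, its columns, and the list of payment methods are all hypothetical and stand in for whatever your project actually contains.

```sql
-- Hypothetical model: stg_stripe__payments and its columns are illustrative.
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

with

payments as (

    select * from {{ ref('stg_stripe__payments') }}

),

final as (

    select
        order_id,

        {% for payment_method in payment_methods %}
        sum(
            case
                when payment_method = '{{ payment_method }}' then amount
                else 0
            end
        ) as {{ payment_method }}_amount{% if not loop.last %},{% endif %}

        {% endfor %}

    from payments
    group by 1

)

select * from final
```

Each delimiter has a space on the inside, and the blank lines around the `set` statement and the loop body separate the Jinja logic from the surrounding SQL.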


## Helpful Reference Links
* https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview
* https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091
* https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/
* https://docs.getdbt.com/docs/about/viewpoint
* https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md
