diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..f94a45fe --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,281 @@ +# Day 1 - Lab + + + +We’re gonna be working with this table. + +```sql +SELECT * FROM player_seasons; +``` + +The problem with this table is that the player has duplicate for each seasons. + +![image.png](images/d1la_image.png) + +Due to this temporal problem, if we were to join it with something downstream, it would cause shuffling to happen and we would lose compression. + +We wanna create a table with an array of all their seasons and related season stats, i.e. all those dimensions that change season by season. + +So player dimensions (height, name, college etc…) we keep in the base table as normal attributes because those don’t change through time. + +We create a `season_stats` struct type, that will be the item of the “seasons array”. In other words, each player will have an array of `season_stats`. + +```sql +CREATE TYPE season_stats AS +( + season integer, + gp integer, -- games played + pts real, -- points + reb real, -- rebounds + ast real -- assists +); +``` + +The next step is to create the `players` table. + +```sql +CREATE TABLE IF NOT EXISTS players +( + player_name text, + height text, + college text, + country text, + draft_year text, + draft_round text, + draft_number text, + season_stats season_stats[], -- !!! the ARRAY !!! + current_season integer, + PRIMARY KEY (player_name, current_season) +) +``` + +Notice we add a `current_season` column, this is because **we are building this table cumulatively**, so as we do the outer join, `current_season` will be whatever the **latest value** in the seasons table is (will make more sense later). + +Now we’re gonna work on the `FULL OUTER JOIN` logic. + +```sql +WITH +yesterday AS ( + SELECT * FROM players + WHERE current_season = 1995 -- 1 year before 1st season available +), + +today AS ( + SELECT * FROM player_seasons + WHERE season = 1996 +) + +-- this will give us the **cumulation** between "today" and "yesterday". + +SELECT * +FROM today t +FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name +``` + +Running this query you will notice that all values from `yesterday` are `NULL`. + +So we want to `COALESCE` the values that are not temporal, i.e. that don’t change through time. + +```sql +-- [..] + +SELECT + COALESCE(t.player_name, y.player_name) AS player_name, + COALESCE(t.height, y.height) AS height, + COALESCE(t.college, y.college) AS college, + COALESCE(t.country, y.country) AS country, + COALESCE(t.draft_year, y.draft_year) AS draft_year, + COALESCE(t.draft_round, y.draft_round) AS draft_round, + COALESCE(t.draft_number, y.draft_number) AS draft_number +FROM today t +FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name +``` + +Obviously this doesn’t do a lot yet, but it’s the basis of cumulation, we we start building the `seasons` array. + +```sql +-- [..] 
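-- (the "[..]" above elides the WITH CTEs and the earlier COALESCEd columns
-- from the previous query; the lines below continue that same SELECT list,
-- adding the season_stats array logic)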
+COALESCE(t.draft_number, y.draft_number) AS draft_number, +CASE + WHEN y.season_stats IS NULL -- if yesterday has no stats we create + THEN ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + WHEN t.season IS NOT NULL -- if player is still playing we append + THEN y.season_stats || ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + ELSE y.season_stats -- if player is not playing anymore we don't add anything +END AS season_stats, + +-- this is gonna give us the current season value, as it takes either +-- the current season (t.season) or "yesterday's" season (y.) plus one. +COALESCE(t.season, y.current_season + 1) AS current_season; +``` + +We now turn this into an insert statement, to start the cumulation. + +```sql +INSERT INTO players + +WITH +yesterday AS ( + SELECT * FROM players + WHERE current_season = 1995 +), + +today AS ( + SELECT * FROM player_seasons + WHERE season = 1996 +) + +SELECT + COALESCE(t.player_name, y.player_name) AS player_name, + COALESCE(t.height, y.height) AS height, + COALESCE(t.college, y.college) AS college, + COALESCE(t.country, y.country) AS country, + COALESCE(t.draft_year, y.draft_year) AS draft_year, + COALESCE(t.draft_round, y.draft_round) AS draft_round, + COALESCE(t.draft_number, y.draft_number) AS draft_number + CASE + WHEN y.season_stats IS NULL -- if yesterday has no stats we create + THEN ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + WHEN t.season IS NOT NULL -- if player is still playing we append + THEN y.season_stats || ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + ELSE y.season_stats -- if player is not playing anymore we don't add anything + END AS season_stats, + COALESCE(t.season, y.current_season + 1) AS current_season; + +FROM today t +FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name +``` + +If you know repeat this operation bumping the year by 1 each time, you will start cumulating the values, season by season. For instance, this is what you’re going to see after 2 cumulations (when `current_season = 1997`) + +![image.png](images/d1la_image%201.png) + +See the player highlighted in yellow, only has 1 element in the array? It’s because he joined in 1997, and didn’t play in 1996. + +The table `players` can easily be turned back into `player_seasons`. + +```sql +WITH + +unnested AS ( + SELECT + player_name, + UNNEST(season_stats)::season_stats as season_stats + FROM players + WHERE current_season = 2001 + AND player_name = 'Michael Jordan' +) + +SELECT player_name, (season_stats::season_stats).* +FROM unnested; +``` + +![image.png](images/d1la_image%202.png) + +Another secondary benefit of this is that when you unnest this, it’s already gonna be sorted, which can take advantage of run length encoding compression. Try removing ‘Michael Jordan’ from the filter. + +![image.png](images/d1la_image%203.png) + +Let’s now drop the players table so we can do some extra stuff. We want to create a `scoring_class` column that is based on the points a player scores. 
+ +```sql +DROP TABLE players + +CREATE TYPE scoring_class AS ENUM('star', 'good', 'average', 'bad'); + +CREATE TABLE IF NOT EXISTS players +( + player_name text, + height text, + college text, + country text, + draft_year text, + draft_round text, + draft_number text, + season_stats season_stats[], + scoring_class scoring_class, + years_since_last_season integer, + current_season integer, + PRIMARY KEY (player_name, current_season) +); +``` + +Now to our previous cumulative query we need to add 2 more columns, right before `current_season`. + +```sql +-- [..] + +CASE + WHEN t.season IS NOT NULL THEN -- if they're active this season, give them score + CASE WHEN t.pts > 20 THEN 'star' + WHEN t.pts > 15 THEN 'good' + WHEN t.pts > 10 THEN 'average' + ELSE 'bad' + END::scoring_class + ELSE y.scoring_class -- else keep previous score +END AS scoring_class, + +CASE + WHEN t.season IS NOT NULL THEN 0 -- if they're active, then 0 years since last season + ELSE y.years_since_last_season + 1 +END AS years_since_last_season, + +COALESCE(t.season, y.current_season + 1) AS current_season + +FROM today t +FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name; +``` + +Then repeat the same thing we did above, cumulating the table by running the query year by year until 2001 or so. + +Let’s now run some analytics: + +```sql +SELECT + player_name, + season_stats[CARDINALITY(season_stats)].pts + / CASE WHEN season_stats[1].pts = 0 THEN 1 ELSE season_stats[1].pts END + AS improvement +FROM players +WHERE current_season = 2001 +ORDER BY 2 DESC +``` + +This gives us which player has improved the most from their 1st season to their latest one. + +Notice that this query doesn’t have a `GROUP BY`, whereas normally this would be done by doing some kind of aggregation, a `min` and `max` probably. This one has none of this, which means it is insanely fast (if you exclude the `ORDER BY`) and has **NO SHUFFLE**. + +In ‘map-reduce’ terms, this query can be done exclusively with a **map** step, and no reduce, which means it can be parallelized infinitely. diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..619419e3 --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,169 @@ +# Day 1 - Lecture + +# Intro + +Dimensions are attributes of an entity. Some of them may IDENTIFY an entity (e.g. user’s ID). Other dimensions are just attributes. + +Dimensions come in two flavors + +- Slowly-changing → May change through time, e.g. favorite food. +- Fixed → Can’t change through time, e.g. birth date. + + + +# Knowing your customer + +First thing when modeling data is to think “who’s gonna use this?”. It’s an exercise in empathy. + +- **Data analysts / data scientists** → Easy data to query. Flat. Not many complex data types. Don’t wanna make their job hard with these datasets. +- **Other data engineers** → Should be compact and probably harder to query. Nested types are ok. + - *Master data set* → a dataset upon which many other are built. +- **ML models** → depends on the model and how its trained. But most times it’s ID and flat data. +- **Customers** → Should be very easy to interpret. Charts. + +In short, you have to understand how the data is being used. Otherwise, you’re gonna waste a lot of time and money. 
# OLTP vs master data vs OLAP

- **OLTP** (online transaction processing) → mostly outside the realm of data engineers, you know what this is. Optimized for low-latency, low-volume queries.
- **OLAP** (online analytical processing) → the most common data modeling for data engineers. Optimized for large-volume, GROUP BY queries; minimizes JOINs.

Obviously these two models are opposites of each other and serve incompatible functions. This is where *master data* comes in to help.

- **Master data** → Your entities, deduped and with very complete definitions. Sits in the middle between OLTP and OLAP.

## Mismatching needs = less business value

Some of the biggest problems in DE occur when data is modeled for the wrong customer.

E.g. modeling an OLTP system like OLAP would make it very slow and inefficient, as the queries would fetch way more data than necessary. The opposite case is also inefficient, but for different reasons: you end up with a lot of JOINs and shuffles that are expensive and slow.

Master data can be plugged in the middle to help make the transition between OLTP and OLAP.

```mermaid
flowchart LR;
    id1[Prod DB Snapshots] --> id2[Master Data] --> id3[OLAP Cubes] --> id4[Metrics]
```

- Prod DB → 40 normalized tables
- Master data → Unified table, still normalized, and deduped. Very complete.
- OLAP Cubes → Flatten the data out, multiple rows per entity. This is where one does aggregates, GROUP BYs, etc…
- Metrics → Aggregate an OLAP Cube even further and get 1 number.

# Cumulative Table Design

One thing that can happen when building master data is that on some days not every user is gonna show up *[this is assuming that our data effectively includes users — Ed.]*.

Master data should keep all history. A cumulative table is all about holding onto all the dimensions that ever existed (maybe up until a point).

**Core components**

- 2 dataframes (yesterday and today)
- `FULL OUTER JOIN` the two data frames together → The reason is that yesterday’s data might not be present in today’s data and vice versa. With this, you get the whole set.
- `COALESCE` values to keep everything.
- This way you can hang onto all of history.

**Usages**

- **Growth analytics** at Facebook (`dim_all_users`). Also used as the master source for all users’ data.
- **State transition tracking** → in practice, it keeps track of the state of a user across time. E.g. was active, now inactive: *churned*. Was inactive, now active: *resurrected*, and so on.

![image.png](images/d1le_image.png)

If you’re starting cumulation today, yesterday will be NULL. Otherwise, it contains all the user history from when you started cumulation.

In the case of Facebook, for instance, you can kick out of the table a user that has been inactive for the past 180 days (this is just ONE of the possible filters).

Pruning the table is necessary, otherwise the size of this table would spiral out of control when dealing with sizes such as Facebook’s.

This kind of table allows you to easily calculate cumulative metrics, like `days_since_last_active`: each time you compute the cumulative table, you can just add `+1` to that field if the user didn’t show up that day.
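As a rough illustration of this pattern (a minimal sketch assuming hypothetical `users_cumulated` and `users_daily` tables — not Facebook’s actual schema), one run of the cumulation with the `days_since_last_active` bump could look like this:

```sql
-- Minimal sketch: "yesterday" is the cumulative table built so far, "today" is the
-- new daily partition. Table and column names are hypothetical.
WITH yesterday AS (
    SELECT * FROM users_cumulated WHERE snapshot_date = DATE '2023-01-01'
),

today AS (
    SELECT * FROM users_daily WHERE activity_date = DATE '2023-01-02'
)

SELECT
    COALESCE(t.user_id, y.user_id) AS user_id,
    CASE
        WHEN t.user_id IS NOT NULL THEN 0          -- active today, reset the counter
        ELSE y.days_since_last_active + 1          -- not seen today, carry forward and bump
    END AS days_since_last_active,
    DATE '2023-01-02' AS snapshot_date
FROM today t
FULL OUTER JOIN yesterday y
    ON t.user_id = y.user_id;
```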
+ +**Strenghts** + +- Historical analysis without shuffles (GROUP BYs, JOINs etc…) +- Easy “transition” analysis + +**Drawbacks** + +- Can only be backfilled sequentially +- Handling PII data can be a mess since deleted/inactive users get carried forward + +# Compactness vs usability tradeoff + +- **The most usable tables** → Straightforward. Very easy to manipulate with WHERE and GROUP BY. No complex data types. +- **The most compact tables** → Not human readable. Usually they have an ID and some blob of data to keep it as compressed as possible, and can’t really be queried until they’re decoded. +- **The middle-ground tables** → Use complex data types (e.g. ARRAY, MAP, STRUCT) making querying trickier but also compacting more. + +Each of them have their own use case. The most usable tables are analytics focused, whereas the most compact ones are more SWE / production data focused. + +## When would you use each type? + +- **Most compact** + - Online systems where latency and data volumes matter a lot. Consumers are usually highly technical. +- **Middle ground** + - Upstream staging / master data where the majority of consumers are other data engineers (that might be creating other tables for other people) +- **Most usable** + - When analytics is the main consumer and the majority of consumers are less technical + +## Struct vs Array vs Map + +- **Struct** → Like a table inside a table + - Keys are rigidly defined, compression is good! + - Values can be any type +- **Map** + - Keys are loosely defined, compression is ok! + - All values have to be the same type +- **Array** + - Ordinal + - List of values that all have to be the same type (but these could be also Maps of Structs) + +# Temporal Cardinality Explosions of Dimensions + +When you add a temporal aspect to your dimensions and the cardinality increases by at least 1 order of magnitude. + +Example: In AirBnb you have 6 million “listings”, but a listing has a calendar and the calendar has “nights”, so you end up having a `listing_night`, which is its own entity, in some way. + +How do you model this? As 6 million listings? Or 2 billion nights? + +- If we want to know the nightly pricing pricing and availability of each night for the next year, that’s 365 * 6 million or about ~2 billion nights. + +Should this dataset be: + +- Listing-level with an array of nights? +- Listhing_night level with 2 billion rows? + +By **doing the sorting right,** Parquet will keep these two about the same size. + +## Badness of denormalized temporal dimensions + +If you explode it out (the previous example, i.e. at the `listing_night` lvl) and need to join other dimensions, shuffle will ruin your compression, because run-length encoding won’t work well in this case. + +![image.png](images/d1le_image%201.png) + +Same data, run-length encoded: + +![image.png](images/d1le_image%202.png) + +After a join (Spark Shuffle): + +![image.png](images/d1le_image%203.png) + +**Two ways to solve this problem:** + +1. After joining this dataset, re-sort it. *[Zach doesn’t recommend this one, says only sort you data once, and then if someone else does it instead of you, you have to tell them to resort the data, not convenient. — Ed.]* +2. Instead of having all these player names and seasons broken out on separate rows, have 1 row per player name, and then an array of seasons. Here we can join on player name, and then AFTER the join, we can explode out the seasons array, and it keeps the sorting! + +> Remember that Spark Shuffle fucks with the ordering of the data. 
It’s good for join but can mess with your table size. +> \ No newline at end of file diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lab.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lab.md new file mode 100644 index 00000000..3cd9d054 --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lab.md @@ -0,0 +1,428 @@ +# Day 2 - Lab + +In this lab, we will take the datasets created in first lab and convert them in SCDs of type 2. + +First we need to recreate the `players` table because Zach forgot to add something during first lab. + +```sql +DROP TABLE players; + +CREATE TABLE players +( + player_name text, + height text, + college text, + country text, + draft_year text, + draft_round text, + draft_number text, + season_stats season_stats[], + scoring_class scoring_class, + years_since_last_season integer, + current_season integer, + is_active boolean, -- ADDED COLUMN + PRIMARY KEY (player_name, current_season) +); + +-- All this crazy query does is it recreates the `player` table +-- as we had it in lab 1 and adds the `is_active` column. +-- Just copy paste it and run it +INSERT INTO players +WITH +years AS ( + SELECT * + FROM GENERATE_SERIES(1996, 2021) AS season +), + +p AS ( + SELECT + player_name, + MIN(season) AS first_season + FROM player_seasons + GROUP BY player_name +), + +players_and_seasons AS ( + SELECT * + FROM p + JOIN years y + ON p.first_season <= y.season +), + +windowed AS ( + SELECT + pas.player_name, + pas.season, + ARRAY_REMOVE( + ARRAY_AGG( + CASE + WHEN ps.season IS NOT NULL + THEN ROW( + ps.season, + ps.gp, + ps.pts, + ps.reb, + ps.ast + )::season_stats + END) + OVER (PARTITION BY pas.player_name ORDER BY COALESCE(pas.season, ps.season)), + NULL + ) AS seasons + FROM players_and_seasons pas + LEFT JOIN player_seasons ps + ON pas.player_name = ps.player_name + AND pas.season = ps.season + ORDER BY pas.player_name, pas.season +), + +static AS ( + SELECT + player_name, + MAX(height) AS height, + MAX(college) AS college, + MAX(country) AS country, + MAX(draft_year) AS draft_year, + MAX(draft_round) AS draft_round, + MAX(draft_number) AS draft_number + FROM player_seasons + GROUP BY player_name +) + +SELECT + w.player_name, + s.height, + s.college, + s.country, + s.draft_year, + s.draft_round, + s.draft_number, + seasons AS season_stats, + CASE + WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 20 THEN 'star' + WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 15 THEN 'good' + WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 10 THEN 'average' + ELSE 'bad' + END::scoring_class AS scoring_class, + w.season - (seasons[CARDINALITY(seasons)]::season_stats).season AS years_since_last_active, + w.season, + (seasons[CARDINALITY(seasons)]::season_stats).season = season AS is_active +FROM windowed w +JOIN static s + ON w.player_name = s.player_name; +``` + +# SCD2 - full scan + +Let’s now create the SCD table for the players. We want to track changes in two columns: `scoring_class` and `is_active`. 
+ +```sql +CREATE TABLE players_scd ( + player_name text, + scoring_class scoring_class, + is_active boolean, + start_season integer, + end_season integer, + current_season integer, -- this can be thought of the "date" partition + PRIMARY KEY (player_name, start_season) +); + +SELECT + player_name, + current_season, + scoring_class, + is_active, + LAG(scoring_class, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_scoring_class, + LAG(is_active, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_is_active +FROM players; +``` + +The window functions allow us to check the value of the previous season for both `scoring_class` and `is_active`. We’re building our SCD2 on based on this logic. + +We now create an indicator of whether or not one of `scoring_class` or `is_active` has changed. + +```sql +WITH +with_previous as ( + SELECT + player_name, + current_season, + scoring_class, + is_active, + LAG(scoring_class, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_scoring_class, + LAG(is_active, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_is_active + FROM players +) + +SELECT + *, + CASE + WHEN scoring_class <> previous_scoring_class THEN 1 + ELSE 0 + END AS scoring_class_change_indicator, + CASE + WHEN is_active <> previous_is_active THEN 1 + ELSE 0 + END AS is_active_change_indicator +FROM with_previous; +``` + +Let’s add this last part of the query to another CTE, and combine the two indicators into a single one, so that tracking changes becomes easier. + +```sql +-- [..] + +with_indicators AS ( + SELECT + *, + CASE + WHEN scoring_class <> previous_scoring_class THEN 1 + WHEN is_active <> previous_is_active THEN 1 + ELSE 0 + END AS change_indicator + FROM with_previous +) + +SELECT + *, + SUM(change_indicator) OVER(PARTITION BY player_name ORDER BY current_season) AS streak_identifier +FROM with_indicators; +``` + +This `streak_identifier` shows how long a player stayed the same value, over time, and also when some dimension (either `scoring_class` or `is_active`) changes. + +We add another CTE + +```sql +-- [..] + +with_streaks AS ( + SELECT + *, + SUM(change_indicator) OVER(PARTITION BY player_name ORDER BY current_season) AS streak_identifier + FROM with_indicators +) + +SELECT + player_name, + scoring_class, + is_active, + MIN(current_season) AS start_season, + MAX(current_season) AS end_season, + -- 👇 imagine in some sort of pipeline this is a parameter you inject + -- [not exactly sure what Zach means here -- Ed.] + 2021 AS current_season +FROM with_streaks +GROUP BY player_name, streak_identifier, is_active, scoring_class +ORDER BY player_name, streak_identifier +``` + +You can already see from the results that the table is almost done. For each player, every row represents a “streak” where its dimensions (`is_active` and `scoring_class`) were constant. + +The “duration” of each row (or in other words, its validity range) can be arbitrarily large (see highlighted row in screenshot). + +![image.png](images/d2la_image.png) + +So, now that we have built out the SCD2, let’s actually add it to the table. 
We just need to append `INSERT INTO` to the big ass query we just wrote: + +```sql +INSERT INTO players_scd + +WITH +with_previous as ( + SELECT + player_name, + current_season, + scoring_class, + is_active, + LAG(scoring_class, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_scoring_class, + LAG(is_active, 1) OVER(PARTITION BY player_name ORDER BY current_season) AS previous_is_active + FROM players + **WHERE current_season <= 2021** + -- we add this filter so that we can use 2022 in the incremental build +), + +with_indicators AS ( + SELECT + *, + CASE + WHEN scoring_class <> previous_scoring_class THEN 1 + WHEN is_active <> previous_is_active THEN 1 + ELSE 0 + END AS change_indicator + FROM with_previous +), + +with_streaks AS ( + SELECT + *, + SUM(change_indicator) OVER(PARTITION BY player_name ORDER BY current_season) AS streak_identifier + FROM with_indicators +) + +SELECT + player_name, + scoring_class, + is_active, + MIN(current_season) AS start_season, + MAX(current_season) AS end_season, + 2021 AS current_season +FROM with_streaks +GROUP BY player_name, streak_identifier, is_active, scoring_class; +``` + +This query has some expensive parts that Zach doesn’t like: + +- Window functions on the entire dataset +- Aggregation at the end +- Scans all history every time + +Working with dimensions however, you can do crazy stuff like this because dimensional data is, in general, quite small w.r.t. fact data, so even re-scanning all table every time is quite legit, and probably more convenient than dealing with the complexities of the alternative. + +The alternative approach is an incremental build, which however is more prone to OOM, skew and other problems. E.g. imagine one guy that has a change every single time, or in other words, some people that are not as “slowly changing” as others, you end up having with many streaks for that specific user, and it blows up the cardinality of the final table. + +# SCD2 - incremental + +This time, we create the SCD table but incrementally. + +What Zach is doing here is showing how the incremental query can be built, by taking advantage of the pre-existing SCD2 table. + +Note that + +```sql +SELECT max(end_season) FROM players_scd; +``` + +returns `2021`. So basically, our SCD table latest changes are at most from 2021 (thanks to the filter we added above). In this new query, we’re adding data from 2022, this time incrementally. + +This query is built as a union of different CTEs, each representing a different piece of the data. Let’s take a brief look: + +- **historical_scd** +All historical records for all players up to the 2nd to last one (i.e., up to 2020 in our case) +- **unchanged_records** +All records (from last period, 2021) that didn’t change, with the valid end date (`end_season`) bumped by one +- **unnested_changed_records** +All records for players where the dimensions changed. These include both last period (2021) as well as new period (2022). That’s because in the `historical_scd`, we don’t include the last period. +- **new_records** +This is quite self explanatory. All records that didn’t exist in the last period (2021) but only in the new one (2022). These are basically the new players. 
+ +```sql +-- we start by creating a TYPE, but this is only necessary in postgres +CREATE TYPE scd_type AS ( + scoring_class scoring_class, + is_active boolean, + start_season integer, + end_season integer +) + +-- this is the actual incremental query +WITH +last_season_scd AS ( -- latest, current record for all players + SELECT * FROM players_scd + WHERE current_season = 2021 + AND end_season = 2021 +), + +historical_scd AS ( -- historical records for all players, one per period + SELECT + player_name, + scoring_class, + is_active, + start_season, + end_season + FROM players_scd + WHERE current_season = 2021 + AND end_season < 2021 +), + +this_season_data AS ( -- new incoming data + SELECT * FROM players + WHERE current_season = 2022 +), + +unchanged_records AS ( -- records that didn't change between new data and latest records + SELECT + ts.player_name, + ts.scoring_class, + ts.is_active, + ls.start_season, + ts.current_season AS end_season + -- for these records we increase `current_season` by 1 + -- or in other words, we increase the range of the validity period + -- HINT: read `start_season` and `end_season` as `valid_from`, `valid_to` + FROM this_season_data ts + JOIN last_season_scd ls + ON ls.player_name = ts.player_name + WHERE ts.scoring_class = ls.scoring_class + AND ts.is_active = ls.is_active +), + +-- players with changed data +-- this one has 2 records per player +-- one for this period, one for previous (in this case 2022 and 2021) +changed_records AS ( + SELECT + ts.player_name, + UNNEST(ARRAY[ + ROW( + ls.scoring_class, + ls.is_active, + ls.start_season, + ls.end_season + )::scd_type, + ROW( + ts.scoring_class, + ts.is_active, + ts.current_season, + ts.current_season + )::scd_type + ]) AS records + FROM this_season_data ts + LEFT JOIN last_season_scd ls + ON ls.player_name = ts.player_name + WHERE ts.scoring_class <> ls.scoring_class + OR ts.is_active <> ls.is_active +), + +-- builds from previous CTE, just makes it more readable +unnested_changed_records AS ( + SELECT + player_name, + (records).scoring_class, + (records).is_active, + (records).start_season, + (records).end_season + FROM changed_records +), + +new_records AS ( -- new players that were not in the dataset before + SELECT + ts.player_name, + ts.scoring_class, + ts.is_active, + ts.current_season AS start_season, + ts.current_season AS end_season + FROM this_season_data ts + LEFT JOIN last_season_scd ls + ON ts.player_name = ls.player_name + WHERE ls.player_name IS NULL + -- only include those players that don't exist in last_season (ls) +) + +SELECT * FROM historical_scd + +UNION ALL + +SELECT * FROM unchanged_records + +UNION ALL + +SELECT * FROM unnested_changed_records + +UNION ALL + +SELECT * FROM new_records +``` + +This query looks quite insane, but it processes a lot less data than the other one, as it only processes the compacted data from 2022 and 2021. However, it needs quite attention since it is quite convoluted. + +Also, we made some assumptions here, in that `scoring_class` and `is_active` can never be `NULL`, but in reality they can, so one should keep that in consideration (since equality with `NULL` doesn’t make sense in SQL, so one should have to use stuff like `is distinct from` or do two checks). 
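For instance, a quick way to see why `<>` misses NULLs, and how a NULL-safe check behaves (PostgreSQL):

```sql
-- NULL <> 'star' evaluates to NULL, so a WHERE clause silently drops the row;
-- IS DISTINCT FROM treats NULL vs non-NULL as a real change (and NULL vs NULL as no change).
SELECT
    NULL::scoring_class <> 'star'               AS neq_result,          -- NULL
    NULL::scoring_class IS DISTINCT FROM 'star' AS is_distinct_result;  -- true

-- so the change check in `changed_records` could be written NULL-safely as:
--   WHERE ts.scoring_class IS DISTINCT FROM ls.scoring_class
--      OR ts.is_active IS DISTINCT FROM ls.is_active
-- (and `unchanged_records` would use IS NOT DISTINCT FROM)
```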
diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..b805c293 --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,158 @@ +# Day 2 - Lecture + +# Intro + +The topic is **slowly changing dimensions (SCD).** + +A SCD is an attribute that can change over time, such as favorite food (e.g. some years it can be lasagna, and then later it can be curry chicken). Not all dimensions are slowly changing. Some never change, like your birthday, for instance. + +SCDs need to be modeled properly. If they’re not, you risk hindering “idempotency” (a property of your data pipeline to always return the same result if processing the same data more than once). + +# Idempotent pipelines are CRITICAL + +*Idempotent → denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.* + +Terrible definition, so let’s review it. + +**Pipelines should produce the same result (given the same exact inputs)** + +- Regardless of the day you run it +- Regardless of how many times you run it +- Regardless of the hour that you run it + +This is important because imagine having a pipeline that you run today, and then in a week you backfill it, you will end up with different data. + +## Why is troubleshooting non-idempotent pipelines hard? + +They fail silently. They don’t crash, but the end result is different every time. In other words, it’s non reproducible. You only notice when data inconsistencies show up (and your analyst yells at you). + +### What can make a pipeline not idempotent + +- `INSERT INTO` without `TRUNCATE` → *This creates duplicates!* + - Better idea, never use `INSERT INTO` + - Use `MERGE` or `INSERT OVERWRITE` every time instead +- Using `start_date >` without a corresponding `end_date <` +→ Imagine a pipeline with a clause like `WHERE date > yesterday`. If you run it today, you get one day of data. If you run it tomorrow, you get two days of data, and so on. Every time you run it you get one more day of data, and this is not idempotent. + - Instead, you should be using a window of data, i.e. 1 day of data, 2 days of data, 1 week of data etc. + - The **date range** should **not be unbounded**. +- Not using a full set of partition sensors +→ Your pipeline is going to run with an incomplete set of inputs, i.e. you aren’t checking for the full set of inputs that you need for your pipeline. So it runs but it runs too early, before all inputs are ready. This again creates inconsistencies. +- Not using `depends_on_past` (it’s an Airflow term) for cumulative pipelines. Another term is “sequential processing”. +→ Imagine you have a cumulative pipeline, so it depends on “*yesterday’s*” data: the pipeline cannot run in parallel. So it has to run like **“yesterday → today → tomorrow → …”**. Most pipelines aren’t like that, most pipelines can be backfilled and ran in parallel. +In cumulative pipelines, without sequential processing, it will make a mess as the data is not processed in the order it needs to be. + + + +- Relying on the “latest” partition of a not properly modeled SCD table. + - *Example at Facebook*: a `users` table, where an account can be labeled “fake” or “not fake”, and this value can change over time depending on what the account does (e.g. 
starts as fake, then completes the challenge so is not fake, but then does sketchy thing so gets labeled as fake again etc…). + There was this table `dim_all_fake_accounts` which was relying on “**latest**” data from `users` table instead of “**today’s**” data. This meant that whenever `dim_all_fake_accounts` would pull data from `users`, sometimes it would effectively pull “today’s” data, and sometimes `users` had not processed yet, so it would pull from the latest partition, which happened to be “yesterday’s”. + - Cumulative table design AMPLIFIES this bug. + + This is a bit convoluted, more info on this in the lab. + +### The pains of not having idempotent pipelines + +- Backfill and production are not gonna create the same data, in other words, old and restated data are inconsistent. +- Very hard to troubleshoot bugs +- Unit testing cannot replicate the production behavior + - Unit tests can still pass even if the pipeline is not idempotent + - But if you write idempotent pipelines, unit tests become better because now they ensure the pipeline stays idempotent +- Silent failures + +# Should you model as Slowly Changing Dimensions? + +Remember that an SCD is a dimension that changes over time. E.g.: age, favorite food, phone brand, country of residence, etc… + +Other dimensions don’t change, like birthday, eye color, etc… + +There’s also a concept of rapidly changing dimensions (e.g. heart rate, which changes minute to minute). + +- Max (Beauchemin), creator of Airflow, HATES SCD data modeling. + - His whole point is that SCDs are inherently **not idempotent** + - [Link](https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a) to Max’s article about why SCD’s suck + - In short, the gist of it is that every day you have whatever the dimension value is. This creates a lot of duplicated data, but storage cost is so cheap that it’s better than sustaining the cost of fixing errors due to SCDs. +- Options for modeling SCDs + - Latest snapshot [also known as SCD type 1 — Ed.]. This only takes in consideration the latest value of a dimension. + - Daily snapshot — this is Max’s approach. Each day the dimensional data is snapshotted in full [and added to its own partition so as to isolate it from the other snapshots]. + - SCD. See [later paragraph](#types-of-scd) for explanation. The resulting table is very compressed with respect to the daily snapshot. +- The slower a dimension changes, the better results (in term of compression) one gets if modeling using SCD2 vs full snapshot. +→ Imagine this: age in years changes once a year, so a `dim_user` table with an `age` column would have a row per user per year. Very compressed w.r.t. a daily snapshot of all users. Conversely, if we’re considering `age_in_weeks` or even `age_in_days` [i know this is absurd, it’s for the sake of the example], then the compression would be much less because there would be many more rows in the `dim_user` table. + +**Why do dimensions change?** + +- Someone decides they hate iPhone and want Android now +- Someone migrates from team dog to team cat +- Someone migrates from USA to another country +- Etc… + +## **How can you model dimensions that change?** + +Like we saw above: + +- Singular (latest) snapshots [AKA SCD type 1 — Ed.]→ **not idempotent! Never really do this.** +- Daily partitioned snapshots (Max’s strategy) +- SCD types 2, 3 + +### Types of SCD + +- **Type 0** + - Dimensions that don’t change (e.g. 
birth date) +- **Type 1** + - You only care about the latest value. + - **Don’t use this (in OLAP) because it’s not idempotent!** +- **Type 2** + - You care about what the value was from `start_date` to `end_date` + - Current values usually have either an `end_date` that is: + - `NULL` + - Far into the future like `9999-12-31` + - Often has also an `is_current` boolean column + - Hard to use + - Since there’s more than 1 row per dimension, you need to be careful about filtering on time + - **The only SCD that is purely** **idempotent** +- **Type 3** + - You only care about “original” and “current”. Doesn’t hold on to all history. Just the first and the last. + - Benefits + - You only have 1 row per dimension + - Drawbacks + - You lose the history in between original and current + - Is this idempotent? Partially, which means it’s not (if something changes more than once). + +### Which types are idempotent + +- Type 0 and Type 2 are idempotent + - Type 0 because the values are unchanging + - Type 2 is but need to be careful with using `start_date` and `end_date` +- Type 1 isn’t idempotent + - If you backfill with this dataset, you’ll get the dimension as it is now, not as it was then! +- Type 3 isn’t idempotent + - If you backfill with this dataset, it’s impossible to know when to pick “original” vs “current” + +# SCD2 Loading + +There’s two ways one can load these tables: + +1. One giant query that crunches all daily data and crunches it down + - Inefficient but nimble + - 1 query and you’re done +2. Incrementally load the data after the previous SCD is generated + - Has the same `depends_on_past` constraint + - Efficient but cumbersome + - Generally, you want your production run to be this one, but it’s not a rule of thumb, especially if the dataset is small + + diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lab.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lab.md new file mode 100644 index 00000000..2281d541 --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lab.md @@ -0,0 +1,346 @@ +# Day 3 - Lab + +In this lab, we will build a graph data model to see which NBA players play with each other and in which team. + +In the lecture we talked a lot about **vertices** and **edges**, so let’s create those tables now. + +```sql +CREATE TYPE vertex_type + AS ENUM('player', 'team', 'game'); + +CREATE TABLE vertices ( + identifier TEXT, + type vertex_type, + properties JSON, -- because postgres doesn't have a MAP type + PRIMARY KEY (identifier, type) +); + +CREATE TYPE edge_type + AS ENUM ( + 'plays_against', -- a player + 'shares_team', -- with a player + 'plays_in', -- in a game + 'plays_on' -- on a team + ); + +CREATE TABLE edges ( + subject_identifier TEXT, + subject_type vertex_type, + object_identifier TEXT, + object_type vertex_type, + edge_type edge_type, + properties JSON, + PRIMARY KEY (subject_identifier, + subject_type, + object_identifier, + object_type, + edge_type) -- idk why the PK here is so convoluted +); +``` + +Let’s create **game** as a vertex type. 
+ +```sql +-- let's check the table first +-- this is already deduped so we just need to move it to vertices +SELECT * FROM games; + +INSERT INTO vertices +SELECT + game_id AS identifier, + 'game'::vertex_type AS type, + json_build_object( + 'pts_home', pts_home, + 'pts_away', pts_away, + 'winning_team', CASE WHEN home_team_wins = 1 THEN home_team_id ELSE visitor_team_id END + ) AS properties +FROM games; +``` + +Now we do the same for players + +```sql +-- first we check the data that we will use +SELECT + player_id AS identifier, + MAX(player_name) AS player_name, -- can also be MIN, it's just to get the name + COUNT(1) AS number_of_games, + SUM(pts) AS total_points, + ARRAY_AGG(DISTINCT team_id) AS teams +FROM game_details +GROUP BY player_id; + +-- let’s build the vertex from there +INSERT INTO vertices +WITH players_agg AS ( + SELECT + player_id AS identifier, + MAX(player_name) AS player_name, + COUNT(1) AS number_of_games, + SUM(pts) AS total_points, + ARRAY_AGG(DISTINCT team_id) AS teams + FROM game_details + GROUP BY player_id +) + +SELECT + identifier, + 'player'::vertex_type, + json_build_object( + 'player_name', player_name, + 'number_of_games', number_of_games, + 'total_points', total_points, + 'teams', teams + ) +FROM players_agg +``` + +And finally, let’s do the teams + +```sql +SELECT * FROM teams; + +INSERT INTO vertices +-- this data somehow has dupes so we have to do some silly shenanigan to dedupe it +WITH teams_deduped AS ( + SELECT *, ROW_NUMBER() OVER(PARTITION BY team_id) AS row_num + FROM teams +) +SELECT + team_id AS identifier, + 'team'::vertex_type AS type, + json_build_object( + 'abbrevation', abbreviation, + 'nickname', nickname, + 'city', city, + 'arena', arena, + 'year_founded', yearfounded + ) +FROM teams_deduped +WHERE row_num = 1 +``` + +We have created all vertices in our table. + +```sql +SELECT + type, + COUNT(1) +FROM vertices +GROUP BY 1; +``` + +![image.png](images/d3la_image.png) + +This was kind of the easy part, now we will start adding to the edges table, which is going to be more nasty. + +```sql +-- We have some dupes in this table too (this is just an issue with the data import) +-- so we need to use the same trick we did before + +INSERT INTO edges +WITH deduped AS ( + SELECT *, ROW_NUMBER() OVER(PARTITION BY player_id, game_id) AS row_num + FROM game_details +) + +SELECT + player_id AS subject_identifier, + 'player'::vertex_type AS subject_type, + game_id AS object_identifier, + 'game'::vertex_type AS object_type, + 'plays_in'::edge_type AS edge_type, + json_build_object( + 'start_position', start_position, + 'pts', pts, + 'team_id', team_id, + 'team_abbreviation', team_abbreviation + ) AS properties +FROM deduped +WHERE row_num = 1; + +-- let's take a look at what we have so far +SELECT * +FROM vertices v +JOIN edges e + ON e.subject_identifier = v.identifier + AND e.subject_type = v.type; +``` + +Now we create an edge that is `plays_against` between two players. This actually has to create 2 edges, that are a mirror of each other (A → B but also B → A). 
We will do this via a **SELF JOIN.** + +```sql +WITH deduped AS ( + SELECT *, ROW_NUMBER() OVER(PARTITION BY player_id, game_id) AS row_num + FROM game_details +), + +filtered AS ( + SELECT * FROM deduped + WHERE row_num = 1 +) + +SELECT + f1.player_name, + f2.player_name, + f1.team_abbreviation, + f2.team_abbreviation +FROM filtered f1 + JOIN filtered f2 -- this is the self join + ON f1.game_id = f2.game_id + AND f1.player_name <> f2.player_name +``` + +The above query [run it and see] returns both: + +- Players that played **against** each other in the same game +- Players that played **with** each other in the same game + +So we could actually use it to generate both edges (`plays_against` and `shares_team`) + +Let’s build upon it: + +```sql +-- [dedupe part] + +SELECT + f1.player_id, + f1.player_name, + f2.player_id, + f2.player_name, + CASE + WHEN f1.team_abbreviation = f2.team_abbreviation THEN 'shares_team'::edge_type + ELSE 'plays_against'::edge_type + END +FROM filtered f1 + JOIN filtered f2 + ON f1.game_id = f2.game_id + AND f1.player_name <> f2.player_name +``` + +However, the above creates an **edge per game**, but we don’t want that, we want to create an aggregation of both sides. So let’s create it. + +```sql +-- [..] +SELECT + f1.player_id, + f1.player_name, + f2.player_id, + f2.player_name, + CASE + WHEN f1.team_abbreviation = f2.team_abbreviation THEN 'shares_team'::edge_type + ELSE 'plays_against'::edge_type + END AS edge_type, + COUNT(1) AS num_games, + SUM(f1.pts) AS left_points, + SUM(f2.pts) AS right_points +FROM filtered f1 + JOIN filtered f2 + ON f1.game_id = f2.game_id + AND f1.player_name <> f2.player_name +GROUP BY f1.player_id, + f1.player_name, + f2.player_id, + f2.player_name, + CASE + WHEN f1.team_abbreviation = f2.team_abbreviation THEN 'shares_team'::edge_type + ELSE 'plays_against'::edge_type + END; + +-- BTW this query is mega slow so either wait or add a LIMIT 100; +``` + +Now the result of this query shows who played against or with who, how many times, and how many points were scored by each. + +One issue is that there’s going to be a lot of edges and half of it is a duplicate, because **A → B** has the same meaning as **B → A** (the ARROW is the edge, i.e. this relationship we’re modeling). + +In other words, this is a 2 sided connection, and who is the subject and who is the object doesn’t matter (it’s a transitive relationship). + +To obtain just one edge, so no duplicates but same data, we can add the following: + +```sql +FROM [..] +WHERE f1.player_id > f2.player_id +``` + +This makes it so that we do not have double edges. + +Let’s now wrap the query in another CTE and actually build the **edges**. + +```sql +-- [..] +-- we have to add a MAX to player_name, and remove it from GROUP BY +-- because some players may have same IDs but different names, so it will create dupes +-- and not allow us to INSERT INTO the edges table. +-- This solution fixes the issue. 
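-- (here "[..]" elides the deduped/filtered CTEs from the earlier queries, which the
-- "aggregated" CTE below builds on; when actually loading the table, an
-- INSERT INTO edges would go before the WITH, as in the previous insert statements)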
+aggregated AS ( + SELECT + f1.player_id AS subject_player_id, + MAX(f1.player_name) AS subject_player_name, + f2.player_id AS object_player_id, + MAX(f2.player_name) AS object_player_name, + CASE + WHEN f1.team_abbreviation = f2.team_abbreviation THEN 'shares_team'::edge_type + ELSE 'plays_against'::edge_type + END AS edge_type, + COUNT(1) AS num_games, + SUM(f1.pts) AS subject_points, + SUM(f2.pts) AS object_points + FROM filtered f1 + JOIN filtered f2 + ON f1.game_id = f2.game_id + AND f1.player_name <> f2.player_name + WHERE f1.player_id > f2.player_id + GROUP BY + f1.player_id, + f2.player_id, + CASE + WHEN f1.team_abbreviation = f2.team_abbreviation THEN 'shares_team'::edge_type + ELSE 'plays_against'::edge_type + END + +) + +SELECT + subject_player_id AS subject_identifier, + 'player'::vertex_type AS subject_type, + object_player_id AS object_identifier, + 'player'::vertex_type AS object_type, + edge_type AS edge_type, + json_build_object( + 'num_games', num_games, + 'subject_points', subject_points, + 'object_points', object_points + ) +FROM aggregated; +``` + +Let’s check the final results with some example queries. + +```sql +-- shows all relationships between players +SELECT * +FROM vertices v + JOIN edges e + ON v.identifier = e.subject_identifier + AND v.type = e.subject_type +WHERE e.object_type = 'player'::vertex_type; + +-- this shows for each player, their career performance (in pts per game) +-- as well as how they perform when paired with (or against) another player +-- [in the video, Zach rushes this so there were some inconsistencies that I fixed -- Ed.] +SELECT + v.properties->>'player_name' AS player_name, + e.object_identifier AS other_player_id, + CAST(v.properties->>'total_points' AS REAL) / + CASE WHEN CAST(v.properties->>'number_of_games' AS REAL) = 0 THEN 1 + ELSE CAST(v.properties->>'number_of_games' AS REAL) END AS career_avg, + CAST(e.properties->>'subject_points' AS REAL) / + CASE WHEN CAST(e.properties->>'num_games' AS REAL) = 0 THEN 1 + ELSE CAST(e.properties->>'num_games' AS REAL) END AS avg_points_per_game +FROM vertices v + JOIN edges e + ON v.identifier = e.subject_identifier + AND v.type = e.subject_type +WHERE e.object_type = 'player'::vertex_type; +``` diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lecture.md b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lecture.md new file mode 100644 index 00000000..651e157e --- /dev/null +++ b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/Day 3 - Lecture.md @@ -0,0 +1,186 @@ +# Day 3 - Lecture + +# Intro + +In this lecture we talk about graph data modeling. + +Graph data modeling is more relationship focused rather than entity focused, but comes with a tradeoff in that you don’t have that much of a schema, it’s rather flimsy and flexible. + + + +# What makes a dimension additive? + +Additive dimensions mean that you don’t *“double count”*. + +In other words, you can take all the sub totals and sum them up, you get the grand total, and it’s a correct number. + +E.g: *Population = all 1 year olds + all 2 year olds + all 3 year olds + …* + +Not all dimensions are additive. + +E.g: *Total* c*ar drivers is NOT EQUAL to all Honda drivers + all Toyota drivers … What if a person owns 2 different cars?* + +## The essential nature of additivity + +A dimension is additive over a specific **time window** if and only if the grain of data over that window can only ever be one value at a time. 
+ +Back to the car example, for instance: on a small enough time scale, you can say that *# drivers = all Honda drivers + all Toyota drivers + …* + +If the window is like 1 second, that’s obvious, because no one can drive 2 cars at once. In this case the dimension is additive, but as the timescale gets larger, it loses additivity. + +### How does additivity help? + +- If you have the subtotals you can just add them up, and you don’t need to use `COUNT(DISTINCT)` **on pre-aggregated dimensions**. In other words, you don’t have to go back down one level of grain to get the total count. +- Non-additive dimensions are usually non-additive w.r.t. `COUNT` aggregations, but not `SUM`. +→ E.g. if you sum instead all miles driven by Honda drivers + all driven by Toyota drivers and so on, this makes sense, because one can’t drive two cars at once, ever. + + + +# When should you use enums? + +We covered **enums** in the last lab, with the NBA “scoring class”, like *star, good, average, bad.* + +One things with enumerations is that there is a limit how much you wanna enumerate. + +- Enums are great for low-to-medium cardinality +- Country is a great example of where Enums start to struggle. + +**Why should you use enums?** + +- Built in data quality +→ If model a field as an **enum** and you get a value that doesn’t fit, the pipeline fails. +- Built in static fields +→ Sometimes there’s fields of an enum that are just static that you don’t need to know about. [Not exactly sure what Zach means here — Ed.] +- Built in documentation +→ You already know what all the possible values of a thing can be. If it’s just a `STRING`, you don’t get that. + +Don’t throw them around everywhere like it’s rice at a wedding, use them properly! + +## Enumerations and subpartitions + +- Enums make amazing subpartitions, because + - You have an exhaustive list + - They chunk up the big data problem into manageable pieces +- The little book of pipelines [example](https://github.com/EcZachly/little-book-of-pipelines) +→ Design developed by Zach, leverages the concept of Enum to a great extent. + +The following diagram shows an example of this design pattern, with 3 sources (but there can be a lot more) + +![image.png](images/d3le_image.png) + +So here you have an **enum** that contains a set of values, like “fees”, “coupons”, “infra cost” and other things, and they’re all different sources. + +This enum gets shared with all **source functions**, and what these source functions do is they map the data to a **shared schema**, shared among all sources. + +The Little book of Enums also includes the several DQ checks for each source/partition. + +After each DQ check, the **subpartitioned output** will have a **date** partition, and then the **subpartition** will be the **enum values.** + +This patterns scales nicely, because if you need another source, you just add another value to the little book of enum, and that’s it, and you also have the docs for free, because people can just query the book easily. + +**How does the enum actually look?** + +Usually it’s just a Python or Scala enumerator, and then you have a job that turns it into a tiny table, as many rows as values in the enum, and that’s how you can share it between your DQ checks and your **source functions.** Check out the example at the link above for proper implementation. + +### What type of use cases is this enum pattern useful? + +Whenever you have tons of sources mapping to a shared schema. 
Or in other words, when tons of different sources need to be brought together downstream. + +Some examples from Zach: + +- Airbnb + - Unit Economics (fees, coupons, credits, insurance, infra cost, taxes, etc…) +- Netflix + - Infrastructure Graph (applications, databases, servers, code bases, CI/CD jobs, etc…) +- Facebook: + - Family of Apps (oculus, instagram, facebook, messenger, whatsapp, threads, etc…) + +What you end up with, is that all this different data ends up being in the **SAME TABLE.** + +**How do you model data from disparate sources into a shared schema?** + +Zach says with what he calls a **Flexible Schema**! + + + +In this flexible schema, you want to often leverage a **MAP datatype**, which sort of overlaps with a graph data model. + +# Flexible schemas + +What do you do if you need to add more things? Just put them in the map! Throw more columns in there. + +- Benefits + - You don’t have to run ALTER TABLE commands + - You can manage a lot more cols + - Your schemas don’t have a ton of “NULL” columns + - `other_properties` column is pretty awesome for rarely-used-but-needed columns +- Drawbacks + - Compression is usually worse (especially if you use JSON) + → Reason is, the header has to + - Readability, queriability + +# How is graph data modeling different? + +Graph modeling is **relationship** focused, **not entity** focused. + + + +The main thing to remember is that it’s not entity focused, so we don’t care about columns. + +An entity in a graph has 3 columns: + +- id: `STRING` +- type: `STRING` +- properties: `MAP` + +The whole idea behind graph DBs is that it shifts focus from **how things are** to **how things are connected.** + +--- + +The schema above is for **vertexes** (entities). + +**Edges** (relationships) have their own schema: + +- subject_id: `STRING` +→ the entity **DOING** the thing +- subject_type: `VERTEX_TYPE` +- object_id: `STRING` +→ the entity **RECEIVING** the thing +- object_type: `VERTEX_TYPE` +- edge_type: `EDGE_TYPE` +→ almost always a **verb**: “is”, “plays with”… +- properties: `MAP` + +Example for player in a team: + +- subject_type: `player` +- object_type: `team` +- edge_type: `plays_on` +- properties: `{years_playing_on_team: "", starting_year: ""}` + +![image.png](images/d3le_image%201.png) diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 1.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 1.png new file mode 100644 index 00000000..20069a44 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 1.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 2.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 2.png new file mode 100644 index 00000000..8872ce92 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 2.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 3.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 3.png new file mode 100644 index 00000000..6f438559 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image 3.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image.png new file mode 100644 
index 00000000..b5086873 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1la_image.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 1.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 1.png new file mode 100644 index 00000000..e7e594e5 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 1.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 2.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 2.png new file mode 100644 index 00000000..7df8c798 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 2.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 3.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 3.png new file mode 100644 index 00000000..85e08ae4 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image 3.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..7db73c9d Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d2la_image.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d2la_image.png new file mode 100644 index 00000000..ceb36f70 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d2la_image.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3la_image.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3la_image.png new file mode 100644 index 00000000..439504ec Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3la_image.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image 1.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image 1.png new file mode 100644 index 00000000..7e7d88d8 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image 1.png differ diff --git a/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image.png b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image.png new file mode 100644 index 00000000..dc5099f2 Binary files /dev/null and b/bootcamp/materials/1-dimensional-data-modeling/markdown_notes/images/d3le_image.png differ diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..498ed9f7 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,216 @@ +# Day 1 - Lab + +In this lab, we will work mostly with `game_details` table. + +```sql +SELECT * FROM game_details; +``` + +This table is quite terrible. + +When you’re working with fact data, the **grain** of the data matters a lot. 
The grain is considered the lowest common denominator, the unique ID of the table. + +For `game_details`, the grain is + +```sql +SELECT + game_id, team_id, player_id, count(1) +FROM game_details +GROUP BY 1, 2, 3 +HAVING count(1) > 1 +``` + +You can see that we actually have a lot of dupes. So we want to create a filter here to get rid of them. + +```sql +-- this is gonna be our start query we will work with +WITH deduped AS ( + SELECT + *, ROW_NUMBER() OVER(PARTITION BY game_id, team_id, player_id) AS row_num + FROM game_details +) + +SELECT * FROM deduped +WHERE row_num = 1; +``` + +One things about this fact data is that it’s very denormalized, and probably a lot of the things here aren’t really necessary. At the same time, there are also missing columns. Remember from Lecture 1, facts need a **when**, and there’s no when here at all. + +The **when** column we get it from `game_id`. + +```sql +WITH deduped AS ( + SELECT + *, + g.game_date_est, + ROW_NUMBER() OVER(PARTITION BY gd.game_id, team_id, player_id ORDER BY g.game_date_est) AS row_num + FROM game_details gd + JOIN games g ON gd.game_id = g.game_id +) +SELECT * FROM deduped +WHERE row_num = 1; +``` + +> Note: regarding unnecessary data, look that we have both `team_abbreviation` and `team_city` in here. But `team`s is never going to be big data, even in the next 250 years. So having those two columns in an abomination, as you can easily join the `teams` table at any given time. +> + +Game however, is different, as `games` is gonna grow much more than `teams`. If we have to join it every time, it will become very slow, especially over time. + +Let’s select and parse just the columns we care about. + +```sql +WITH deduped AS ( + SELECT + g.game_date_est, + g.season, + g.home_team_id, *, + ROW_NUMBER() OVER(PARTITION BY gd.game_id, team_id, player_id ORDER BY g.game_date_est) AS row_num + FROM game_details gd + JOIN games g ON gd.game_id = g.game_id +) + +SELECT -- we don't put `game_id` because we already pull the needed info from `games` + game_date_est, + season, + team_id, + team_id = home_team_id AS dim_is_playing_at_home, + player_id, + player_name, + start_position, + -- the comment column is quite high in cardinality + -- so we parse the most important situations + COALESCE(POSITION('DNP' in comment), 0) > 0 AS dim_did_not_play, + COALESCE(POSITION('DND' in comment), 0) > 0 AS dim_did_not_dress, + COALESCE(POSITION('NWT' in comment), 0) > 0 AS dim_did_not_with_team, + -- minutes were a string like "12:56:, so we transformed it + -- into a proper decimal number + CAST(SPLIT_PART(min, ':', 1) AS REAL) + + CAST(SPLIT_PART(min, ':', 2) AS REAL) / 60 AS minutes, + fgm, -- some basketball jargon 👇 + fga, + fg3m, + fg3a, + ftm, + fta, + oreb, + dreb, + reb, + ast, + stl, + blk, + "TO" AS turnovers, + pf, + pts, + plus_minus +FROM deduped +WHERE row_num = 1; +``` + +Let’s now create the DDL for the above data + + + +```sql +CREATE TABLE fct_game_details ( + dim_game_date DATE, + dim_season INTEGER, + dim_team_id INTEGER, + dim_player_id INTEGER, + dim_player_name TEXT, + dim_start_position TEXT, + dim_is_playing_at_home BOOLEAN, + dim_did_not_play BOOLEAN, + dim_did_not_dress BOOLEAN, + dim_not_with_team BOOLEAN, + m_minutes REAL, -- `m_` prefix stands for `measure`, to distinguish from `dim_` + m_fgm INTEGER, + m_fga INTEGER, + m_fg3m INTEGER, + m_fg3a INTEGER, + m_ftm INTEGER, + m_fta INTEGER, + m_oreb INTEGER, + m_dreb INTEGER, + m_reb INTEGER, + m_ast INTEGER, + m_stl INTEGER, + m_blk INTEGER, + m_turnovers INTEGER, + m_pf 
INTEGER, + m_pts INTEGER, + m_plus_minus INTEGER + PRIMARY KEY (dim_game_date, dim_team_id, dim_player_id) + -- team_id is a bit redundant but we add to the PK cause of indexing reasons + -- (in postgres case in specific) +) +``` + +Using the `dim_` and `m_` naming convention is useful because it indicates that `dim_` are columns you should do group by on and filter on, whereas `m_` are columns that you should aggregate and do math on. + +Now we take the previous `SELECT` query and insert the results in the table we just created: + +```sql +INSERT INTO fct_game_details +WITH deduped AS ( + SELECT + g.game_date_est, + g.season, + g.home_team_id, + gd.*, + ROW_NUMBER() OVER(PARTITION BY gd.game_id, team_id, player_id ORDER BY g.game_date_est) AS row_num + FROM game_details gd + JOIN games g ON gd.game_id = g.game_id +) + +SELECT + game_date_est AS dim_game_date, + season AS dim_season, + team_id AS dim_team_id, + player_id AS dim_player_id, + player_name AS dim_player_name, + start_position AS dim_start_position, + team_id = home_team_id AS dim_is_playing_at_home, + COALESCE(POSITION('DNP' in comment), 0) > 0 AS dim_did_not_play, + COALESCE(POSITION('DND' in comment), 0) > 0 AS dim_did_not_dress, + COALESCE(POSITION('NWT' in comment), 0) > 0 AS dim_did_not_with_team, + CAST(SPLIT_PART(min, ':', 1) AS REAL) + + CAST(SPLIT_PART(min, ':', 2) AS REAL) / 60 AS m_minutes, + fgm AS m_fgm, + fga AS m_fga, + fg3m AS m_fg3m, + fg3a AS m_fg3a, + ftm AS m_ftm, + fta AS m_fta, + oreb AS m_oreb, + dreb AS m_dreb, + reb AS m_reb, + ast AS m_ast, + stl AS m_stl, + blk AS m_blk, + "TO" AS turnovers, + pf AS m_pf, + pts AS m_pts, + plus_minus AS m_plus_minus +FROM deduped +WHERE row_num = 1; +``` + +In a real setting, you should be changing the name of things, because sometimes the column names are terrible. + +Let’s do some example analytics on this data. + +```sql +-- query to see who has the highest bail % on games +SELECT + dim_player_name, + COUNT(CASE WHEN dim_not_with_team THEN 1 END) AS bailed_num, + COUNT(CASE WHEN dim_not_with_team THEN 1 END) * 1.0 / COUNT(1) AS bailed_pct +FROM fct_game_details +GROUP BY 1 +ORDER BY 3 DESC; +``` diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..493ead33 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,203 @@ +# Day 1 - Lecture + +# What is a fact? + +Something that **happened** or **occurred.** + +- A user logs into an app +- A transaction is made +- You run a mile with your fitbit + +Contrary to dimensions, **facts don’t change**, which makes them easier to model than dimensions **in some respects**. + +It should be something atomic, you can’t break it down into smaller pieces (so like the *mile example* is not perfect because you can break it down). + +# What makes fact modeling hard? + +Fact data is usually 10-100x the volume of dimension data. + +- Example of how many steps you take in a day. You take like 5000-10000, vs YOU, as a single datapoint in a `users` table. +- Facebook had 2B active users when Zach worked there, and sent 50B notifications every day. + +Fact data can need a lot of context for effective analysis. + +- Imagine sending a notification. That fact in isolation is quite useless. But imagine sending a notification, and then the user clicks on it, and then 20 minutes later they buy something. This is a lot more informative. 
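To make the "context" point concrete, here is a hedged sketch of stitching facts together. The table and column names (`notification_sends`, `notification_clicks`) are invented for illustration and are not part of the course datasets:

```sql
-- A "sent" fact on its own says very little; joining the downstream click fact
-- to it (within some attribution window) is what makes it informative.
SELECT
    s.notification_id,
    s.user_id,
    s.sent_at,
    c.clicked_at,
    c.clicked_at - s.sent_at AS time_to_click
FROM notification_sends s
LEFT JOIN notification_clicks c
    ON  c.notification_id = s.notification_id
    AND c.clicked_at BETWEEN s.sent_at AND s.sent_at + INTERVAL '1 day';
```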
+ +So basically, fact data needs other data around it (other facts or even dimensions) to be valuable + +Duplicates in facts are way more common than in dimensional data. + +- E.g. maybe the SWE team publishes a bug that logs things twice, but sometimes duplicates can be genuine, for instance a user clicking twice on a notification, a few hours apart. This is a proper action that we want to log, but not count it as 2, cause then you’d have skewed metrics (e.g. CTR = 200%). Dupes are very challenging to work with. + +# How does fact data modeling work? + +**Normalization vs Denormalization** + +- Both of them important and powerful, and both of them can cause lots of problems +- Normalized facts don’t have any dimensional attributes, just IDs to join to get that information. +- Denormalized facts bring in some dimensional attributes for quicker analysis at the cost of more storage + + + +**Fact data and raw logs are NOT the same thing.** + +They’re married, but not the same! + +- **Raw logs** → usually owned by people without strong data skills — SWEs. Don’t really care quite as much. As a DE you can work with them to get the data logged in the correct format. + - Ugly schemas designed for online systems that make data analysis sad + - Potentially contains duplicates and other quality errors + - Usually have shorter retention +- **Fact data** → the trust in this data should be orders of magnitude higher than raw logs. As a DE, this is your goal, convert raw logs into highly trusted fact data. **** + - Nice column names + - Quality guarantees like uniqueness, not null, etc… + - Longer retention + +Think of facts as Who, Where, When, What, and How? + +- “Who” fields are usually pushed out as IDs (*this user clicked this button,* we only hold the `user_id` not the entire user object) +- “Where” fields + - Location, country, city, state + - **OR** also, **where** in a web app the user did the action. Can also be modeled with **ids**, but often doesn’t happen. +- “How” fields + - How fields are very similar to “where” fields. *“He used an iphone to make this click”.* +- “What” fields → Fundamentally part of the nature of the fact + - In notification world: “GENERATED”, “SENT”, “CLICKED”, “DELIVERED” +- “When” fields → Same as *what*, part of the nature of the fact + - Mostly an `event_timestamp` field or `event_date` + - **Make sure that devices are logging UTC timezone, even in clientside.** + - **Clientside logging** is usually better than serverside logging, because you can get all the interactions the user does, and not just things that cause a web request to happen. + +- Fact datasets should have quality guarantees → if they didn’t, analysis would just go to the raw logs + - **No duplicates!** + - **Certain fields that must exist: e.g “what” and “when” should never be null.** + - “Who” field also is quite important not to be null. +- Fact data should generally be smaller than raw logs +- Fact data should parse out hard-to-understand columns! + - Often SWEs pass you a column that is a string but is a JSON and is a blob of nastyness that’s impossible to query. Fact data should not have these kinds of shits and be simple! + + + +# When should you model in dimensions? 
+ +Example from Zach’s time at Netflix: [Network logs pipeline](https://www.youtube.com/watch?v=g23GHqJje40) + +- 2PBs of brand new data every day, over 100TBs/hr +- Loads of microservices + - Really good to make development faster, and split responsibilities + - A mess when it comes to security, ‘cause instead of having one app that can be hacked, imagine having 3000. +- They wanted to see which microservice app each network request came from and went to, in order to see how an app getting hacked would affect others + - The only way to do that was by looking at all network traffic, and that was insane. +- What they did was take the network traffic and JOIN it with a small database of IP addresses and app names, all of the IP addresses for Netflix microservices architecture + - IP addresses were the identifiers of the apps. + - The join would result into something like `IP_1 talks to IP_2` etc… +- This was doable because the right side of the JOIN is a small table that could fit in a **broadcast join** +- When they needed to move to IPv6, they realized the new JOIN would have not worked anymore, cause there were too many. + - They tried but the cost was skyrocketing +- The solution was to not do the JOIN at all, and have instead all of the apps log the “app” field with each network request. **DENORMALIZATION SAVES THE DAY**. +- This required each microservice to adopt a “sidecar proxy” that enabled logging of which app they were. +- Very large org effort to solve this issue. + + + +# How does logging fit into fact data? + +Logging should give all the columns that you need, except maybe for some dimensional columns. + +- Logging brings in all the critical context for your fact data + - Usually done in collaboration with online system engineers → They are the ones knowing more about event generation, i.e. **WHEN and HOW** those events are being created. +- **Don’t log everything!** + - Log only what you really need + - Raw logs can be very expensive and cost a lot of money in the cloud + - Logging stuff “just in case” is an anti-pattern +- Logging should conform to values specified by the online teams + - There should be some sort of contract or schema or shared vision so that teams can be easily aligned (at the technical level too) + - Airbnb example: imagine part of a system is written in Ruby, another in Scala, how do you reconcile the schemas? You can’t import a library. So you need some kind of middle layer between these 2 realities. + - At Netflix and Airbnb these things were defined in a *Thrift schema* + - **Thrift** [I think Zach’s talking about [Apache Thrift](https://thrift.apache.org/) — Ed.] is a specification that is language agnostic and it’s a way to describe schema and data (and functions), in a way that is sharable, so that both Ruby and Scala could reference the same schema for price. + - Ideally there would be a test here so that if team A breaks teams B code, team A can’t push until they talked to team B (e.g. “hey we’re adding this new column etc…”). + +# Options when working with high volume fact data + +**Sampling** + +- Sometimes the solution is to filter down and not work with all of the data. Doesn’t work for all use cases, best used for metric-driven use cases where imprecision isn’t an issue. +- In some cases it can’t work: e.g. security or those very low probability situations. + +Overall it can be very cost effective, as past a certain amount of data you get a diminishing return on how accurate a statistic can be (law of large numbers and whatnot). 
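As a minimal sketch of what sampling can look like in practice (Postgres syntax, assuming the `events` table used in the labs; the 1% rate is an arbitrary illustration):

```sql
-- Row-level sampling: roughly 1% of rows, re-drawn on every scan.
SELECT * FROM events TABLESAMPLE BERNOULLI (1);

-- User-level sampling: a stable ~1% of users, so day-over-day metrics stay comparable.
-- (hashtext is a Postgres built-in hash function; any stable hash works here)
SELECT *
FROM events
WHERE MOD(ABS(HASHTEXT(user_id::TEXT)), 100) = 0;
```

Sampling by user rather than by row keeps the same users in the sample every day, which is usually what you want when tracking metric trends.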
+ +**Bucketing** + +- Fact data can be bucketed by one of the important dimensions (usually user, the “who”) +- If you join on the bucketed column, you don’t have to shuffle across the entire dataset, but just within the bucket, and it can be a lot faster. +- Sorted-merge bucket (SMB) joins can do joins without shuffle at all because both sides of data are sorted and can be joined “like a zipper”. + +# How long should you hold onto fact data? + +If you have very high volume fact data you can’t just hold on to it forever, it becomes very costly. + +- Big tech has an interesting approach here: + - Any fact tables <10 TBs, retention didn’t matter much + - Anonymization of facts usually happened after 60-90 days though, and the data would be moved to a new table with the PII stripped. + - Any fact tables >100 TBs, **VERY SHORT RETENTION** (~14 days or less) + - These sizes (10 and 100 TBs) were for big tech, but any company can choose on their own window. + + + +# Deduplication of fact data + +- Facts can often be duplicated + - e.g. you can click a notification multiple times (*think about the timeframe*) +- How do you pick the right window for deduplication? + - No duplicates in a day? An hour? A week? + - Imagine if a user clicks a Facebook notification today, and then the same notification a year from now. Does it matter even if it’s duplicate? Most likely not. + - You can do some analysis on your dataset to look at distribution of duplicates. + +**Intraday deduping options** + +- Streaming → very short time frame +- Microbatch → hourly basis + +## Streaming to deduplicate facts + +Streaming allows you to capture most duplicates in a very efficient manner. + +- You can capture the duplicates on whatever window you want → “we saw this notification ID and we’re gonna hold on to it X time. If we see it again within X, then we have found a duplicate” +- A large majority of dupes happen in a short window after a short event +- **Entire day duplicates can be harder for streaming because it need to hold onto such a big window of memory.** In the end streaming didn’t work because it used too much memory *(keep in mind we’re talking about FAANG business, so like 50B records a day — a problem most people won’t have)* + +## Hourly microbatch dedupe + +Used to reduce landing time of daily tables that dedupe slowly (last point of previous paragraph). + +In Zach’s example, this took 1 hour instead of 9 hours every day, deduping 50B notification events every day. + +Steps: + +- Get all the data for a specific hour → Aggregate down (`GROUP BY`). +- A `FULL OUTER JOIN` between hour 0 → 1, or 1 → 2, or 2 → 3 etc… + - What this does it makes it so that it eliminates dupes that are across hours. Eg. a dupe that happened in hour 0 and also in hour 1. +- This all comes together and it branches like a tree + +![image.png](images/d1le_image.png) + +- Basically the hour pairs keep merging until you have the final, daily dataset. diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lab.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lab.md new file mode 100644 index 00000000..890305e7 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lab.md @@ -0,0 +1,181 @@ +# Day 2 - Lab + +Let’s take a look at `events` table. + +```sql +SELECT * FROM events; +``` + +This is network requests going to Zach’s website. + +What we wanna do is to cumulate this up, and find all days where different users were active. 
+ +For starters, let’s create a table + +```sql +CREATE TABLE users_cumulated ( + user_id TEXT, -- text because ids are bigger than BIGINT allows + dates_active DATE[], -- list of past dates where user was active + date DATE, -- current date for the user + PRIMARY KEY (user_id, date) +); +``` + +Let’s now get started with cumulative table design, like we did the previous week. + +```sql +INSERT INTO users_cumulated +WITH +yesterday AS ( + SELECT + * + FROM users_cumulated + WHERE date = DATE('2022-12-31') -- last date before beginning of dataset +), + +-- from this, we need all users who were active "today" +-- this is where you need to come up with a definition of "active". +today AS ( + SELECT + CAST(user_id AS TEXT), + DATE(CAST(event_time AS TIMESTAMP)) AS date_active + FROM events + WHERE DATE(CAST(event_time AS TIMESTAMP)) = DATE('2023-01-01') -- last date before beginning of dataset + AND user_id IS NOT NULL + GROUP BY user_id, DATE(CAST(event_time AS TIMESTAMP)) +) + +-- Let's match the schema of the table we just created +SELECT + COALESCE(t.user_id, y.user_id) AS user_id, + CASE + WHEN y.dates_active IS NULL THEN ARRAY[t.date_active] + WHEN t.date_active IS NULL THEN y.dates_active + ELSE ARRAY[t.date_active] || y.dates_active + END AS dates_active, + COALESCE(t.date_active, y.date + INTERVAL '1 day') AS date +FROM today t + FULL OUTER JOIN yesterday y + ON t.user_id = y.user_id; + + +SELECT * FROM users_cumulated; +``` + +Now build this up by bumping both dates by 1 day for the whole month of January, and running the query each time. + +To make sure, after the 2nd cumulation check: + +```sql +SELECT * FROM user_cumulated +WHERE date = '2023-01-02'; +``` + +You see that the last date, is the first element in the array `dates_active`. + +Let’s now generate the “date list” for a month, with the BIT MASK as described in the lecture. + +```sql +-- first we need to generate the series of the days we will consider, so all January +SELECT * +FROM generate_series(DATE('2023-01-01'), DATE('2023-01-31'), INTERVAL '1 day'); + +WITH +users AS ( + SELECT * + FROM users_cumulated + WHERE date = ('2023-01-31') +), + +series AS ( + SELECT * + FROM generate_series(DATE('2023-01-01'), DATE('2023-01-31'), INTERVAL '1 day') + AS series_date +), + +place_holder_ints AS ( + SELECT + CASE + -- when the specific date is in the array of active dates + WHEN dates_active @> ARRAY[DATE(series_date)] + THEN CAST(POW(2, 32 - (date - DATE(series_date))) AS BIGINT) + ELSE 0 + END AS placeholder_int_value, + * + FROM users CROSS JOIN series +) + +SELECT + user_id, + CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32)) AS bit_mask +FROM place_holder_ints +GROUP BY user_id; +``` + +The procedure to generate the bit mask works as such: + +- Join each user with the list of dates of the month → We get 31 rows per user. +- Wherever the date from the list of dates **is in** the user’s `dates_active` array, it means the user was **active on that date** + - Find how many days passed since the start of the period (which is 32 days long, staring yesterday) + → `x = 32 - (today - date_active)` + - So imagine, if today is Jan 31st, the period is from **Jan 30th back until Dec 31**. + - So if `date_active == today`, there’s 32 days since beginning of period (counting today as well) + *Note: This (today’s activity) won’t appear in the bit mask, as it overflows the BIT(32). It can however show up if instead of `32` , you subtract from `31`. 
Depends on the situation.* +- Calculate 2 to the power of the number above +- Transform this power into binary number + +That’s cause $2^x$ generates a number $n$ with `len(n) = x` , with only the leftmost digit = 1, and everything else equal to 0. By summing all powers of 2, and then getting the `BIT(32)` value of it, we get the **BIT MASK** for the full month, where 1s are active days, and 0s are inactive days, for any given user. Given that we use `BIT(32)`, the total length of the bit mask is 32, so we can look back to **at most** 32 days of activity (where the 1st is “today”, so last day is 31 days “ago”). + +In other words, a person active ONLY yesterday looks like `10000...`, one active ONLY two days ago looks like `01000...`. Someone active the 1st and 30th of January (again assuming today is 31st) looks like `1000......001`. The `len` of this number is always **32.** + +This cumulative table design is a very powerful way to record user activity in minimal space and also saves a lot of time for analytics as the queries won’t require any aggregation, since you just look at the bit mask of the activity. + +Let’s now see how we can calculate if a user is `monthly_active`. + +```sql +-- [..] +BIT_COUNT(CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32))) +``` + +Counting “active” bits shows how many days throughout the month the user was active. + +So we can do + +```sql +BIT_COUNT(...) > 0 AS dim_is_monthly_active +``` + +What if we want to do is `weekly_active` instead? + +This is not super elegant, but basically we **mask** the original bit mask with just the last 7 days, with a bitwise operation: `1111111000... & bit_mask`. This is called a **bitwise and.** + +In Postgresql it’s like this + +```sql +CAST('11111110000000000000000000000000' AS BIT(32)) & +CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32)) +``` + +Now, we can do the `BIT_COUNT` again and you’d get `dim_weekly_active`. + +The same for other values, like `daily_active` (e.g. active last day). + +```sql +-- [..] + +SELECT + user_id, + CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32)) AS bit_mask, + BIT_COUNT(CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32))) > 0 + AS dim_is_monthly_active, + BIT_COUNT(CAST('11111110000000000000000000000000' AS BIT(32)) & + CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32))) > 0 + AS dim_is_weekly_active, + BIT_COUNT(CAST('10000000000000000000000000000000' AS BIT(32)) & + CAST(CAST(SUM(placeholder_int_value) AS BIGINT) AS BIT(32))) > 0 + AS dim_is_daily_active +FROM place_holder_ints +GROUP BY user_id; +``` + +Also, while this operation looks ugly in code, it’s extremely efficient from a computing perspective. diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..99508a77 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,133 @@ +# Day 2 - Lecture + +# Is it a fact or a dimension + +How do we know the way to really differentiate these things? + +When Zach was working in Growth at Facebook, he was often dealing with two similar fields: `dim_is_active` and `dim_is_activated`. Both of them are “dimensions” on a user object, but with different meanings. + +- `dim_is_active` was based on whether the user had any activity for at least a minute or engaged in the app in some way. 
+- `dim_is_activated` this was related to whether the user had deactivated or not their Facebook account (e.g. to take a break from social media). + +You can see however that `dim_is_active` is a dimension is based on an **event** (liking, sharing, doing something), so is it really a dimension at that point? Or is it just an aggregation of facts? + +*Zach says: it’s both. This is where modeling can get dicey.* + +Compare it with `dim_is_activated`. In that case, you just go to your settings and click “deactivate account”, and that flags it and that’s all. It’s not an aggregation of facts. This is a proper dimension, an attribute on a user object. + +Something to think about if you’re creating a dimension over an aggregation of fact is “What is the cardinality of that dimension?”. A lot of times, you want to bucketize it, to reduce the total possible values, and make your `GROUP BYs` simpler. + + + +**You can aggregate facts and turn them into dimensions** + +- Is user a “high engager” or a “low engager”? + - Think `scoring_class` from Week 1 + - `CASE WHEN` to bucketize aggregated facts can be very useful to reduce the cardinality + +Nice rule of thumb: bucketize between 5 and 10 max values. (your buckets **HAVE** to make sense tho). + +Anyway, the TL;DR is that facts and events are not so clear cut, and more blurry, because for instance, in the case of `dim_is_activated` the fact of you clicking that button makes the dimension. So in this case, clicking the button is both a fact and a dimension too. [Idk it’s peculiar — Ed.]. + +## Properties of facts vs dimensions + +- Dimensions + - Usually show up in `GROUP BY` when doing analytics + - Can be “high cardinality” or “low cardinality” (e.g. user_id → high, country → mid, gender → low) + - Generally come from a **snapshot of state** → Zach says: at Netflix, Facebook, Airbnb, there’s a prod DB, and they snapshot it at any given time, and whatever those values are in that moment are the values for the day +- Facts + - Usually aggregated when doing analytics by things like `SUM, AVG, COUNT` + - Almost always higher volume than dimensions, although some fact sources are low-volume, think “rare events” + - Generally come from events and logs + +Keep in mind you can aggregate facts and turn them into dimensions. Facts can also **change** dimensions (e.g. think about change data capture, CDC). + + + +## Airbnb example + +In Airbnb, Zach was working in Pricing and Availability. + +On a specific night, an Airbnb listing has a specific price. That price, is it a fact or a dimension? + +It might seem like a fact, because you can `SUM` it or `AVG` it, also it’s a `DOUBLE` so have extremely high cardinality, but it kind of is a dimension, as it’s the **attribute of the night.** + +The fact, in this case, would be the host changing the setting that impacts the price. A fact has to be **logged**, a dimension comes from the **state of things**. + +Price is actually derived from all of the settings that the host has set, and those setting are **state**, so price is a dimension. + +# Boolean / existence-based fact/dimensions + +Let’s talk a bit more about dimensions that are based on facts. + +- `dim_is_active` or `dim_bought_something`, etc… + - These are usually on the daily/hour grain too +- `dim_has_ever_booked`, `dim_ever_active`, `dim_ever_labeled_fake` + - These “ever” dimensions look to see if there has “ever” been a log, and once it flips one way, it never goes back. 
+ - Interesting, simple and powerful features for machine learning → An Airbnb host with active listings who has **NEVER** been booked looks sketchier and sketchier over time +- “Days since” dimensions (e.g. `days_since_last_active`, `days_since_signup`, etc…) + - Very common in Retention analytical patterns + - Look up *“J curves”* for more details on this + +# Categorical fact/dimensions + +- Scoring class in week 1 (”star”, “good”, “avg”… based on points) + - A dimension that is derived from fact data +- Often calculated with `CASE WHEN` logic and “bucketizing” + - Example: Airbnb Superhost + +Often, these conditional columns are not bucketized based just on one single columns, but a combination of them, based on some sort of meaningful criteria. + +Once done however, these criteria can be very hard to change, because usually they become important business definitions, and therefore changing them would cause problems. + +Metaphorical example: in facebook, the hard limit for friends is 5k. If they increase it, then they can never go back. + + + +# Should you use dimensions or facts to analyze users? + +- Is the `dim_is_activated` state or `dim_is_active` logs a better metric? + - It depends! +- It’s like the difference between “signups” and “growth” in some perspectives + +Intuitively, “active users” sounds more important in this example, but both can be. For instance, you can calculate the ratio of active users over activate users, which is another great metric to look at. + +In general, it depends on the question you’re trying to answer. + +# The extremely efficient date list data structure + +- Read Max Sung’s writeup on this: +- Very efficient way to manage user growth + +One of the most common questions at Facebook was: how many monthly / weekly / daily active users we have? At that scale, a `GROUP BY` over 30 days is quite expensive. Also, redundant when done every day, cause out of 30 days, each day, only one changes, the other stay exactly the same. + +So the very naive approach is to process 30 days of data and do a `GROUP BY`. Not too smart. Also with over 2 billion users, where each user generates 50 rows a day in the fact table, so it’s 100 billion rows a day, and over a month that’s 3 trillion, imagine doing a group by on that. + +- Imagine a cumulative table design like `users_cumulated` + - user_id + - date + - dates_active - an array of all the recent days that a user was active + +👆 This is the **naive** approach. But it kinda sucks because you have this big array of dates that are not really needed, as you don’t need the date but just the offset. + +- So, you can turn that into a structure like this + - user_id, date, datelist_int + - 32, 2023-01-01, 1000000010000001 → A binary that identifies when the user was active w.r.t `date`, so in this case the user was active Jan 1st, then 25th December of the previous year, then again Dec 17th. + - The 1s in the integer represent the activity for `date - bit_position (zero indexed)` diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lab.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lab.md new file mode 100644 index 00000000..87990604 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lab.md @@ -0,0 +1,111 @@ +# Day 3 - Lab + +Let’s start by creating the `array_metrics` table we will be building today, as described at the end of day 3 lecture. 
+ +```sql +CREATE TABLE array_metrics ( + user_id NUMERIC, + month_start DATE, + metric_name TEXT, + metric_array REAL[], + PRIMARY KEY (user_id, month_start, metric_name) +); +``` + +This is the full query. It’s a bit tricky so read the notes. + +```sql +INSERT INTO array_metrics +WITH +daily_aggregate AS ( + SELECT + user_id, + DATE(event_time) AS date, + COUNT(1) AS num_site_hits + FROM events + WHERE DATE(event_time) = DATE('2023-01-01') + AND user_id IS NOT NULL + GROUP BY user_id, DATE(event_time) +), + +yesterday_array AS ( + SELECT + * + FROM array_metrics + WHERE month_start = DATE('2023-01-01') +) + +SELECT + COALESCE(da.user_id, ya.user_id) AS user_id, + -- we truncate `da.date` because each day it goes up by one, + -- whereas we just need the 1st day of the month + COALESCE(ya.month_start, DATE_TRUNC('month', da.date)) AS month_start, + -- this is hard coded arbitrarily, + -- in a real scenario it's probably a variable in a pipeline or something more dynamic + 'site_hits' AS metric_name, + CASE + WHEN ya.metric_array IS NOT NULL THEN + ya.metric_array || ARRAY[COALESCE(da.num_site_hits, 0)] + WHEN ya.metric_array IS NULL THEN -- (this could be just an ELSE) + -- these date shenanigans just mean: how many days after beginning of the month + ARRAY_FILL(0, ARRAY[COALESCE(date - DATE(DATE_TRUNC('month', date)), 0)]) + || ARRAY[COALESCE(da.num_site_hits, 0)] + -- the reason we have to use `ARRAY_FILL`, is because all arrays for all users + -- have to be of same length. So if a user shows up on the 5th, the first 4 elements + -- of the metric_array must be all zeros, like [0, 0, 0, 0, n] + -- (where n is the number of hits that day). + END AS metric_array +FROM daily_aggregate da + FULL OUTER JOIN yesterday_array ya + ON da.user_id = ya.user_id + +ON CONFLICT (user_id, month_start, metric_name) +DO + UPDATE SET metric_array = EXCLUDED.metric_array; +``` + +To cumulate this, you have to bump the date by one in the 1st CTE every time. + +We use `ON CONFLICT` because contrary to previous examples, in this case we are just “merging” and updating, instead of creating a new partition for each day, since all history that we care about is already contained in the `metric_array` for each month. + +The partitions, like said in the lesson, are `month_start` and `metric_name` as sub-partition. + +If everything was done properly, the next query should have only results with cardinality `2` + +```sql +SELECT + cardinality(metric_array), + COUNT(1) +FROM array_metrics +GROUP BY 1; +``` + +Let’s add one more day, and then do an example aggregate analysis. + +```sql +WITH + +agg AS ( + SELECT + metric_name, + month_start, + ARRAY[SUM(metric_array[1]), + SUM(metric_array[2]), + SUM(metric_array[3])] AS summed_array + FROM array_metrics + GROUP BY metric_name, month_start +) + +SELECT + metric_name, + month_start + CAST(CAST(index - 1 AS TEXT) || ' day' AS INTERVAL) AS date, + elem AS value +FROM agg + CROSS JOIN UNNEST(agg.summed_array) WITH ORDINALITY AS a(elem, index) +``` + +What we’re doing here is essentially finding the total of all `site_hits` (or actually every metric name, if we had more) for each day, and returning them totals for each day. + +![image.png](images/d3la_image.png) + +This operation is very fast, because it’s the minimum set of data that you need to reach this result, and it’s what Zach used at Facebook to save so much time in certain analytical queries. 
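One possible tweak, not from the lab but a sketch of the same idea: unnest first and group by the array index, so the daily totals don't need one hardcoded `SUM(metric_array[i])` per day cumulated so far.

```sql
SELECT
    metric_name,
    month_start + CAST(CAST(idx - 1 AS TEXT) || ' day' AS INTERVAL) AS date,
    SUM(elem) AS value
FROM array_metrics
    CROSS JOIN UNNEST(metric_array) WITH ORDINALITY AS a(elem, idx)
GROUP BY metric_name, month_start, idx
ORDER BY metric_name, date;
```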
diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lecture.md b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lecture.md new file mode 100644 index 00000000..6a1d34d9 --- /dev/null +++ b/bootcamp/materials/2-fact-data-modeling/markdown_notes/Day 3 - Lecture.md @@ -0,0 +1,160 @@ +# Day 3 - Lecture + +# Why should shuffle be minimized + +> *Shuffle happens when you need to have all the data from a specific key on a specific machine.* +> + +Big data leverages parallelism as much as it can + +The thing that sucks about shuffle is that it’s the **bottleneck for parallelism.** + +> If you have a pipeline without shuffles, it can be as parallel as you want. That’s because you don’t need a specific chunk of data to be on ALL machines. You can have all of it split out. +> + +Some steps in your big data pipeline are gonna have more parallelism than other steps. + +The more parallel it is, the more effectively we can crunch big data. + +There are several ways to address shuffling and we will address them later on. + +## What types of queries are highly parallelizable + +### **Extremely parallel** + +- SELECT, FROM, WHERE *(without a window function in SELECT)* + + → This query is **infinitely scalable**. + Imagine you have 1B row and 1B machines, and each machine had 1 row, that’s fine and will work seamlessly. + +### **Kinda parallel** + +- GROUP BY, JOIN, HAVING + + → In GROUP BY, the problem is that in order to do the right aggregation, all the rows for a **key** need to be co-located (in the same machine) in order to correctly count them. Otherwise, if the rows are all spread on different machine, how do you know how many there are in total? + In Spark, you can choose the level of GROUP BY parallelism of group by by setting a value for `spark.sql.shuffle.partitions` (default is 200). Will see more of this in week 5. + + → In JOIN, it’s a bit trickier, because you have to do the shuffle not once, but twice [once per each table in the join, I believe — Ed.]. Here, you have all of the keys on the left side, and all keys on the right side, and they all have to be pushed on a machine (one partition in your shuffle partitions, same setting as for GROUP BY). Finally, after shuffling, the comparison between left and right keys can happen. + + → HAVING and GROUP BY kinda go hand in hand. Technically, HAVING is as parallel as the first category (SELECT etc…) because it’s just a filter, but it’s a step **after** GROUP BY, so you can only apply it after a shuffle. + +### **Painfully not parallel** + +- ORDER BY *(at the end of a query, not in a window function)* + + → This one you should almost **NEVER** use in distributed compute. This is the most painful and LEAST parallelizable keyword in SQL. + + Let’s go back to the example with 1B rows. The ONLY WAY to know that the data is sorted, is if it all goes back into one machine, and it all gets passed through 1 machine, which is exactly the opposite of parallel. + + You can use ORDER BY, but only at the end of your queries, after all aggregations and stuff, when the final result is a relatively small amount of rows. + + → **How does it work in a window function instead?** + + Here, ORDER BY doesn’t do a global sort. It can, if you don’t put any PARTITION BY in the window function, like if you do a global `rank` function, but if you use PARTITION, then it’s not a global problem but a shuffle problem. 
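A small illustration, reusing the `fct_game_details` table from the Day 1 lab (the SQL is the same whether it runs on Postgres or a distributed engine; the point is what each keyword asks the engine to do):

```sql
-- Global sort: every row has to funnel through one final ordering pass.
SELECT dim_player_name, m_pts
FROM fct_game_details
ORDER BY m_pts DESC;

-- ORDER BY inside a window function with PARTITION BY: each partition
-- (one game date here) is ranked independently, so it's "only" a shuffle
-- on dim_game_date rather than a single global sort.
SELECT
    dim_game_date,
    dim_player_name,
    m_pts,
    RANK() OVER (PARTITION BY dim_game_date ORDER BY m_pts DESC) AS pts_rank
FROM fct_game_details;
```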
+ + In a way, PARTITION BY in a window function and GROUP BY in a regular query behave very similarly in the distributed computing world, especially in the way they affect shuffle. + +The reason we’re talking about all of this in a **fact data modeling** module, is because + + + +If you have data that’s structured in a certain format, you can skip out on using some of these keywords. Good to remember when planning your fact data modeling. + +Remember this paragraph as it’s very useful for not just data modeling, but troubleshooting Spark problems and related problems as well! + +## How to make GROUP BY more efficient? + +- **Give GROUP BY some buckets and guarantees** (bucketing is supported by many things, not just Spark, e.g. Delta, Iceberg, Hudi etc…) + + → This essentially means **pre-shuffling the data.** Imagine you want to put the data into 8 buckets, you’d choose a (usually high cardinality field), and then you bucket on that field. It will perform the grouping (with MOD operator) when you write the data out. + + This way, when we do a GROUP BY, we don’t have to shuffle, because it already has been! + +- **REDUCE THE DATA VOLUME AS MUCH AS YOU CAN!** + +### How reduced fact data modeling gives you superpowers + +**Fact data often has this schema** + +- user_id, event_time, action, date_partition +- very high volume, 1 row per event + +**Daily aggregate often has this schema** + +- user_id, action_count, date_partition +- medium sized volume, 1 row per user per day + +**Reduced fact take this one step further** + +- user_id, action_count ARRAY, month_start_partition or year_start_partition + + → This cuts the daily data by 30 (if you use month) or 365 (if you use year). + +- low volume, 1 row per user per month or year. +- this is as small as you can get + +The key to remember here is you have these 3 flavours. Normal fact data, daily aggregates, and reduced facts. They have tradeoffs (especially 2 and 3) because as the data gets smaller you lose some of the flexibility on what types of analytics you can do on it. Usually that’s worth the trade-off as it allows you to do analytics more quickly. + +**Example fact data** + +![image.png](images/d3le_1.png) + +We have the typical schema described above. `user_id`, `event_time`, `action`, `date`, and other less important info. + +Here it’s the most granular data, and you can ask very specific questions, but if the data is too large, you can’t make this kind of analysis over a large timeframe (e.g. a month). + +In a longer time horizon, using this schema is largely inconvenient. + +**Example daily aggregated data** + +![image.png](images/d3le_2.png) + +In this case, we have one row per **user** per **metric** per **day**. + +We have lost some details here, but you can work on a much longer time horizon now (e.g. 1 or 2 years). Also, this table can be joined with SCDs, and can be done aggregates at the higher level and bring on other dimensions. + +In this table, your partition is not just `date` but `date` AND `metric_name` (as sub-partition). + +However, we can still make this schema smaller and not lose anything. + +**Example long-array metrics** + +![image.png](images/d3le_3.png) + +In this schema, there’s **only one row per month**. And then we have a `value_array`. 
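As a rough sketch, the DDL for such a table could look something like this (the names are illustrative, not the exact Facebook schema; it's essentially the same shape as the `array_metrics` table built in the Day 3 lab):

```sql
CREATE TABLE reduced_fact_metrics (
    user_id     BIGINT,
    month_start DATE,    -- partition key; dates are stored as offsets from here
    metric_name TEXT,    -- sub-partition, e.g. 'likes_given', 'comments_received'
    value_array REAL[],  -- element 1 = month_start, element 2 = month_start + 1 day, ...
    PRIMARY KEY (user_id, month_start, metric_name)
);
```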
+ +The reason they came up with this model, is because there was the 1st decline in growth at Facebook in the history of the company, and they were panicking and wanting to look at all the metrics in the whole history of Facebook and stuff trying to figure out what was going on etc… They were working with daily metrics, that were pretty fast, but it was not gonna work on longer time frames. + + + +In this schema, the `date` is like an index. For the 1st row, `34` is the number of likes given on July 1st, `3` on July 2nd, and so on until the end of the month. + +This is quite similar to the Lab 2 of this week, with the date list and the bit mask of 1s and 0s. + +--- + +### Reduced fact data modeling - continued + +- Daily dates are stored as an offset of **month_start / year_start** + - First index is for date month_start + zero days + - Last index is for date month_start + array_length - 1 +- Dimensional joins get weird if you want things to stay performant +→ You don’t want to bring in a dimension in the middle of the month. + - Your SCD accuracy becomes the same as month_start or year_start + - You give up 100% accurate SCD tracking for massively increased performance + - You need to pick snapshots in time (month start or month end or both) and treat the dimensions as fixed. +- Impact of analysis + - Multi-year analyses took hours instead of weeks + → A 10 year backfill really took about a week or more in Zach’s situation, before the adoption of this model + - Unlocked “decades-long slow burn” analyses at Facebook +- Allowed for fast correlation analysis between user-level metrics and dimensions diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d1le_image.png b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..009937e9 Binary files /dev/null and b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3la_image.png b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3la_image.png new file mode 100644 index 00000000..617c7426 Binary files /dev/null and b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3la_image.png differ diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_1.png b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_1.png new file mode 100644 index 00000000..de5700b6 Binary files /dev/null and b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_1.png differ diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_2.png b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_2.png new file mode 100644 index 00000000..18d04aac Binary files /dev/null and b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_2.png differ diff --git a/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_3.png b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_3.png new file mode 100644 index 00000000..d20573b1 Binary files /dev/null and b/bootcamp/materials/2-fact-data-modeling/markdown_notes/images/d3le_3.png differ diff --git a/bootcamp/materials/3-spark-fundamentals/.gitignore b/bootcamp/materials/3-spark-fundamentals/.gitignore index f73806eb..b3899eb4 100644 --- a/bootcamp/materials/3-spark-fundamentals/.gitignore +++ b/bootcamp/materials/3-spark-fundamentals/.gitignore @@ -1 +1,3 @@ -.ipynb_checkpoints/ \ No newline at end of file 
+.ipynb_checkpoints/ +.venv/ +*.pyc \ No newline at end of file diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..b980c594 --- /dev/null +++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,159 @@ +# Day 1 - Lab + +Find the setup instructions at [this link](https://github.com/DataExpert-io/data-engineer-handbook/blob/6c32b89b9cc845471ebbfae327e71d434d569bc6/bootcamp/materials/3-spark-fundamentals/README.md#spark-fundamentals-and-advanced-spark-setup)! + +Then visit `localhost:8888`, go to `/notebooks` and open `event_data_pyspark`. Run the cell and if it works properly, you’re good to go! + +Spark is managed by this thing called `SparkSession`, which you build with the first command. Usually you create some sort of `appName` , and then you run `.getOrCreate()`. + +You can notice that the syntax is not Pythonic is because Spark is a JVM library, so it doesn’t use `camel_case`. All Pyspark does is wraps the Spark libraries in Python. + +Try running `.collect()` instead of `.show()` in the first cell. It will output all 500k rows, and it might OOM. + +Now try running + +```python +df.join(df, lit(1) == lit(1)).collect() +# what this does is join each row with every other row. It's like a CROSS JOIN +``` + +and most likely you will get a `OutOfMemoryError: Java heap space`. + +For the record, doing `.collect()` on the whole dataset is a bad practice, as **you don’t ever wanna pull the entire dataset into the driver** (unless it’s really small). + +Now restart the Kernel (in the toolbar on top, under Kernel) because probably this operation killed it, and try running this instead + +```python +df.join(df, lit(1) == lit(1)).take(5) +``` + +As we’re bringing just 5 rows back to the driver, this time it will work without problems. + +Let’s move on to the next cell, run this and how these lines work. + +```python +sorted = df.repartition(10, col("event_date"))\ + .sortWithinPartitions(col("event_date"), col("host"))\ + .withColumn("event_time", col("event_time").cast("timestamp")) +``` + +- `repartition` ⇒ regardless of how many partitions we had before, now we will have 10, split by `event_date` + +Now run this too + +```python +sortedTwo = df.repartition(10, col("event_date"))\ + .sort(col("event_date"), col("host"))\ + .withColumn("event_time", col("event_time").cast("timestamp")) + +sorted.show() +sortedTwo.show() +``` + +You’ll see that the difference is that they actually pulled from different data. How is that? `sort` and `sortWithinPartitions` are different, **even more at scale**. + +- `sortWithinPartitions` ⇒ It will sort the data **LOCALLY** for each partition (based on the specified key[s]). +- `sort` ⇒ This is instead a **GLOBAL** sort, which means it has to pull data all inside the driver, and it’s very slow (on large data). + +Let’s run the `.explain()` for both DataFrames, so we can see the difference between them. + + + +As you can see, the first step is reading the CSV. 
+ +```python ++- FileScan csv [device_id#369,browser_type#370,os_type#371,device_type#372] +Batched: false, +DataFilters: [isnotnull(device_id#369)], +Format: CSV, +Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/devices.csv], +PartitionFilters: [], +PushedFilters: [IsNotNull(device_id)], +ReadSchema: struct +``` + +Then there are a bunch more steps (which depend on your code), notably: + +- `Project` → same as `SELECT` +- `Exchange` → where repartition happens. + +Finally we get to the `Sort` line, almost all the way to the top. + +```python ++- Sort [event_date#343 ASC NULLS FIRST, host#334 ASC NULLS FIRST], **false**, 0 +``` + +The boolean in there, represents wether we’re doing a GLOBAL sort or not. In this case, `false` indicates the sort is NOT GLOBAL. + +You will see that in the other plan, there’s a step which doesn’t exist in the 1st one: + +```python ++- Exchange rangepartitioning(event_date#343 ASC NULLS FIRST, + host#334 ASC NULLS FIRST, 200), + ENSURE_REQUIREMENTS, + [plan_id=2429] +``` + +This is the line that’s gonna be painful at scale. Every time you see he word **EXCHANGE** in you query plan, think it means **SHUFFLE**. + +In this case `sort` is causing the shuffle because a GLOBAL sort makes all data pass through an executor, as that’s the only way you can guarantee that your data is globally sorted. So at scale, always use `sortWithinPartition`. In fact, basically never use `sort`. + +Let’s now move on to the cell where we create the tables with SQL and run them all. + +```sql +%%sql + +CREATE TABLE IF NOT EXISTS bootcamp.events ( + url STRING, + referrer STRING, + browser_family STRING, + os_family STRING, + device_family STRING, + host STRING, + event_time TIMESTAMP, + event_date DATE +) +USING iceberg +PARTITIONED BY (years(event_date)); +``` + +Run them all until + +```python +# df_2 = df.select('event_date, host, user_id, referrer, url, event_time, device_id'.split(', ')) +start_df = df.repartition(4, col("event_date")).withColumn("event_time", col("event_time").cast("timestamp")) \ + +first_sort_df = start_df.sortWithinPartitions(col("event_date"), col('browser_family'), col("host")) + +start_df.write.mode("overwrite").saveAsTable("bootcamp.events_unsorted") +first_sort_df.write.mode("overwrite").saveAsTable("bootcamp.events_sorted") +``` + +Try adding and removing sort columns inside `final_sort_df` and see how the result of the next queries change. + +```sql +%%sql + +SELECT SUM(file_size_in_bytes) as size, COUNT(1) as num_files, 'sorted' +FROM demo.bootcamp.events_sorted.files + +UNION ALL +SELECT SUM(file_size_in_bytes) as size, COUNT(1) as num_files, 'unsorted' +FROM demo.bootcamp.events_unsorted.files + +-- This query is pretty dope because it allows you +-- to see how many files an iceberg table has, its size and other info. +``` + +You will notice from the screenshot below that the size of the `sorted` table is about 10% smaller than the `unsorted` one. This is due to **run length encoding**, which Zach explained in a previous lecture. But in short, it means: + +- When writing data out, you want it to be written out from lowest cardinality to highest. + - Event date has low cardinality, and host too, but user_id has high cardinality, so you put it last. +- Sort your data (within partitions!) by these low cardinality fields to reduce output side! 
+ +![image.png](images/d1la_image.png) diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..53de1786 --- /dev/null +++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,293 @@ +# Day 1 - Lecture + +# Intro + +Topics covered in this course: + +- Architecture +- The roles of the Driver +- The roles of the Executor +- How does it all come together? +- Powers +- When to choose Spark? + +Later in the lab we will cover: + +- Partitioning +- Sorting +- Data read/write operations + +# What is Apache Spark? + +It’s a **distributed compute** framework that allows you to process very large amounts of data efficiently. + +It’s a successor of legacy technologies such as Hadoop and MapReduce, then Hive, and now Spark is the predominant choice. + +## Why is Spark so good? + +Spark leverages RAM much more efficiently and effectively than previous generations (those mentioned above). For instance, to do a GROUP BY in Hive or MapReduce, everything had to be written to disk and then read from disk again. It was very resilient, but painfully slow. + +Spark minimizes writing to disk, doing so only when it doesn’t have enough memory for an operation — This is called “**spilling to disk”.** You want to avoid this as much as possible, and use as much RAM as you can. + +Also, Spark is storage agnostic. You can read whatever you want. A database, object storage, a file, mongoDB, whatever. + +## When is Spark not so good? + +- Nobody else in the company knows Spark + - Spark is not immune to the bus factor! +- Your company already uses something else a lot. + - Inertia is often times not worth it to overcome. + - Homogeneity of pipelines matters a lot → it’s better to have 20 BQ pipelines than 19 BQ pipelines and 1 Spark pipeline. + +# How does Spark work? + +Spark has a few pieces to it (with a basketball analogy). + +- The plan → (the play) +- The driver → (the coach) +- The executors → (the players) + +## The Plan + +This is where the “coach” will tell the players what play to do. + +- This is the transformation you describe in Python, Scala or SQL. +- The plan is evaluated lazily + - execution only happens when it needs to. + → In other words, the play happens only when “the player takes a shot”, which means when they try to write data out somewhere, or when they try to *collect* data somewhere. **Collect** means *to take information from the play and bringing it back to the coach, so the coach can tell them what to do next.* + +## The Driver + +- The driver reads the plan (the coach). +- Important Spark driver settings + - `spark.driver.memory` → amount of memory the driver has to process the job. Default is prob 2GB, can go all the way up to 16GB. Usually when you need to bump it up when you have a crazy job with so many steps and plans, or what’s described in the picture below (although that’s considered often bad practice). + - `spark.driver.memoryOverheadFactor` → the driver needs memory to process the plan. If for some reason you have a very complicated plan, you might have more overhead memory (cause the JVM might take up more memory), and you might want to crank it up. **Non-heap** memory is the memory that Java needs to run, not the memory for the plan or `.collect()`, it’s just the memory that Java needs to run, and you only need to bump this up if you have a very complex job. 
+ +![image.png](images/d1le_image.png) + + + +- Driver determines a few things + - When to actually start executing the job and stop being lazy + - How to `JOIN` datasets + - This is very important because depending on what type Spark decides to use, this can mean the job is 10 times more performant, or the exact opposite. + - How much parallelism each step needs + +## Executors (who do the actual work) + +- The driver passes the plan to the executors. +- Three settings to touch here + - `spark.executor.memory` → Has the same constraints as the `driver.memory`, defaults to 1 or 2GB and can go up to 16GB. + - **Bad practice with this setting** → If a job OOMs, people just update this to 16GB and then forget about it. The job will work but will be much more expensive. + - Better idea is to run it at a bunch of different memory levels. 2, 4, 6 GBs and so on. Once you find the one that doesn’t crash for a couple of days, that’s your smallest number you can go with. + - You don’t wanna have too much “padding”, because you waste resources, but you also don’t want to have to little **padding**, because the chance of OOM is annoying and when they break it’s even more expensive. + - `spark.executor.cores` → Generally, an executor can take 4 tasks at once. You can go up higher to 6, and gives more parallelism per executor. But if you have more than 6, another bottleneck happens: the executor’s disk and the throughput between when tasks finish. Zach often doesn’t touch this setting. Another thing to be careful about is that increasing this increases the amount of memory used, and this might cause OOM. + - `spark.executor.memoryOverheadFactor` → This is similar to the driver one. Especially with UDFs you want to bump this number up, as Spark UDFs are quite terrible. It can make a difference if the job is unreliable and has a bit crazy plan. + +![image.png](images/d1le_image%201.png) + +## Types of JOINs in Spark + +**Shuffle sort-merge Join** + +- Default JOIN strategy since Spark 2.3 +- The **least performant** of them all +- But also the **most versatile** → it works no matter what +- Works when both sides of the JOIN are large. + +**Broadcast Hash Join** + +- Works only when one side of the JOIN is small. Then it works pretty well. In that case instead of shuffling the JOIN, it just ships the whole dataset to the executors. Works great for small-ish amounts of data. Zach has been able to broadcast up to 8 to 10 GB, more than that and it will probably create OOM problems. + +**Bucket Join** + +- A JOIN without shuffle, you can do this by pre-bucketing data (more on this later). + + + +## How does shuffle work? + +![image.png](images/d1le_image%202.png) + +As we said before, this is the least scalable part of Spark. As scale goes up, the usage of shuffle gets more and more painful. + +Once you start processing 20-30TBs per day, shuffle goes out of the window, and you have to try to solve the problem in a different way. + +Imagine you have a table you’re reading in, and it has 4 files, and you’re gonna do a MAP operation (e.g. adding a column, or filtering down, etc…). This operation is infinitely scalable (to the number of original files) as it doesn’t need shuffle. + +Then say there’s a GROUP BY operation. The default is 200 partitions (the diagram is a simplified view). + +What ends up happening is, imagine you’re grouping on `user_id`, is that `user_id` is modded by the number of partitions (3 in this case). 
+ +- If 0 → it goes to partition 1 +- If 1 → partition 2 +- if 2 → partition 3 + +This is how shuffle works in its simplest form. + +For a `JOIN`, imagine that file 1 and 2 belong to one table, and 3 and 4 belong to another. Then, again, all `user_ids` will be assigned to the same partitions in the same exact way as before. + +After all ids have been shuffled to their respective partition, THEN you can do the comparison, one by one, and finish the JOIN. + +This is **the default case →** **shuffle sort-merge join**. + +In **broadcast join** instead, files 3 and 4 (the 2nd table) are small enough, so you ship all their data to each of the executors (where file 1 and 2 reside) so that you don’t have to do any shuffling at all (which means, the 2nd half of the diagram doesn’t happen). + +In **bucket join**, the tables has been “pre-shuffled” already. This mean both left and right side of the join get put into an equivalent number of buckets. + +> Editor’s note — bucketing is a technique and has nothing to do with S3 buckets or similar concept. In short, it simply writes out the data in $n$ files, where each file ⇒ 1 bucket. It’s still a grouping operation, because the key you bucket buy gets all sent to the same bucket! +In code, it looks like this +> + +```python +n_buckets = 10 +data.write.bucketBy(n_buckets, "my_bucketing_key").saveAsTable("table_name"); +``` + +The result of this is that the files already have the guarantee that all the data you’re looking for, based on the bucketing key, is in there. Then when you do the JOIN, you just match up the buckets. + +In the diagram example, imagine that the data has been bucketed, so you line up File 1 and File 3, and File 2 and File 4, and perform the JOIN without a shuffle. This gives a massive performance gain. + +Moreover: even if the two tables don’t have the same number of buckets, you can do a bucket join assuming they’re **multiples of each other**. Imagine in this case instead we had 1 table with 2 buckets, and another with 4 buckets. + +Then (say) files 1 and 2 would line up with just file 3. + +The lesson here is: + + + +Don’t be weird and use odd number of buckets like 7 or 13, because then the only way to obtain a bucket join is if the other side ALSO has the same number of buckets or a weird multiple of it. + +**How do you pick the number of buckets?** + +Often it’s based on the volume of the data. + +E.g. in Facebook, Zach was processing 10 TBs of data, so 1024 buckets would lead to 10 GBs per file. This is kind of the idea. + +The rule of thumb here is: if you have too many buckets on data that’s too small, then you have the same problem where there might not be any data in one of the buckets because of the `mod` problem (because of the [**pigeonhole principle**](https://en.wikipedia.org/wiki/Pigeonhole_principle)) → Imagine if you have 1000 rows but 1024 buckets. 24 buckets WILL be empty! *[Also, there’s a bit of overhead due to the I/O issue of opening a lot of empty or super small files — Ed.]* + +### Shuffle + +Shuffle partitions and parallelism are linked! + +Two settings + +- `spark.sql.shuffle.partitions` +- `spark.default.parallelism` + +For the most part, they’re quite the same, except 1 tiny exception if you’re using the RDD API directly. + +Their value is essentially the number of partitions that you get AFTER the shuffle operation (e.g. JOIN or GROUP BY). + +**Is shuffle good or bad?** + +Don’t think that shuffle is always inherently bad. It depends on the situation. 
+ +- If the volume is low-to-medium ( < 10 TBs ) → it’s really good and makes our lives easier. +- At high volumes ( > 10 TBs ) + - Very painful! + - At Netflix, a shuffle killed the IP enrichment pipeline. + This was explained in a previous lecture, but the TL;DR was they got rid of the shuffle entirely by solving the problem upstream *[shift left — Ed.],* by having the devs log the data they were joining directly in the event logs. *[The change was more organizational than data oriented, as he had to talk with hundreds of people, all maintaining different microservices — Ed.]*. + + + +This last example shows how taking a different approach was way more beneficial than trying to power through it via Spark. + +### How to minimize shuffle at high volumes? + +- Bucket the data if multiple JOINs or aggregations are happening downstream + - This is a **waste** is all you have to do is **ONE SINGLE JOIN**. That’s because you have to pay the shuffle cost at least once anyway to bucket your data! + - *Also: Presto can be weird with bucketed tables, especially if you have tables with small number of buckets (e.g. 16). If you query stuff with Presto the query might be slower because the initial parallelism would always be 16, and maybe Presto was expecting more.* +- Spark has the ability to bucket data to minimize or eliminate the need for shuffle when doing JOINs. +- Bucket joins are very efficient but have drawbacks + - Main drawback is that initial parallelism = number of buckets +- Bucket joins **only work** if the two number of buckets are multiples of each other! + - **As said before, ALWAYS use powers of 2 for # of buckets!!!** + +### Shuffle and Skew + +Sometimes some partitions have dramatically more data than others. + +This can happen because: + +- Not enough partitions +- The natural way the data is + - Beyonce gets a lot more notifications than the average Facebook user + +This is problematic because the heavy partition (Beyonce) will be put all into 1 executor, which might go OOM and therefore fail the job. Imagine if this happens after the job has ran for several hours and then it fails at 99%. + +**How to tell if your data is skewed?** + +- Most common is a job getting to 99%, taking forever (way longer than you expect the job to take), and failing +- Another, more scientific way is to do a box-and-whiskers plot of the data to see if there’s any extreme outliers + +**Ways to deal with skew** + +- Adaptive query execution - only in Spark 3+ + - Set `spark.sql.adaptive.enabled = True` + - This alone should solve your problem. Nothing else to do. Period. + - Don’t set this to `true` all times just in case. It makes the job slower. +- Salting the GROUP BY - best option before Spark 3. +*Works only with a skewed GROUP BY, not a skewed JOIN.* + - Add a column with random numbers as value, and that breaks up the skew. + - Then you GROUP BY twice: once by this random number, which breaks up the skew, by distributing it across the executors. This gives you a partial aggregation. And then GROUP BY again, removing the random number and aggregate everything up again, and that will give you the same thing as you would get otherwise. + - Be careful using this when it comes to **additive vs non additive dimensions.** In most cases you’ll be fine, but some, might give unpredictable results! + - Also be careful with things like AVG aggregations. Break it into SUM and COUNT and divide! + - In JOINs, you can’t do the salting, so there are other strategies. 
+ - One is to exclude the big outliers (Beyonce can wait tomorrow to be joined) + - Another is to partition the table and have essentially two pipelines (one for the big partitions, and another for the normal data). + +# Spark on Databricks vs regular Spark + +![image.png](images/d1le_image%203.png) + +There’s not a lot to add here aside from the above table. In short, Spark on DBX is all notebook based, which Zach doesn’t like [and neither do I — Ed.], since notebooks don’t encourage good engineering practices, but at the same time they enable less technical people to use Spark, which is not all too bad. + +# Miscellaneous + +**How to look at Spark query plans** + +Use `.explain()` on your dataframes. + +- This will show you the join strategies that Spark will take. + +**Where can Spark read data from** + +Everywhere + +- From the lake + - Delta Lake, Apache Iceberg, Hive metastore +- From an RDBMS + - Postgres, Oracle, etc… +- From an API + - Make a REST call and turn into data + - Be careful because this usually happens on the Driver! + - Keep in mind that if you need to read from a database that also exposes an API, it’s often preferable to go one level up and read from the DB directly rather than using the API, as it performs better! +- From a flat file (CSV, JSON, etc…) + +**Spark output datasets** + +- Should almost always be partitioned on “date” + - Execution date of the pipeline + - In big tech, this is called “ds partitioning” (in Airflow, this is `{{ds}}`). + - If you don’t partition on date, your tables are gonna get way too big and bulky. diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lab.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lab.md new file mode 100644 index 00000000..4c652a7b --- /dev/null +++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lab.md @@ -0,0 +1,318 @@ +# Day 2 - Lab + +In this lab we will use Scala, and take a look at the **Dataset API.** + +Open the notebook `DatasetApi.ipynb`. + +Take a look at the 1st cell + +```scala +import org.apache.spark.sql.SparkSession + +val sparkSession = SparkSession.builder.appName("Juptyer").getOrCreate() + +case class Event ( + //Option is a way to handle NULL more gracefully. + // In other words, it means it's nullable. + user_id: Option[Integer], + device_id: Option[Integer], + referrer: Option[String], + host: String, + url: String, + event_time: String +) +``` + +Just by using this `case class`, we can guarantee that `host`, `url` and `event_time` are never null, otherwise the pipeline will fail. This, again, is NOT AVAILABLE in SparkSQL or DataFrame API. + +What we’re gonna do in this lab is take a bunch of tables and join them together (**event + device → eventWithDeviceInfo**). + +A little more below, you will see this piece of code + +```scala +val events: Dataset[Event] = sparkSession.read.option("header", "true") + .option("inferSchema", "true") + .csv("/home/iceberg/data/events.csv") + .as[Event] // notice this last line +``` + +The idea here is we read a CSV, and `as[Event]` describes that the `events` is now a **Dataset** of **Event** + +The same is true for the other Dataset in this lab (Device). + +What this does is it gives you the ability to work with this data in Scala directly. 
See these lines + +```scala +val filteredViaDataset = events.filter(event => event.user_id.isDefined && event.device_id.isDefined) +val filteredViaDataFrame = events.toDF().where($"user_id".isNotNull && $"device_id".isNotNull) +val filteredViaSparkSql = sparkSession.sql("SELECT * FROM events WHERE user_id IS NOT NULL AND device_id IS NOT NULL") +``` + +They do exactly the same thing, but: + +- In the 1st one, we’re working with the Dataset directly. +- In the second, we’re using the DataFrame API. +- The third uses SparkSQL. + +As you can see, the Dataset API is quite convenient. + +Let’s now do a join, starting with SQL + +```scala +//Creating temp views is a good strategy if you're leveraging SparkSQL +filteredViaSparkSql.createOrReplaceTempView("filtered_events") +val combinedViaSparkSQL = spark.sql(f""" + SELECT + fe.user_id, + d.device_id, + d.browser_type, + d.os_type, + d.device_type, + fe. referrer, + fe.host, + fe.url, + fe.event_time + FROM filtered_events fe + JOIN devices d ON fe.device_id = d.device_id +""") +``` + +The we follow with DataFrame + +```scala +// DataFrames give up some of the intellisense because you no longer have static typing +val combinedViaDataFrames = filteredViaDataFrame.as("e") + //Make sure to use triple equals when using data frames + .join(devices.as("d"), $"e.device_id" === $"d.device_id", "inner") + .select( + $"e.user_id", + $"d.device_id", + $"d.browser_type", + $"d.os_type", + $"d.device_type", + $"e.referrer", + $"e.host", + $"e.url", + $"e.event_time" + ) +// the `$` is the equivalent of `col('e.user_id')`. +``` + +Last one is with the Dataset API + +```scala +// This will fail if user_id is None, which is why we have to manage nulls +// and use this syntax: `user_id.get` +// Alternative you can write `getOrElse(..), which is like COALESCE +// Here in specific we can use only .get +// because we're filtering out null ids in a previous step +val combinedViaDatasets = filteredViaDataset + .joinWith(devices, events("device_id") === devices("device_id"), "inner") + .map{ case (event: Event, device: Device) => EventWithDeviceInfo( + user_id=event.user_id.get, + device_id=device.device_id, + browser_type=device.browser_type, + os_type=device.os_type, + device_type=device.device_type, + referrer=event.referrer.getOrElse("unknow"), + host=event.host, + url=event.url, + event_time=event.event_time + ) } +``` + +What’s nice about this last join is you get access to the left and right side of the join, their schemas and type, and it’s really easy to map everything to the new schema. + +Another thing about Dataset API is that it’s really nice with UDFs. Imagine we create a function like `toUpperCase` (it already exists, but it’s just for simplicity of the example). + +```scala +def toUpperCase(s: String): String ( + return s.toUpperCase() +) + +// What you can do with this, with the Dataset API, you can call `.map` again, and do + +.map( case (row: EventWithDeviceInfo) => { + row.browser_type = toUpperCase(row.browser_type) + return row +}) +``` + +W.r.t. the DataFrame API, this example above is much simpler. Let’s see the difference + +```scala +val toUpperCaseUdf = udf(toUpperCase _ ) + +// then in the schema definition +// [..] +toUpperCaseUdf($"d.browser_type").as("browser_type") +``` + +Another nice thing from these APIs is that you can create dummy data very easily, e.g. 
+ +```scala + +val dummyData = List( + Event(user_id=Some(1), device_id=Some(2), referrer=Some("linkedin"), host="eczachly.com", url="/signup", event_time="2023-01-01"), + Event(user_id=Some(3), device_id=Some(7), referrer=Some("twitter"), host="eczachly.com", url="/signup", event_time="2023-01-01") + ) +``` + +--- + +Now open the `Caching.ipynb` notebook and run the first cell. + +We have two datasets here, `users` and `devices`, in a many-to-many relationship. What we’re doing here is create both sides → A users and all of their devices, and a device and all of its users. + +We’re gonna start with this + +```scala +val eventsAggregated = spark.sql(f""" + SELECT user_id, + device_id, + COUNT(1) as event_counts, + COLLECT_LIST(DISTINCT host) as host_array + FROM events + GROUP BY 1,2 +""") +``` + +The pipeline is quite trivial, it’s to showcase the re-use of the Dataframe. In fact it’s later used in two different spots: + +```scala +val usersAndDevices = users + .join(eventsAggregated, eventsAggregated("user_id") === users("user_id")) + .groupBy(users("user_id")) + .agg( + users("user_id"), + max(eventsAggregated("event_counts")).as("total_hits"), + collect_list(eventsAggregated("device_id")).as("devices") + ) + +val devicesOnEvents = devices + .join(eventsAggregated, devices("device_id") === eventsAggregated("device_id")) + .groupBy(devices("device_id"), devices("device_type")) + .agg( + devices("device_id"), + devices("device_type"), + collect_list(eventsAggregated("user_id")).as("users") + ) +``` + +Now run `eventsAggregated.unpersist()`, then remove the `.cache()` call and rerun the first cell. Copy the first query plan, then run it again but this time with `.cache()` for `eventsAggregated`. + +Now go to [Diffchecker](https://www.diffchecker.com/), and paste one query plan in one box and the other query plan in the other box, then check the differences. + +You will see that most changes are just about ids and numbers, but one in specific is `InMemoryTableScan`, which is the indication that the table was read from memory, i.e. from the cache. + +> Keep in mind that cache is only going to be useful if you’re going to **reuse** the cached data, and not just use it once! +> + +Remember there’s several ways to cache: + +- Memory only → `StorageLevel.MEMORY_ONLY` +- Disk → `StorageLevel.DISK_ONLY` + - This one however doesn’t make make a lot of sense. If you want to persist data to disk, it’s much better to just `saveAsTable('tableName')`, which can be easily queried separately. +- Both → `StorageLevel.MEMORY_AND_DISK` or simply `.cache()` + +--- + + + +Now let’s open the last notebook, `bucket-joins-in-iceberg.ipynb`. + +We have 2 datasets, `matches` and `matchDetails`, which can be joined on `match_id` . 
+ +Let’s start by creating the tables, by running only this code + +```scala +import org.apache.spark.sql.functions.{broadcast, split, lit} + +val matchesBucketed = spark.read.option("header", "true") + .option("inferSchema", "true") + .csv("/home/iceberg/data/matches.csv") +val matchDetailsBucketed = spark.read.option("header", "true") + .option("inferSchema", "true") + .csv("/home/iceberg/data/match_details.csv") + +spark.sql("""DROP TABLE IF EXISTS bootcamp.matches_bucketed""") +val bucketedDDL = """ +CREATE TABLE IF NOT EXISTS bootcamp.matches_bucketed ( + match_id STRING, + is_team_game BOOLEAN, + playlist_id STRING, + completion_date TIMESTAMP + ) + USING iceberg + PARTITIONED BY (bucket(16, match_id)); + """ +spark.sql(bucketedDDL) + +val bucketedDetailsDDL = """ +CREATE TABLE IF NOT EXISTS bootcamp.match_details_bucketed ( + match_id STRING, + player_gamertag STRING, + player_total_kills INTEGER, + player_total_deaths INTEGER + ) + USING iceberg + PARTITIONED BY (bucket(16, match_id)); +""" +spark.sql(bucketedDetailsDDL) +``` + +You can see that in the table DDL, we specify the partition schema (`PARTITIONED BY…`). But how you write out the data matters too. Notice the `bucketBy(16, "match_id")`. + +```scala +matchesBucketed.select( + $"match_id", $"is_team_game", $"playlist_id", $"completion_date" + ) + .write.mode("append") + .bucketBy(16, "match_id") + .saveAsTable("bootcamp.matches_bucketed") + +matchDetailsBucketed.select( + $"match_id", $"player_gamertag", $"player_total_kills", $"player_total_deaths") + .write.mode("append") + .bucketBy(16, "match_id").saveAsTable("bootcamp.match_details_bucketed") +``` + +After this one runs, if you run + +```scala +spark.sql("select * from bootcamp.match_details_bucketed.files").count() +``` + +You will see that it amounts to 16, exactly our number of required buckets. + +Finally, you can run this + +```scala +spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") + +matchesBucketed.createOrReplaceTempView("matches") +matchDetailsBucketed.createOrReplaceTempView("match_details") + +spark.sql(""" + SELECT * FROM bootcamp.match_details_bucketed mdb JOIN bootcamp.matches_bucketed md + ON mdb.match_id = md.match_id +""").explain() + +spark.sql(""" + SELECT * FROM match_details mdb JOIN matches md ON mdb.match_id = md.match_id +""").explain() +``` + +You will notice that the plan for the 1st query, that uses bucketed tables, **does NOT have EXCHANGE** steps. In other words, it’s not performing a shuffle! diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..6d0c2016 --- /dev/null +++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,153 @@ +# Day 2 - Lecture + +# Intro + +In this lecture we will talk about + +- difference between Spark Server vs Spark Notebook +- PySpark and Scala Spark usage scenarios +- Implications of using UDFs +- PySpark UDF vs Scala Spark UDF +- DataFrame, SparkSQL, RDD, DataSet APIs + +This lecture is a bit all over the place, miscellaneous, so it doesn’t have a lot of linearity. + +# Spark Server vs Spark Notebooks + +- Spark Server (how Airbnb does it) → Where you have to submit your Spark jobs via CLI + - Every run is fresh, things get uncached automatically + - Nice for testing +- Notebook (how Netflix does it) → You have just 1 Spark session that stays live and you have to terminate it later. 
    - Make sure to call `.unpersist()` when you're done.

## Databricks considerations

- Should be connected with Github
    - PR review process for EVERY change
    - CI/CD checks

The problem with DBX is that any change you make in a notebook gets picked up immediately, so you can potentially fuck up production with bad data in the blink of an eye. Notebooks can just be changed on the fly if they're not checked into version control and covered by a CI/CD pipeline, and this is dangerous.

# Caching and temporary views

**Temporary views**

- Kinda like a CTE
- The problem with temp views is that if you use them multiple times downstream, they get recomputed every time, UNLESS CACHED (`.cache()`).

**Caching**

- Storage levels
    - MEMORY_ONLY → Really fast
    - DISK_ONLY
    A "materialized view" and caching to disk are more or less the same thing: it just means writing the data out.
    - MEMORY_AND_DISK (the default)
- Caching is really only good if the data fits into memory. *[Zach doesn't really recommend caching to disk — Ed.]*
    - Otherwise there's probably a staging table you should add to your pipeline!
- In notebooks
    - Call `.unpersist()` when you're done, otherwise the cached data will just hang around!

## Caching vs Broadcast

- Caching
    - Stores pre-computed values for re-use
    - Stays partitioned
- Broadcast Join
    - Small data that gets cached and shipped in its entirety to each executor (not partitioned anymore). A couple of GBs is probably the maximum you want to work with here.

In other words, the difference is that with caching, each executor only keeps its own fraction of the data in memory.

Example: say the data is 100 GB and you have 200 partitions. Each partition is then ~500 MB; with 4 tasks per executor that's ~2 GB per executor, which fits in memory, and that's how you can cache it.

### Broadcast join optimization

Broadcast JOINs prevent shuffle. For the most part, they get triggered automatically when possible.

As said before, the way it works is that one side of the JOIN is "small", so it gets broadcast to all executors.

This means the dataset doesn't get shuffled around; it gets shipped to every executor instead.

The setting `spark.sql.autoBroadcastJoinThreshold`, which defaults to **10MB**, determines the maximum data size for a broadcast join to happen automatically.

You can crank this up even 200x, as long as you have enough memory. You can go up to a few gigabytes, but definitely not tens of gigabytes or more. Single-digit gigabytes are still somewhat OK.

> You can explicitly wrap a dataset with `broadcast(df)` too
>
- This will trigger the broadcast regardless of the size of the DF. This is more deterministic than just hoping your little table stays below the threshold you set, because when it grows above it, you have to go back and update your configuration.
- Doing this explicitly is probably recommended because other people reading your code will understand your intent.

# UDFs

Stands for User Defined Function. UDFs allow you to do all sorts of complex logic and processing when you are in the DataFrame world. They're quite powerful, but there are some gotchas.

One problem is with Python UDFs:

- When Spark (running on the JVM) hits a Python UDF, it serializes the data and passes it to a Python process, which runs your code, returns some value, and serializes it back to Spark, where it gets deserialized again.
- There are a lot of steps, so as you can imagine, this makes Python UDFs a lot less performant.
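To make that round trip concrete, here's a minimal PySpark sketch (the column and data are made up for illustration); every row crosses the JVM/Python boundary on the way in and on the way out:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([("chrome",), ("firefox",)], ["browser_type"])

# Each row is serialized out of the JVM, handed to a Python worker,
# transformed, then serialized back: that's the overhead described above.
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

df.withColumn("browser_type_upper", to_upper(col("browser_type"))).show()
```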
> However, Apache Arrow has made all these nasty SerDe steps a lot better, so Python UDFs have become **mostly** in line with Scala Spark UDFs.
>

There are still some performance hits in Python when we're considering UD**A**Fs, which are **A**ggregating functions.

So when people ask "should one use Scala Spark or PySpark?", this is about it:

- Scala Spark gives better-performing UDAFs, but that's a very niche case.
- You get the **Dataset API** in Scala, which you don't get in Python.

**Dataset API**

This allows you to do completely functional programming with your Scala Spark code, where you don't have to do any DataFrame or declarative programming.

# DataFrame vs Dataset vs SparkSQL

Dataset is Scala only!

Zach thinks of this as a continuum, where on one side you have SparkSQL, and on the other you have Dataset.

**SparkSQL**

- Lowest barrier to entry
- Useful when you have many cooks in the kitchen (e.g. analysts, data scientists, etc…)
- Quickest to change
- For short-lived pipelines that require fast and frequent iterations

**DataFrame**

- Modularize the code
- Put stuff into separate functions
- Make things more testable
- For pipelines that don't need to be iterated as quickly, and for the long haul

**Dataset**

- The closest to software engineering
- Allows you to create fake mock data a lot more easily
- Handles NULL better
→ you have to model it (e.g. declare columns nullable), otherwise you get an exception when you encounter nulls. **DataFrame and SparkSQL can't do this!**
- Best for pipelines that you need for the long haul and you're already in Scala!

# Parquet

- Used by default by Iceberg
- Amazing file format
    - Run-length encoding allows for powerful compression
- Don't use a global `.sort()`
- Use `.sortWithinPartitions`
    - Parallelizable, gets you good distribution

# Spark tuning

- Executor memory
    - Don't just set it to 16GB and call it a day, it's a waste
- Driver memory
    - Only needs to be bumped up if:
        - You're calling `df.collect()`
        - You have a very complex job
- Shuffle partitions
    - Default is 200
    - **Aim for ~100-200 MB per partition to get right-sized output datasets** (that is, assuming your data is uniform, which it often won't be)
    - But it's not a rule fixed in stone, try a range of values to see which one performs best
- AQE (adaptive query execution)
    - Helps with skewed datasets, wasteful if the dataset isn't skewed
    - Don't just enable it by default on the off-chance that your dataset is skewed

diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lab.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lab.md
new file mode 100644
index 00000000..48338c5d
--- /dev/null
+++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lab.md
@@ -0,0 +1,57 @@
# Day 3 - Lab

In this lab, we will take a look at testing in Spark.

We need to install the `requirements.txt` at `bootcamp/materials/3-spark-fundamentals/`.

```bash
python -m venv .venv
source .venv/bin/activate   # activate the virtual environment first
pip install -r requirements.txt

# then run this to check that everything works
python -m pytest
```

Check the `src/jobs/` directory: you will see 3 files with the suffix `_job.py`. These are all Spark jobs.

One thing you can do to make Spark jobs more testable is to keep the transformation logic in its own function, separate from the code that builds the Spark session and writes the output (e.g. `main`).
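Roughly like this (a sketch only; the function and table names are illustrative, not the actual ones from the repo):

```python
from pyspark.sql import SparkSession, DataFrame

def do_transformation(spark: SparkSession, input_df: DataFrame) -> DataFrame:
    # Pure transformation logic: this is the part worth unit testing.
    input_df.createOrReplaceTempView("events")
    return spark.sql("SELECT user_id, COUNT(1) AS event_count FROM events GROUP BY user_id")

def main():
    # Session creation and the write stay out of the tests.
    spark = SparkSession.builder.appName("example_job").getOrCreate()
    input_df = spark.table("bootcamp.events")
    output_df = do_transformation(spark, input_df)
    output_df.write.mode("overwrite").saveAsTable("bootcamp.events_aggregated")
```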
+ +When you do that, then in your test code you can test just the transformation logic, as we don’t really need to test if the Spark session gets created, or if the data gets written out. + +The main thing we want to test 99.99% of the times is if the transform logic works. + +Let’s now look at the `tests/` folder, the file `conftest.py`. + +What this file does is it gives us Spark anywhere it is referenced in a test file (this is common PyTest behavior thanks to [fixtures](https://docs.pytest.org/en/6.2.x/fixture.html)). + +Now go to `players_scd_job.py` and add a +1 to `start_date` in aggregated, just to mess around, then run `pytest` in the terminal. In the output you will see that chispa will start yelling at you, also showing where your data output is wrong w.r.t. expected. + +How do you generate fake data for the test? You literally write it down, like this: + +```python +source_data = [ + PlayerSeason("Michael Jordan", 2001, 'Good'), + PlayerSeason("Michael Jordan", 2002, 'Good'), + PlayerSeason("Michael Jordan", 2003, 'Bad'), + PlayerSeason("Someone Else", 2003, 'Bad') +] +source_df = spark.createDataFrame(source_data) + +expected_data = [ + PlayerScd("Michael Jordan", 'Good', 2001, 2002), + PlayerScd("Michael Jordan", 'Bad', 2003, 2003), + PlayerScd("Someone Else", 'Bad', 2003, 2003) +] +expected_df = spark.createDataFrame(expected_data) +``` + +Let’s now create a new job and a new test [not really, it’s already written — Ed.]. + +It’s in the file `team_vertex_job.py`. You will see there’s two functions, besides the query. + +1. `main` — creates the Spark session, calls the transform function and writes out. +2. `do_team_vertex_transformation` — effectively runs the transformation (the sql query). + +Now open `test_team_vertex_job.py`. Notice how `namedtuple` is used to create rows of “fake” data. + +Then follow along the python file, it’s easy to understand by reading the code. diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lecture.md b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lecture.md new file mode 100644 index 00000000..7b817ef2 --- /dev/null +++ b/bootcamp/materials/3-spark-fundamentals/markdown_notes/Day 3 - Lecture.md @@ -0,0 +1,163 @@ +# Day 3 - Lecture + +# Intro + +This lecture will talk about Unit testing and Integration testing for Spark pipelines. + +# Where can you catch (data) quality bugs? + +- In development (best case) +- In production, but not showing up in tables (still good) + - Following the Write Audit Publish (WAP) pattern, your audit fails so you don’t publish to production + - you want to minimize the amount of times that you’re doing this especially if you have a lot of checks that can give false positives +- In production, in production tables (terrible and destroys trust) + - In this case, the quality error bypasses your checks and ends up in prod. + - Usually a data analyst will scream at you + - Sometimes they can go unnoticed + +**How do you catch bugs in dev?** + +Unit tests and integration tests of your pipelines. + +In this case, a unit test can mean if you have UDFs, or other functions that perform very specific things, you want to write a test for each of those functions. **Especially and critically** if these tests call another library. So that if that library changes without you knowing, and creates a bug, your test will catch it and the issue can be fixed before it goes in prod. + +**How do you catch bugs in production?** + +Use the WAP pattern. 
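As a rough sketch of what WAP can look like in a Spark job (the table names and audit checks below are made up; real audits would be more thorough):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-sketch").getOrCreate()

# 1. WRITE: land the new data in a staging table, not in prod.
staged = spark.table("staging.daily_events_raw")  # hypothetical input
staged.write.mode("overwrite").saveAsTable("staging.daily_events")

# 2. AUDIT: run quality checks against the staged data.
audited = spark.table("staging.daily_events")
if audited.count() == 0 or audited.filter("user_id IS NULL").count() > 0:
    raise ValueError("Audit failed, not publishing to production")

# 3. PUBLISH: only now expose the data to downstream consumers
#    (assumes prod.daily_events already exists).
audited.write.insertInto("prod.daily_events", overwrite=True)
```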
+ +**What’s the worst situation?** + +- Data analyst finds bugs in production and yells at you +- Ruins trust +- Ruins mood +- Nobody wins + +Actually another bad situation can happen, which is the wrong data doesn’t get spotted immediately, and wrong decisions happen because of this data. This is the probably the worst case scenario ever. + +# Software engineering has higher quality standards than data engineering + +Why? + +- Risks + - If Facebook website goes down, does it lose more revenue from that, or from a data pipeline that goes down? + - Frontend being non-responsive stops the business too + - A lot of times in data it’s ok if things break even for a day or two + - Consequences in SWE when things break is 1 or 2 orders of magnitude higher than in data engineering, and also more immediate +- Maturity + - SWE is a more mature field + - Test-driven development and behavior-driven development are new in data engineering +- Talent + - Data engineers come from a much more diverse background than SWEs + +# How will data engineering become riskier + +Zach was working at Airbnb and they had this ML model that was called “smart pricing”; the thing is most hosts don’t pick their price, they let AI figure it out. + +This algorithm was responsible for a very large chunk of Airbnb revenues. If this model is trained on data that is 1 or 2 days delayed, how much is lost due to this, for every day the data is delayed? + +- Data delays impact machine learning + - Every day that notification ML was behind resulted in a ~10% drop in effectiveness and click through rates +- Data quality bugs impact experimentation +→ If you have dq bugs that aren’t found for a sufficiently large period of time, people might get the wrong idea around their experiments, in that the bugs create consequences that you cannot rely upon, but you don’t know they exist, and so you end up trusting something you shouldn’t. + + + +When you build a pipeline, think what are the consequences of something breaking, and what you can do to mitigate them. + +In other words, we have to level our game as the organizations start to rely more and more on data. + +## Why do most organizations miss the mark on data engineering quality? + +So many places Zach worked at, they don’t do quality right, at all. + +- When Zach worked at Facebook, for the 1st 18 months he didn’t write a single DQ check. +- One reason is that they had this culture of “move fast and break things”, in order to iterate quickly, and so the quality check part was kind of a “nice-to-have”. +- Data analytics is about answering questions as quickly as possible, but doesn’t have the same culture of automated excellence +→ However, you can actually have both speed and automated excellence! +- “This chart looks weird” is enough for an alarming number of analytics organizations +- These are part of the reason Zach left Facebook. + +Goes to show that even Facebook, that has some of the best engineers in the world, can and has missed the mark on data quality. + +## Tradeoff between business velocity and sustainability + +Business wants to answer questions as quickly as possible. “I need this answer yesterday — can you quickly pull this data for me?”. + +And of course you know that quickly is not often feasible, so how do you decide, do you answer questions quickly, or do you avoid making tech debt, or do you stay in the middle? + +- Business wants answers fast +- Engineers don’t want to die from a mountain of tech debt +- Who WINS? 
+- Depends on the strength of your engineering leaders and how much you push back! +- **Don’t cut corners to go faster!** + +In Zach’s experience, strong engineering managers usually lean more heavily in **not creating tech debt**, by going a little bit slower on answering business questions. However, while working with weak leaders that don’t understand how to push back, then essentially “**analytics eats the data engineering lunch”** and you get burned out. + +Data engineers build roads and answers, not answer questions. If you spend time only answering question, your road will be shitty, bumpy, and filled with dirt. You want to take the time to build the highway! Obviously it takes longer, but it reaps more benefits in the long run! + + + +# Data engineering capability standards will increase over time + +Since it’s engineering, it’s gonna get better and better over time. Think about the first Tesla, it was super expensive, and then it got cheaper and faster and self-driving in subsequent models. + +Here’s 4 ways that as data engineers we’re going to create more values over time: + +- Latency + - Solved with streaming pipelines and microbatch pipelines + → so that data is available sooner, and processed sooner etc… For things like fraud detection this is important! + - Another different thing between streaming and batch, it runs 24/7, whereas a batch pipeline runs once or twice a day. If you’re running something 24/7, the chance something goes wrong is a lot higher, so the quality standards must be a lot higher! +- Quality + - Solved with best practices and frameworks like Great Expectations +- Completeness + - Solved with communication with domain experts + → this is very important because you need to understand the bigger picture! +- Ease-of-access and usability + - Solved by data products and proper data modeling + - Most people / data engineers think: “our product is we give people a data table / dashboard and they can query it or play with filters etc…” + - **There’s way more things you can do**. For instance, when Zach was at Netflix, he built a product that surfaced all the data via API call, so you could just hit an URL and get all data you want + - You can have automated integrations that way, where people can read and write to whatever data store that you have + +As data engineers, we’re going to do more, and that will require us to hold ourselves to higher quality standards. + +# How to do data engineering with a software engineering mindset + +How do we create code and systems that follow the software engineering mindset? + +- Code is 90% read by humans and 10% executed by machines + - You want to write code that’s meant for humans to read, not for machines to run. That’s a very powerful thing to do, because the more readable your code is the easier it is to troubleshoot, and that’s worth way more than a 5% pipeline speed improvement. +- Silent failures are your enemy + - That’s the type of bugs that you don’t ever want to have as much as you can + - A lot of times the better thing to do is to `raise` the exception, as that’s going to be better than a silent failure. So if it crashes it’s gonna alert you and you can act upon it and see what’s wrong. 
+- Loud failures are your friend + - Fail the pipeline when bad data is introduced + → You can even create a special exception that is specific to the problem + - Loud failures are accomplished by testing and CI/CD +- DRY code and YAGNI principles + - DRY → Don’t Repeat Yourself → Encapsulate and re-use logic + - YAGNI → You Aren’t Gonna Need It → Build things out AS they’re needed, don’t try to build out the master, awesome pipeline form the get go + - SQL and DRY are very combative + - SQL is quite hard to encapsulate and test [dbt to the rescue! (lol) — Ed.] + - Design documents are your friend + - Care about efficiency! + - DSA, Big(O) notation, etc… + - Understanding JOIN and shuffle → how many operations happen during a join, is it a cartesian product, etc… + +## Should data engineers learn software engineering best practices? + +Short answer: YES! + +If the role is more analytics focused, these best practices are less valuable probably, but if you learn these practices you don’t have to be a Data Engineer, you can be a SWE if you want. It opens the door to many other opportunities. + +- LLMs and other things will make the analytics and SQL layer of data engineering job more susceptible to automation +- If you don’t want to learn these things → Go into analytics engineering! diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1la_image.png b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1la_image.png new file mode 100644 index 00000000..1a65f680 Binary files /dev/null and b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1la_image.png differ diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 1.png b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 1.png new file mode 100644 index 00000000..81a5c8be Binary files /dev/null and b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 1.png differ diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 2.png b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 2.png new file mode 100644 index 00000000..8fff2cb7 Binary files /dev/null and b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 2.png differ diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 3.png b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 3.png new file mode 100644 index 00000000..f483c9b4 Binary files /dev/null and b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image 3.png differ diff --git a/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image.png b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..e9ef4447 Binary files /dev/null and b/bootcamp/materials/3-spark-fundamentals/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..13851fb8 --- /dev/null +++ b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,225 @@ +# Day 1 - Lab + +# Setup + +This requires a bit of setup. + +First thing to do is to grab the keys from either Zach’s email or discord. 
Look for these 2 things: + +- `*KAFKA_WEB_TRAFFIC_SECRET*` +- `*KAFKA_WEB_TRAFFIC_KEY*` + +Then from the website [https://www.ip2location.io/](https://www.ip2location.io/) you want to register and grab a free API key. + +Copy the file `example.env` and rename the copy to `flink-env.env`. Then paste your values there and don’t commit them anywhere. + +Then, have docker running and run `make up`. Visit [`localhost:8081`](http://localhost:8081) to ensure everything is working correctly. + +Then, we also need Postgres container from week1 to be running. Then open a postgres console or something where you can run queries, and run the code found in `sql/init.sql`. This will create a table so that Postgres can actually collect the data. + +As a next step, you need to run `make job`. Then go to the Flink console → Running Jobs, you should see a single job running called `insert-into_default_catalog.default_database.processed_events`. Click on it and make sure it has no errors. If it does, check that the port Postgres is running on is actually `5432` and not some other port. + +Last thing, open Postgres and run + +```sql +SELECT * FROM processed_events; +``` + +If you see some events, then you did everything correctly. + +# Lab + +In this lab we will be working with a job (`start_job.py`) that looks at a Kafka topic, which is the one for **DataExpert** → every time there’s a pageload in DataExpert, it puts a record into Kafka. + +The topic name is `bootcamp-events-prod`, and that’s where it’s gonna be writing its values. What the job does is taking the data from Kafka, hits the IP locator API and geocodes that IP address, so we can see where that traffic is coming from. + +--- + +Open the file `start_job.py` in week 4 folder, and go all the way down to the function `log_processing`. This is the start of the Flink job, where all definitions and stuff are gonna go. From here we can see how the job is actually set up. + +Let’s go kinda line by line: + +```python +def log_processing(): + # --- Set up the execution environment + + # First thing, you want to get the "stream execution environment" + # Cause flink can also do `enable_batch_mode`, which works, but behaves differently + env = StreamExecutionEnvironment.get_execution_environment() + # These are milliseconds so checkpointing here is every 10 seconds. + env.enable_checkpointing(10 * 1000) + + # --- Set up the table environment + # 👉 t_env kinda allows us to work with the spark equivalent of a dataframe + # Here we're enabling the `use_streaming` setting + settings = EnvironmentSettings.new_instance().in_streaming_mode().build() + # where we define our sources + t_env = StreamTableEnvironment.create(env, environment_settings=settings) + + # this is very similar to registering a UDF in Spark. + t_env.create_temporary_function("get_location", get_location) + + # In get_location, is where we get the IP geolocation, from that API service + # Check above the function get_location + # This "UDF" returns a JSON object of country, state and city. + + # You can see it's called a `ScalarFunction`, which means it takes in 1 row or column + # and returns 1 row or column. + # In this case we're taking 1 column (IP address) and returns 1 column (JSON object). + + # ---- + try: + # Here we're creating a couple of tables. A source, and a sink. 
+ # Check these functions in the code + source_table = create_events_source_kafka(t_env) + postgres_sink = create_processed_events_sink_postgres(t_env) +``` + +--- + +Let’s take a look for a moment at the source configuration used to connect to sources and sinks. + +This is for Kafka source. + +```python +WITH ( + 'connector' = 'kafka', + 'properties.bootstrap.servers' = '{os.environ.get('KAFKA_URL')}', + 'topic' = '{os.environ.get('KAFKA_TOPIC')}', + 'properties.group.id' = '{os.environ.get('KAFKA_GROUP')}', + 'properties.security.protocol' = 'SASL_SSL', + 'properties.sasl.mechanism' = 'PLAIN', + 'properties.sasl.jaas.config' = 'org.apache.flink.kafka.shaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{kafka_key}\" password=\"{kafka_secret}\";', + 'scan.startup.mode' = 'latest-offset', + 'properties.auto.offset.reset' = 'latest', + 'format' = 'json' +); +``` + +- connector → specify where you’re reading from. For instance, you can even specify `rabbitmq`, or `jdbc`, so you can read from postgres, etc… In short, Flink can read from anywhere +- bootstrap.servers → all the servers that are running Kafka, a cluster of servers. +- topic → You can think a Kafka topic is very similar to a database table. If you look in the Flink environment, our Kafka topic is `bootcamp-events`. +- group.id → You can think of this like a schema. A lot of times in the DB world, you have `prod.table_name`, `dev.table_name`, this is kinda the same idea. +- properties.security.protocol → the protocol we using to connect to Kafka +- properties.sasl… → probably something specific to SASL security protocol, Zach didn’t treat this in the video because he had slightly different code than the one published. +- scan.startup.mode → remember when we discussed earliest offset and latest offset in the lecture. What this does is, when you first kickoff your Flink job, it’s going to read from either the 1st record in Kafka or the last one. +- properties.auto.offset.reset → This is if it fails and restarts, do we read from the first or the last offset? (or even a checkpoint, but checkpointing is optional) +- format → how is this data stored? In Kafka this is stored as JSON, but this can also be CSV, TSV, etc… + +Then, we execute the source table DDL like this `t_env.execute_sql(source_ddl)`, basically to get access to the Kafka queue. [I’m not exactly sure what’s going on here. Is the “table” on Flink? I guess it is. — Ed.] + +We defined our source, Kafka. Now let’s look at the configuration for the postgres sink. Here the schema is different from the source. + +```python +sink_ddl = f""" + CREATE TABLE {table_name} ( + ip VARCHAR, + event_timestamp TIMESTAMP(3), + referrer VARCHAR, + host VARCHAR, + url VARCHAR, + geodata VARCHAR + ) WITH ( + 'connector' = 'jdbc', + 'url' = '{os.environ.get("POSTGRES_URL")}', + 'table-name' = '{table_name}', + 'username' = '{os.environ.get("POSTGRES_USER", "postgres")}', + 'password' = '{os.environ.get("POSTGRES_PASSWORD", "postgres")}', + 'driver' = 'org.postgresql.Driver' + ); +""" +``` + +- connector → here is `jdbc` because we connecting to postgres +- url → obviously the URL of our postgres environment +- table-name → ‘processed_events’, what we created before +- username → the postgres username +- password → the postgres password +- driver → this must be included when you deploy Flink if you want to be able to connect to postgres otherwise it will fail. + +> Keep in mind, these `CREATE TABLE` statements don’t actually create a table in Postgres. 
This is just so Flink can be aware of the schema. That’s why we actually had to run the `CREATE TABLE` DDL at the beginning of the lab. +> + +Last thing, the Kafka sink. It’s not ran in the main code but the DDL remained, so let’s take a look at it. + +```python +sink_ddl = f""" + CREATE TABLE {table_name} ( + ip VARCHAR, + event_timestamp VARCHAR, + referrer VARCHAR, + host VARCHAR, + url VARCHAR, + geodata VARCHAR + ) WITH ( + 'connector' = 'kafka', + 'properties.bootstrap.servers' = '{os.environ.get('KAFKA_URL')}', + 'topic' = '{os.environ.get('KAFKA_GROUP').split('.')[0] + '.' + table_name}', + 'properties.ssl.endpoint.identification.algorithm' = '', + 'properties.group.id' = '{os.environ.get('KAFKA_GROUP')}', + 'properties.security.protocol' = 'SASL_SSL', + 'properties.sasl.jaas.config' = '{sasl_config}', + 'format' = 'json' + ); + """ +``` + +This is kind of the same as the source data, because it’s the same cluster. The only difference here is the topic, because we are writing to another table/topic. + +> One thing that’s cool about Flink SQL is that is has the same performance as Flink Java, but you can write SQL. Example, the “tables” we created for sources and sinks, can then be referenced just by their name like you would do in SQL. In fact, for each of them, the creating functions just return the table names. We will see this right now, in the transformation step. +> + +--- + +Back to the main code, where we do some transformations: + +```python +t_env.execute_sql( + f""" + INSERT INTO {postgres_sink} # <-- table name! + SELECT + ip, + event_timestamp, + referrer, + host, + url, + get_location(ip) as geodata # <-- the UDF we created to get ip geolocation! + FROM {source_table} + """ +).wait() +``` + +This is very simple, there’s probably nothing to explain. + +The `.wait()` is to make it run continuously instead of just running once for every record in the Kafka queue and then ending the job. + +The try-catch is in case there’s a failure on either side. + +--- + +Let’s now kill the Flink processes with `make down` and restart them with `make up`. If you now visit `localhost:8081`, and go to Jobs → Running Jobs, you will see there’s no running job ‘cause we just started the task manager. + +Now click on “Overview”. + +- Available task slots → how many jobs you can run at once. If you have more, is gonna wait for one to finish (’cause Flink jobs can also be batch jobs). + +Now run `make job` in the terminal to restart the job. + +Flink is similar to Spark in a certain way, as Spark is lazily evaluated and only starts a job when there’s an `insert` statement, or a `.collect()` or something that moves data. Flink is similar, which is why if we were running both sinks (Kafka and Postgres) we would see two jobs running. + +As a fun experiment, kill the jobs again and change the offset to `earliest`, then restart flink and the job (`make up` and `make job`). This will create some interesting things in the UI. + +In the Flink UI, click Jobs → Running Jobs, then open the details of the running job. The first thing you’ll notice is that the color of the box is no longer blue, but fully red (give it a couple seconds). + +![image.png](images/d1la_image.png) + +Also, the **“busy”** level is now at 100% [in previous experiments with latest offset it was running around 16% — Ed.]. + +- **Busy** → it means we’re at MAX compute, i.e. this job cannot process any more data and if it processes more data [not exactly sure how that would work — Ed.] 
it gets moved into **“backpressured”.** +- **Backpressured** → this essentially tells Kafka “slow down, add a bit more latency to the data, so we can process all of it and we don’t go OOM”. This can happen if your dataset goes very viral, or you get a lot of hits and spike. + +Sure, you can enable some parallelism, but you have to pick the number of machines ahead of time, and you don’t know how much data you’re gonna have. + +One tricky aspect of this pipeline is that we’re using a postgres sink, which contrarily to kafka sink is not tuned for high throughput, so in general you want to be aware when you’re building your Flink jobs about where your data is going. + +👆 This is also the reason you’re seeing the job busy for so long. Because writing to postgres is going to take a while. If you were using a Kafka sink instead, you would see it get in the “blue” probably much quicker. diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..74ee40df --- /dev/null +++ b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,267 @@ +# Day 1 - Lecture + +# Intro + +In this course we will cover + +- Apache Flink +- Apache Kafka +- how to process data in real time +- how to connect to an Apache Kafka cluster in the cloud that’s attached to Data Expert Academy +- working with clickstream data in the labs +- working with the complex windowing functions available in Flink like **count windows**, **sliding windows** and **session windows** + +# What is a streaming pipeline? + +Streaming pipelines process data in a low-latency way. In this lecture we will talk about different types of low latency pipelines. + +Usually **“low-latency”**, means the data is processed in the order of minutes, depending also on the windowing functions, between couple minutes to half an hour. + +Another word Zach likes to use is **“intraday”**, ‘cause a lot pipelines in the business world are daily pipelines, i.e. once a day, kicking off at UTC 00:00, and then run for a couple hours, then they’re done for the day. + +Intraday pipelines run more than once a day, and that’s where this gets complicated, and we’re gonna talk about the different ways this can happen. + +# What’s the difference between streaming, near real-time and real-time? + +- Streaming (or continuous) + - Data is processed immediately as it is generated → processed like a river, a **stream** + - Example: Flink +- Near real-time + - Instead of processing data as is generated, you wait maybe a couple of minutes, to collect all data that has happened in that period of time, and then you process that data as it comes in. + - Example: **Spark Structured Streaming** (not yet a continuous processing framework). +- Real-time + - Often synonym with **streaming**, but not always! + - You wanna make sure to understand latency requirements of the pipeline + - Don’t take your stakeholder words at face value (e.g. “I need it real time!”) + - Zach says, in his career, only two times a stakeholder actually meant a streaming pipeline when asking for **real-time**, so like 10% of the times in his case. 
+ - **As a rule of thumb: streaming pipelines use cases are quite rare!** +- Micro-batch + - Almost the same as **near real-time** + +## What does real-time mean from stakeholders + +- It **RARELY** means streaming +- It usually means low-latency or predictable refresh rate + +> Sometimes Zach had stakeholders tell him “real-time”, but what they really meant is *“every day at 9 AM”* +> + +Lot of times, what stakeholders mistake is that they think real-time means “reliable”, “predictable”. This is good to talk about with them: you can have a Service Level Agreement (SLA) discussion, which essentially means you ask them *“When do you want the data refreshed?”* and a lot of times the will give you a time like “8 hours after midnight”, and then you can agree with them that you guarantee the data will be available by then. + +This conversation will be way better than not having this conversation, and just do everything in streaming. + +This is a proper example how good communication is fundamental in Data Engineering. + + + +# Should you use streaming? + +Is it even worth to use it? + +You need to ask yourself these questions: + +- Do members on the team have a streaming skillset? + - When Zach was working at Netflix, most pipelines in the security space were done in batch or microbatch. The people in security wanted lower latency, even if they were giving already 15 minutes refresh times. + - What happened was the security team kept pushing, so one guy in the DE team started migrating some of the pipelines to streaming, and what happened is that this guy became an island: when one of the streaming jobs broke, he was the only one who could fix it. Zach had to learn how to fix it, but they were only 2 out of 15, which was not really fair, also ‘cause they had to be on-call half the year each. +- What’s the incremental benefit? + - In the Netflix example, let’s say they lowered the latency from 15 mins to 2 or 3 minutes, what is the real benefit gained? In the security space maybe you can make a case for it, but in other cases, you have to consider what the value added is of reducing latency. +- Homogeneity of your pipelines + - If you’re a DE team, and you have 99% batch pipelines, then why have 1% streaming pipelines? + - The only case this makes sense, is if you have already people with the skills, and you have a STRONG incremental benefit to it, then maybe. + - Otherwise, if you’re a batch team, you should stay with batch, and if you’re a streaming team, you should stick with streaming (e.g. Uber and the Kappa architecture). +- Tradeoff between daily batch, hourly batch, microbatch and streaming. + - There’s a continuum between complexity vs latency + - Again you want to go back to what the incremental benefit is +- Where are you gonna put data quality checks? + - ‘Cause this is a lot harder to do in streaming (batch pipelines have easier DQ paths). + - In batch you have obvious steps: A→B→C. You can put quality anywhere between these clear cut steps. + - But in streaming, where do you put it? It just runs all the time, there’s no A→B→C, it’s always on. + +# Streaming-only use cases + +> **KEY: low latency makes or breaks the use case** +> + +This means those cases where hourly batches for instance won’t cut it, and they only make sense in a streaming situation. 
+
+Examples:
+
+- Detecting fraud, preventing bad behavior
+    - Stealing a credit card: you don't want the thief to be able to go on a 1-day spending spree before the fraud is detected → that would be quite a bad fraud detection system
+- High-frequency trading
+    - Imagine you're like "I found a perfect setup for a trade, yesterday", 'cause the data is in daily batch. Well, that opportunity is gone!
+- Live event processing
+    - Like sports analytics during a live game, those all come from streaming pipelines.
+
+> For the streaming-only use cases, there's an obvious signal that we should be using streaming, because without it the product would break.
+>
+
+## Gray-area use cases (microbatch may work too)
+
+- Data that is served to customers
+    - What you want to be careful about is the tradeoff between data quality and latency. So streaming might not be a good fit, but microbatch might, as it's easier to implement DQ.
+    - Customers might be ok having 1 hour latency.
+    - What's bad is if you're many days behind.
+- Reducing the latency of upstream master data
+    - Imagine you have master data that is upstream in your DWH, one of the furthest upstream datasets that's depended on by a lot of other datasets. Here streaming could be very powerful, because all the downstream datasets can fire sooner, as the daily data will be ready sooner.
+    - This also lets your warehouse usage be amortized, as you use the compute throughout the day, as opposed to having dead compute in the middle of the day and a big spike at midnight, which happens a lot in many different warehouses, especially in those hours between 0:00AM and 2:00AM.
+    - This makes even more sense if you're not using on-demand resources but you're just renting.
+
+Here's an example from Zach where he failed to implement streaming: he was working in notifications, and he needed to work on this notification event fact data.
+
+Every row was a notification event: sent, delivered, clicked on etc…
+
+The thing is that these events can be duplicated, because you can click on a notification twice. So this was a perfect use case for streaming, and no matter what Zach did, it kept going OOM.
+
+The reason is that a duplicate can happen at any point throughout the day, so even if one event happens in the morning and the other at night, those duplicates still need to be managed. That means you have to hold on to every **notification_id** for the entire day, everything in RAM. And that dataset was 20 TB.
+
+In that case, what Zach ended up doing was eventually abandoning streaming and doing it in micro-batches. [Here's the example repo](https://github.com/EcZachly/microbatch-hourly-deduped-tutorial) (it's the tree deduping microbatch approach).
+
+This made all the notifications datasets way more up-to-date, from 9 hours of latency to about 1 hour.
+
+# No-go streaming use cases! (use batch please)
+
+If the analyst is just upset that the data isn't up-to-date when they're querying early in the morning, let it be, it doesn't matter. You need to ask yourself: ok, if it was up to date, what would change in the business? **Most times, nothing, so it would be a waste of time to reduce the latency.**
+
+If you think about the velocity at which business decisions are made, most business decisions take hours; it never happens that a business sees the data and goes "omg! we need to act now!".
+ +> So you need to ask yourself, even if data was available right from the moment it gets generated, how is the business gonna change? +> + +Some cases, real-time data it might even be detrimental, as people might just get stuck on the screen watching the numbers update continuously, like traders addicted to the stock market in the minute-to-minute charts. + +# How are streaming pipelines different from batch pipelines? + +- Streaming pipelines run 24/7! Batch pipelines run for a small % of the day, like 1 to 4 hours or so, where 4 hours is for those giant behemoth big tech pipelines. + - A pipeline running all the time has more probability of breaking, rather than something running just a few minutes a day. + - Thus require a very different skillset → Streaming pipelines act as a webserver, serving and moving data. +- 👆 Therefore, streaming pipelines are much more software engineering oriented + - They act a lot more like servers than DAGs +- Streaming pipelines need to be treated as such and have more unit test and integration test coverage like servers! + +# The streaming → batch continuum + +![image.png](images/d1le_image.png) + +This continuum goes from not complex → very complex, and from high latency → low latency. + +The lower the latency, the higher the engineering complexity. + +> As you reduce the latency of your solution, complexity goes up! +> + +--- + +Real-time is actually a myth (even with Flink)!. + +- You’ll have seconds of latency just because there’s network transfer time (generation → Kafka → Flink → sink) + - Example, in the lab today, every time Zach’s website gets a web request, there’s an interceptor that intercepts the web request and logs it into Kafka. The avg. latency for this step is 150ms. + - The above is just a step, but there’s actually many more moving pieces that delay the data! + +> Just remember real-time doesn’t mean instantly. +> + +# The structure of a streaming pipeline + +- The sources + - Kafka, RabbitMQ + - Rabbit doesn’t scale as much as Kafka due to throughput + - But it has more complex routing mechanisms → i.e. Rabbit can be like a message broker; you can do “pub-sub” a lot easier than you can do with Kafka. + - Kafka is more like a firehose, in one direction. You can’t really like divert → “here’s my stream of events now process it!” + - Because of this architecture, Kafka is very fast. +- Enriched dimensional sources (i.e. side inputs) + - Here is where you have events coming in, and you want to have that event bring in some dimensional sources, maybe doing some de-normalization of your fact data, where you bring in different columns from other tables. + - These are called side inputs. Google “Flink side inputs” to learn more about it. + - Data comes from regular data-at-rest, like Iceberg or Postgres + - This dimensional data can refresh on a cadence, like 3-6 hours, you decide. + - Obviously, if your dimensions change a lot, then you want the cadence of refresh to be smaller. + - You don’t wanna refresh too much as you’re gonna waste compute on data that doesn’t change too much. + +--- + +- The compute engine → Two SOTA options: + - Flink → very powerful framework + - Spark Structured Streaming + - These are what actually do the crunching of the data. 
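+To make these pieces a bit more concrete before walking through the flow below, here is a minimal sketch of what a Kafka source → Postgres sink pair can look like in Flink SQL. This is generic Flink SQL, not the course repo's code: the topic, field names and connection strings are made up for illustration.
+
+```sql
+-- Hypothetical Kafka source: events land here in real time
+CREATE TABLE web_events (
+    url        STRING,
+    host       STRING,
+    event_time TIMESTAMP(3),
+    -- event time plus a small out-of-orderness buffer (watermarks are covered below)
+    WATERMARK FOR event_time AS event_time - INTERVAL '15' SECOND
+) WITH (
+    'connector' = 'kafka',
+    'topic' = 'bootcamp-events',
+    'properties.bootstrap.servers' = 'broker:9092',
+    'scan.startup.mode' = 'latest-offset',
+    'format' = 'json'
+);
+
+-- Hypothetical Postgres sink (remember: not tuned for high throughput like a Kafka sink)
+CREATE TABLE processed_events (
+    url        STRING,
+    host       STRING,
+    event_time TIMESTAMP(3)
+) WITH (
+    'connector' = 'jdbc',
+    'url' = 'jdbc:postgresql://localhost:5432/postgres',
+    'table-name' = 'processed_events'
+);
+
+-- The "job" is just a continuous INSERT from the source into the sink
+INSERT INTO processed_events
+SELECT url, host, event_time FROM web_events;
+```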
+
+---
+
+Here basically what happens is:
+
+- Your data lands in real-time in Kafka
+- Flink picks it up in real-time and dumps it somewhere
+
+Components:
+
+- The destination, also called "the sink"
+- Common sinks
+    - Another Kafka topic
+    - Iceberg → one of the reasons why it was created was to solve this streaming problem.
+        - Before, you had HIVE (its metastore was very *batch oriented*)
+            - If you wanted to add data, you had to overwrite the whole partition
+            - No way to add new data to a partition that already existed
+        - Iceberg let people do this 👆 (append new data to existing partitions)
+        - Iceberg tables can be easily populated by streaming, but also queried in batch.
+    - Postgres
+        - This is another powerful option to set up as a sink
+
+# Streaming challenges
+
+- Out of order events
+- Late arriving data
+- Recovering from failures
+
+Remember that there's latency between when data is generated and when it lands in Kafka.
+
+Because of this latency, you can have newer data that is ahead of older data in the dataset, or in other words, the data is not in the right order. How do you manage these? Especially in cases like funnel analysis, or other event streams that depend on the order of events.
+
+For late arriving data, compare it with the batch world: there it doesn't have such a big impact, because you only have to worry about it at midnight. A lot of the time, the batch job doesn't even fire until 10-15 minutes after 00:00AM. You don't really have to worry about late arriving data here, unless it's really really late, but in that case, most people would go "eh fuck it, who cares".
+
+Regarding failure recovery, streaming pipelines are tricky when they fail, because you need to reset them and have them run again. And the longer you wait to fix a streaming pipeline failure, the more the data behind it gets backed up, more and more.
+
+> Most of these problems don't exist in the batch environment.
+>
+
+## Out of order events
+
+How does Flink deal with out-of-order events?
+
+- You can specify a **WATERMARK** → what the watermark does is say "there are no events newer than the watermark". In other words, when the watermark is hit, you can guarantee that all events inside that watermark are gonna be older than the events after the watermark.
+- How it works is: the watermark looks at the event time, and usually there's some buffer (like 15 seconds). So the watermark says "ok, everything within the next 15 seconds could be out-of-order, but something 16 seconds away is NOT out of order".
+- Basically you give yourself a window of X seconds where there's a possibility that your events will be out of order, and Flink will fix the ordering in that window automatically.
+
+## Recovering from failures
+
+Flink manages this in different ways:
+
+The big one is **CHECKPOINTS.** You can tell Flink to checkpoint every `n` seconds, and what this does is save the state of the job at that moment in time, so it knows where to read from and write to if it fails.
+
+This leads us to **offsets.** When Flink starts up, you have to tell it to do one of:
+
+- **earliest offset** → read everything that's in Kafka, all the way back
+- **latest offset** → only read new data after the job starts
+- A specific moment in time → read at this timestamp or newer.
+
+Or it can pick up from a checkpoint / savepoint.
+
+- Checkpoints → internal to Flink's own mechanisms
+- Savepoints → More agnostic, kinda like a CSV file that says "ok we processed this data and we got this far".
Sometimes if Flink fails, we want other systems to be aware where we failed and where to pick up from. + +## Late-arriving data + +How late is too late? + +You actually have to choose the time here. 5 minutes? 10 minutes? + +> Late arriving data and out of order are related, because out of order is data that arrived late. +> + +Watermarking and late-arriving data are similar concepts, but: + +- Watermarking is for the 99% of data +- Late arriving is for the long tail, the small amount of data that might come in exceptionally late diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lab.md b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lab.md new file mode 100644 index 00000000..00cbffc2 --- /dev/null +++ b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lab.md @@ -0,0 +1,102 @@ +# Day 2 - Lab + +In this lab we will work with aggregations over window functions in Flink. First, let’s start up Flink with `make up`. + +Let’s start by creating the necessary tables in Postgres: + +```sql +-- to count hits on host every 5 minutes +CREATE TABLE processed_events_aggregated ( + event_hour TIMESTAMP(3), + host VARCHAR, + num_hits BIGINT +); + +-- to count hits on host for each referrer every 5 minutes +CREATE TABLE processed_events_aggregated_source ( + event_hour TIMESTAMP(3), + host VARCHAR, + referrer VARCHAR, + num_hits BIGINT +); +``` + +Now open `aggregation_job.py`. Like with the other job, let’s go step by step. + +```python +def log_aggregation(): + # Set up the execution environment + env = StreamExecutionEnvironment.get_execution_environment() + env.enable_checkpointing(10) + env.set_parallelism(3) + + # Set up the table environment + settings = EnvironmentSettings.new_instance().in_streaming_mode().build() + t_env = StreamTableEnvironment.create(env, environment_settings=settings) + + # these first settings are standard that you're going to see + # in pretty much every single Flink job + + try: + # Create Kafka table + # Here we create our source table, which was created from that other job + source_table = create_processed_events_source_kafka(t_env) + + # and here we define our sinks + aggregated_table = create_aggregated_events_sink_postgres(t_env) + aggregated_sink_table = create_aggregated_events_referrer_sink_postgres(t_env) + + # Here we're gonna have a window that is open every 5 minutes (see the tumble) + # And basically counts, for every 5 minutes, how many hits each host gets (group by) + # and returns 3 columns: the initial time of the window, the host, and the hit count. + t_env.from_path(source_table)\ # read from source table + .window( # do the 5 minutes window on a time column + Tumble.over(lit(5).minutes).on(col("window_timestamp")).alias("w") + ).group_by( # group by the window and other dimensions + col("w"), + col("host") + ) \ + .select( # return the columns we need + col("w").start.alias("event_hour"), + col("host"), + col("host").count.alias("num_hits") # this is the aggregation + ) \ + .execute_insert(aggregated_table) + # in other words, this 👆 is gonna do an aggregation every 5 minutes + # Another note: you most likely wanna do your windowing on event time, + # rather than processing time!!! + + # The other job is pretty much exactly the same, + # except it adds another dimension (referrer) to the aggregation. 
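+        # A hedged sketch of that second, referrer-level job -- mirroring the call
+        # chain above; not copied verbatim from the repo's aggregation_job.py
+        t_env.from_path(source_table) \
+            .window(
+                Tumble.over(lit(5).minutes).on(col("window_timestamp")).alias("w")
+            ).group_by(  # same 5-minute window, plus the extra referrer dimension
+                col("w"),
+                col("host"),
+                col("referrer")
+            ) \
+            .select(
+                col("w").start.alias("event_hour"),
+                col("host"),
+                col("referrer"),
+                col("host").count.alias("num_hits")
+            ) \
+            .execute_insert(aggregated_sink_table)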
+```
+
+> Keep in mind that when using window functions in Flink, your source needs to have watermarks defined for them to work, as seen in the function that defines the source, `create_processed_events_source_kafka`. *[This is at least according to how Zach debugged an issue with the pipeline during the video lesson. — Ed.]*
+>
+
+Now, in your terminal, run `make aggregation_job` to start these new jobs. You need to keep it running for at least 5 minutes to see some data in postgres. When you do, data for the first job should look like this:
+
+![image.png](images/d2la_image.png)
+
+Whereas data for the 2nd should look like this. It's basically the same, but also grouped by referrer.
+
+![image.png](images/d2la_image%201.png)
+
+---
+
+One thing that's important about Flink is working with parallelism. If you check the Flink UI, you'll see that each of the running jobs has a parallelism of 3, and also 3 tasks per stage. Parallelism is correlated to how data is partitioned at the source, so if you have 3 partitions, a parallelism of 3 will give 3 times the speed of no parallelism.
+
+Another interesting aspect is that parallelism numbers in Flink are much smaller than in Spark. In Spark, the default partitioning is 200, whereas in Flink it's 1.
+
+That's one of the fundamental differences between batch and streaming: in streaming you process the data as it comes in, but your job is running all the time, and you're not constantly hitting spikes that require a lot of parallelism. So it's kind of a trade-off.
+
+**Do we need to do the aggregation in real time? And should it be Flink, or Spark, or batch?**
+
+This depends on the situation. In this case, it's nice to use Flink because the aggregation window is only 5 minutes. Running a Spark job that does a GROUP BY every 5 minutes would be an absolute joke, since it would take like 3 minutes every time just to start the job.
+
+There's a threshold where it flips, where batch becomes preferable to streaming, which is probably around 1 hour, in that you don't want to do a tumble window that's as large as 1 hour.
+
+Obviously this rule of thumb only matters when you're working at a very large scale.
+
+At a smaller scale, it's totally possible to use Flink to do daily aggregations, when the data is quite small.
+
+But this is the reason why batch becomes better on larger timeframes: Flink needs to hold on to so much stuff in memory when doing a windowing, so imagine if you were to do a window of 2 hours with billions of events.
diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lecture.md
new file mode 100644
index 00000000..3e2f1110
--- /dev/null
+++ b/bootcamp/materials/4-apache-flink-training/markdown_notes/Day 2 - Lecture.md
@@ -0,0 +1,230 @@
+# Day 2 - Lecture
+
+# Streaming needs a lot of pieces to work
+
+![image.png](images/d2le_image.png)
+
+Let's look in more detail at what's going on when events from Zach's website show up in our local postgres instance, like we saw in the previous lecture and lab.
+
+This diagram above is the architecture of his entire website.
+
+Starting from the stick figure:
+
+1. The http request (to visit the website) gets intercepted
+2. It's logged to Kafka
+    1. Goes to a **Kafka Producer**
+    2. Kafka Producer puts it on Kafka
+    3. Read from Kafka with Flink
+    4. Can dump it to either another Kafka topic or our local Postgres
+3. It's also passed forward to the [Eczachly.com](http://Eczachly.com) ExpressJS server
+4.
Check if the request is a webpage request (a normal HTML request) +5. If yes, server side rendering with NextJS + 1. Passes the rendered HTML to client + 2. Client ready for the user to work with it +6. If it’s not a web request, then it’s an API request (update, create, delete) + 1. Request gets sent to Postgres (like a signup) + 2. Goes back to the server + 3. Redirect the user + +The whole point is that streaming needs to have some type of **real-time architecture** like this. It has to have a way to intercept real time events, so there’s more to streaming than just Kafka. + +There has to be another layer, an **event generation layer**, which in this case is the HTTP interceptor, when every time an action is taken by a user it gets dumped to Kafka. + +## HTTP Interceptor + +```jsx +export default function createAPIEventMiddleware(req, res, next) { + const shouldBeLogged = + req.url && !req.hostname.includes('localhost') && !isFileRequest(req); + const event = { + url: req.url, + referrer: req.headers.referer, + user_agent: JSON.stringify(useragent.parse(req.headers['user-agent'])), + headers: JSON.stringify(req.headers), + host: req.headers.host, + ip: req.connection.remoteAddress, + event_time: new Date(), + }; + + if (shouldBeLogged) { + Promise.all([ + sendMessageToKafka({...event}), + ]).then(() => { + next(); + }); + } else { + next(); + } +} +``` + +This sits between the user and when the request hits the server. + +You see it has 3 arguments: + +- req → the request +- res → the response +- next → when it’s called, it passes the request to the server + +What happens here, is the 1st line wants to understand if the events should actually be logged. This way, during local development, those events are not sent to Kafka as it’s not real traffic; file requests also don’t get logged (like image requests) as that would be excessive, as we only care about the webpage requests and API requests. + +Then we have the `event`, and that’s the schema we were working with last lab. + +Last step is: if the event should be logged, then pass the event to Kafka and then call `next()`. Otherwise, just call `next()`. + +## Kafka Producer + +```jsx +export default function sendMessageToKafka(message: object) { + const producer = new Kafka.Producer({ + connectionString: process.env.KAFKA_URL, + ssl: {cert: process.env.KAFKA_CLIENT_CERT...}, + }); + + const messageObject = { + topic: 'porcupine-78436.bootcamp-events', + message: {...}, + } as Message; + + producer + .init() + .then(() => { + return producer.send([messageObject]); + }) + .then((result) => { + console.log('Message sent successfully', result); + return producer.end(); + }) + .catch((err) => { + console.error('Error sending message', err); + return producer.end(); + }) +} +``` + +This is the producer code, that in the Interceptor sends the message to Kafka. In the Producer we need to also specify the SSL connection like we had to to for Flink. + +The `messageObject` is composed of both the topic to send the message to, as well as the payload. + +Eventually, with the producer we try to send the data (`producer.send`), and catch any error that may happen. + +Since this is just logging to Kafka, we don’t want this to throw an error, because it’s just logs anyway. + +> Interceptor and Producer are the two things that happen upstream to events landing in Kafka. +> + +# Future version of the architecture + +![image.png](images/d2le_image%201.png) + +It’s similar to the previous one, with a couple more pieces. 
+ +We have the Flink job that outputs to another Kafka topic, and then a Kafka consumer listens to the processed Kafka queue, which will then push events back to [EcZachly.com](http://EcZachly.com) server, and then the server will push those events through a Websocket, and then the client will be updated in real-time. + +This is what a real-time data product looks like: you have a full loop where the clients generate some data, which gets processed by the server somehow and fed back live to the client. + +Let’s talk a little more about websockets: + +Generally speaking, when you’re building a website, you have a client and a server, and the client always asks the server for data, and it almost always initiates the request. This is basically how HTTP works. + +However, if you need the server to also send requests to the client, then you need a **websocket,** which enables a two-way communication and real-time updates. + +# The big competing architectures + +For processing data, there are two competing alternatives. + +- Lambda architecture +- Kappa architecture + +**Lambda architecture** + +Imagine you have a batch pipeline that you want to optimize for latency, so you create a second one that is a streaming pipeline and lands data earlier so the data is ready earlier. + +In Lambda architecture, you keep both pipelines, batch and streaming, both writing the same data, and the main reason is that the batch pipeline acts as a backup if the streaming pipeline fails in some way. + +The biggest pain here is double code. + +**Kappa architecture** + +In this one you don’t need both streaming and batch, you can just use streaming. For instance, Flink can handle both streaming and batch loads, however backfilling is quite hard, as Flink is mainly tailored to read from Kafka, and reading many days of Kafka history is quite painful. + +However, with Iceberg things are changing, because you can dump data to Iceberg and have a nice partitioned table, and if you need to backfill with Flink, you can do so days at a time, rather than on a single line of data as it would be in Kafka. + +The poster child of Kappa architecture is Uber, as they’re basically a 100% streaming only company, which makes sense, as it’s very in their nature to be streaming first. + +- Pros + - Least complex + - Great latency wins +- Can be painful when you need to read a lot of history + - You need to read things sequentially +- Delta Lake, Iceberg, Hudi are making this architecture much more viable! + +> In the end, one is not better than the other, but it’s useful to know about both of them and where they would fit. +> + +# Flink UDFs + +UDFs generally speaking won’t perform as well as built-in functions + +- Use UDFs for custom transformations, integrations, etc… +- Python UDFs are going to be even less performant than Java or Scala UDFs, and the reason for this is that they need to have a separate python process that gets called when you hit that UDF call (kind of like it is in PySpark). + +# Flink Windows + +There’s two types of windows in Flink: count windows, and time driven windows. You can kind of think of these as GROUP BY or window functions in SQL. It’s kind of like that, but different. + +**Data-driven windows** + +- Count +- How this window works is you open it, and then it stays open until you see **N** number of events +- You can key here, so you can make them “per user” (e.g. like PARTITION BY in SQL, in Flink you have Key By) +- One important aspect is that the number of events may never come, e.g. 
your threshold is 50 but a person only does 3, so you want to specify a timeout to close the window.
+- Very powerful for funnel analytics that have a predictable number of events.
+
+**Time-driven windows**
+
+- Tumbling
+    - Fixed size
+    - No overlap
+    - Similar to hourly data
+    - Great for chunking data
+    - The closest comparison to data in batch. A lot of times in the batch world you have hourly or daily data. This is kind of like a window: from point A to point B, e.g. midnight to midnight of the day after.
+
+![image.png](images/d2le_image%202.png)
+
+- Sliding
+    - Has overlap
+    - Captures more windows
+    - Good for finding "peak-use" windows
+    - Good at handling "across midnight" exceptions in batch
+    - These usually have a fixed width, but can be overlapping. If your fixed width is 1hr, you can have a window from 1:00 to 2:00, but also from 1:30 to 2:30. You get more duplicates this way, so you need to understand how to manage them downstream. You mostly use sliding windows to find the window with the most datapoints, e.g. "peak-time": if your window is 1 hour long but you shift it every 30 minutes, you may realize the peak is between 12:30 and 1:30 rather than 12 and 1. It's also useful when crossing the midnight boundary: if your user starts a session at 11:58 PM and ends at 00:02 AM, do they count as daily active on both days? Sliding windows can show you these kinds of double counts. However, they're relatively niche and have a very specific use case, because of this weird overlap problem, and they're more suited towards an analytical use case rather than for building out master data.
+
+![image.png](images/d2le_image%203.png)
+
+- Session (no graph here unfortunately)
+    - Variable length → User specific
+    - Based on activity
+    - Used to determine "normal" activity
+    - Imagine when a user signs in, they make the first event; that's the start of the window. Then the window will last until there's a big enough gap (e.g. 5, 10, 20 minutes) where you have no data from the user.
+
+# Allowed Lateness vs Watermarking
+
+Two ways to deal with out-of-order or late arriving events. If data is a little bit late, then **watermarking** is gonna be good (like a couple seconds). If it's late in the order of minutes, then **"allowed lateness"** works better.
+
+**Watermarks**
+
+- Defines when the computational window will execute
+- Helps define ordering of events that arrive out-of-order
+- Handles idleness too
+
+**Allowed Lateness**
+
+- Usually set to 0
+- Allows for reprocessing of events that fall within the late window
+- CAUTION: WILL GENERATE/MERGE WITH OTHER RECORDS
+
+Allowed lateness seems to have several drawbacks. Zach provides a nice metaphor here to understand the difference: imagine you have a Zoom call to attend and you're just a couple minutes late: you're not going to text anybody about it. But if you know you're gonna be more than like 5 minutes late, then you probably will text to warn about it.
+
+Zach reasons the same way about events. He only uses **watermarks** most of the time, and if an event is too late (like a person being too late to a Zoom call), he assumes it's not coming and discards it.
+
+Obviously, it depends on the use case, as in some you need to capture 100% of the data even if it is really late, but that's probably not the case for most pipelines.
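+To tie the windowing and watermark ideas together, here is a rough sketch of what tumbling and sliding windows look like as Flink SQL window TVFs. This is generic Flink SQL, not code from the course repo: the `web_events` table and its `event_time` column are hypothetical (the table needs a watermark declared for the windows to fire), and the exact syntax can vary between Flink versions.
+
+```sql
+-- Tumbling: fixed, non-overlapping 1-hour buckets
+SELECT window_start, host, COUNT(*) AS num_hits
+FROM TABLE(
+    TUMBLE(TABLE web_events, DESCRIPTOR(event_time), INTERVAL '1' HOUR)
+)
+GROUP BY window_start, window_end, host;
+
+-- Sliding (HOP): 1-hour windows that advance every 30 minutes, so they overlap
+SELECT window_start, host, COUNT(*) AS num_hits
+FROM TABLE(
+    HOP(TABLE web_events, DESCRIPTOR(event_time), INTERVAL '30' MINUTES, INTERVAL '1' HOUR)
+)
+GROUP BY window_start, window_end, host;
+```
+
+A window only emits its result once the watermark passes the window end, which is why the watermark (or an allowed-lateness setting) decides how late events are handled.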
diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1la_image.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1la_image.png new file mode 100644 index 00000000..42d60976 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1la_image.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1le_image.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..7bc3c9d9 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image 1.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image 1.png new file mode 100644 index 00000000..4c7a9533 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image 1.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image.png new file mode 100644 index 00000000..273fb216 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2la_image.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 1.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 1.png new file mode 100644 index 00000000..0cb870de Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 1.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 2.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 2.png new file mode 100644 index 00000000..d50b8717 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 2.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 3.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 3.png new file mode 100644 index 00000000..e0517756 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image 3.png differ diff --git a/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image.png b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image.png new file mode 100644 index 00000000..9e862d01 Binary files /dev/null and b/bootcamp/materials/4-apache-flink-training/markdown_notes/images/d2le_image.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..53932057 --- /dev/null +++ b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,144 @@ +# Day 1 - Lab + +This lab works with the same setup of the first week (the postgres one). + +Let’s start by creating a new table, which we’re going to use also for the survivor analysis. + +```sql +create table users_growth_accounting ( + user_id text, + first_active_date date, + last_active_date date, + daily_active_state text, -- values that are gonna be like "churned, resurrected" etc. 
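+    -- same idea as daily_active_state, but judged against a trailing 7-day window (see the CASE logic added later in the lab)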
+ weekly_active_state text, + dates_active date[], -- this is similar to week 2 cumulation + date date, + primary key (user_id, date) +); +``` + +Let’s now start building the cumulative table. The data available is not the same as what Zach has in the video somehow, so the dates we use are different. We start from ‘2023-01-01’. + +```sql +with yesterday as ( + select * + from users_growth_accounting + where date = date('2022-12-31') +), + +today as ( + select + user_id::text, + date_trunc('day', event_time::timestamp) as today_date, + count(1) + from events + where date_trunc('day', event_time::timestamp) = date('2023-01-01') + group by user_id, date_trunc('day', event_time::timestamp) +) + +select + coalesce(t.user_id, y.user_id) as user_id, + coalesce(y.first_active_date, t.today_date) as first_active_date, + -- this one below is intentionally the opposite of the one above: + -- if they were active today, that's the last active date + coalesce(t.today_date, y.last_active_date) as last_active_date, + case + -- active today, not yesterday + when y.user_id is null and t.user_id is not null then 'New' + when y.last_active_date = t.today_date - interval '1 day' then 'Retained' + when y.last_active_date < t.today_date - interval '1 day' then 'Resurrected' + -- y.date is the partition date + when t.today_date is null and y.last_active_date = y.date then 'Churned' + else 'Stale' + end as daily_active_state, + + case when 1 = 1 then 1 end as weekly_active_state, + coalesce(y.dates_active, array[]::date[]) + || case when + t.user_id is not null + then array[t.today_date] + else array[]::date[] + end as date_list, + coalesce(t.today_date, y.date + Interval '1 day') as date +from today t + full outer join yesterday y + on t.user_id = y.user_id; +``` + +Run this query, and since this is the first iteration, you will see that everyone under `daily_active_state` is gonna be “New”, as expected, because this is the first exact day, everyone must be new. Keep in mind that the **starting date** matters a lot in determine new users. + +Now add this extra piece right under `daily_active_state`, and try to understand the various case statements. + +```sql + case + when y.user_id is null then 'New' + -- here, since it's weekly, the user needs to be at any point in the last 7 days + -- basically any day between yesterday and 7 days before + when y.last_active_date > y.date - interval '7 day' then 'Retained' + -- resurrected means they came back after a larger period of time than window + when y.last_active_date < t.today_date - interval '7 day' then 'Resurrected' + -- not active today, and last time active is exactly 7 days ago + -- (cause they churn on this specific day), else they're stale + when t.today_date is null and y.last_active_date = y.date - interval '7 day' + then 'Churned' + else 'Stale' + end as weekly_active_state, +``` + +Now add `insert into users_growth_accounting` to the top of the query, and run it a couple times, bumping the dates every time. Query the new table and take a look at the results. You will see that some will be “**weekly active**”, but they have churned for “**daily active**”. + +Now run it a few more times until you reach at least 9 total cumulations, so until ‘2023-01-09’. + +Then run a query on `users_growth_accounting` filtering on the last date. You will see that some users have “churned” on `weekly_active_state`. These are those people that were last active longer than 1 week before. + +Now let’s run for analysis. 
Run this query: + +```sql +select + date, + count(1) +from users_growth_accounting +where first_active_date = '2023-01-01' +group by date; +``` + +You will see a result like this: + +![image.png](images/d1la_image.png) + +Basically this is a cohort of 84 users that were active the first time on the 1st of January. + +Now, for the next step, run this: + +```sql +select + date, + count(1), + count(case when daily_active_state in ('Retained', 'Resurrected', 'New') then 1 end) as number_active, + round(1.0 * count(case when daily_active_state in ('Retained', 'Resurrected', 'New') then 1 end) + / count(1), 2) as pct_active +from users_growth_accounting +where first_active_date = '2023-01-01' +group by date +order by date; +``` + +You will basically see that **J-curve** mentioned in the the lecture. The 2 columns we added represent the (remaining) active users on each given day, as well as their percentage over the initial total. + +However, this query works just for one cohort, and is not great for multiple ones. So we need to switch it up a bit. + +```sql +select + date - first_active_date as days_since_first_active, + count(1), + count(case when daily_active_state in ('Retained', 'Resurrected', 'New') then 1 end) as number_active, + round(1.0 * count(case when daily_active_state in ('Retained', 'Resurrected', 'New') then 1 end) + / count(1), 2) as pct_active +from users_growth_accounting +group by date - first_active_date +order by date - first_active_date; +``` + +Now, with this one we can find all cohorts, regardless of **when** they started. You see the 1st one obviously has 100% active percentage, with 508 people. Then it progressively reduces slowly throughout each **“day since first active”.** And now this is actually the J-curve mentioned in the lecture, for all people, not just a single cohort. + +Obviously, you can take the above example and add more columns and group on them, to gain more information about the different cohorts. diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..36756901 --- /dev/null +++ b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,203 @@ +# Day 1 - Lecture + +# Intro + +Analytical Patterns will save you a lot of time as a data engineer. + + + +There are a couple patterns that if you can see them, then you’ll know exactly what type of pipeline to implement. + +In this lecture we will consider two of these patterns: + +- Growth Accounting → How facebook tracks inflows and outflows of active and inactive users. + - This can be used for any other state change transition tracking + - Closely related to the cumulative table design concepts we’ve seen in weeks 1 and 2 +- Survivor Analysis Pattern → Of all users that signed up today, what % are still active in 30, 60, 90 days? + - The retention number + +--- + +Repeatable analyses are your best friend. The reason is that they allow you to think at a higher level, at a more abstract layer, recognizing that the SQL will write itself once you recognize the higher level abstraction. + +In a way, these abstractions are an extension to SQL. ‘Cause SQL in itself is already an abstraction over some lower level language. If SQL is already an abstraction, we don’t have to be married to this layer, we can go a layer above. 
+ +Common patterns to learn + +- State change tracking (closely connected to SCD) +- Survivorship analysis +- Window based analysis + +Some of these are really closely linked with cumulative table design, some based on it, and don’t work very well without it. + +# Repeatable analyses + +- Reduce cognitive load of thinking about the SQL → You can think about bigger picture layers + - The more important thing is knowing these high level patterns rather than knowing every single SQL function +- Streamline your impact + - Tech will change in the future, but so it’s better to learn to focus on the bigger picture so that no matter the tech, you can still bring your A game, which is gonna be a lot more impactful than just being on top of the bleeding edge of every technology. + + + +# Common patterns + +- Aggregation-based patterns +- Cumulation-based patterns +- Window-based patterns + +95% of pipelines that Zach has ever written in his life are based on just these 3 patterns. + +There’s maybe a 4th pattern that could be called “enrichment-based pattern”, but here we’re already at the master data level, which means we already have all the columns that we need. + +## Aggregation-based patterns + +Probably the simplest patterns. When you’re building a pipeline and you do an aggregation, the keyword is `GROUP BY`. Aggregations are all about grouping by different things and counting things by different dimensions. + +Counting is a very important part of analytics that doesn’t get enough credits, as it’s not sexy or fancy, it’s kinda like the meat and potatoes of analytics. + +- Trend analysis +- Root cause analysis → You can plug in any metric, it would explain the movement of that metric. + - Imagine you have a week-over-week change of +1 million, it would give you the dimensional breakdown on this increase (so like +1.5 million in the US and -500k in India). + - One thing you can do with this is if you see a certain shift in a metric, you can start bringing in other dimensions to understand where this change is coming from (country, gender, age, height, etc…) + - This gives a better picture than just “number down → ☹️☹️☹️”. + +### Aggregation based analyses + +- The most common type of analysis, probably more than 50% of all analyses. +- `GROUP BY` is your friend +- Upstream dataset is often the “daily metrics”. +- Common types of analysis + - Root cause analysis (why is this thing happening) + - Trends + - Composition + +These analyses you shouldn’t be going back to the fact data, even though there’s a strong urge to go there. The problem is fact data should be aggregated along the dimensional line (e.g. user, or listing_id, or device_id etc…). + +You wanna have things already aggregated up because if you go back to fact data, then join all the dimensions, then aggregate the result, it’s gonna be complicated because dimensions aren’t gonna be 1-to-1. + +You want to have some kind of pre-aggregation on the fact data, like “daily data”, on some dimension such as `user_id`, and then join this one with the users table (or whatever table you’re aggregating on) [admittedly this part was a bit hard to understand — Ed.]. + +There are a couple gotchas: when doing aggregation based analyses, you wanna be careful to not bring in too many dimensions. If you start bringing in too many, ultra-specific dimensions and apply all of them at the same time, then basically you just go back to the daily data, as the groups can get as small as a single person (we saw this in a previous lecture). 
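+To make the pre-aggregation idea a few paragraphs above a bit more concrete (the part flagged as hard to follow), here is a minimal sketch. The `daily_user_metrics` and `users` names are made up; the point is that analyses join a small, already-aggregated daily table to the dimension table, instead of re-scanning and re-joining the raw fact data every time.
+
+```sql
+-- 1) pre-aggregate the fact data once, along the user_id dimensional line
+create table daily_user_metrics as
+select
+    user_id,
+    date(event_time) as metric_date,
+    count(1)         as num_events
+from events
+group by user_id, date(event_time);
+
+-- 2) analyses then group the small daily table joined to the dimension table
+select
+    u.country,           -- hypothetical users dimension table
+    m.metric_date,
+    sum(m.num_events) as num_events
+from daily_user_metrics m
+join users u on u.user_id = m.user_id
+group by u.country, m.metric_date;
+```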
+
+Also, when looking at %-based metrics, make sure that you're also looking at actual counts. A 100% drop of something might just be a single person's action (e.g. going from 1 → 0).
+
+Another gotcha is when doing a long time frame analysis (e.g. > 90 days): you don't want to have too many cuts in your data. So don't do a per-day analysis, but maybe do it per week or month. A daily grain with a long term analysis will give you a lot of rows, especially if you also bring in another high cardinality dimension [also, it's very noisy — Ed.].
+
+## Cumulation-based patterns
+
+- State transition tracking
+- Retention (also called J-curves, or survivorship analysis).
+
+These patterns are all based on the cumulative table design that we worked on in week 1.
+
+- **Time is a significantly different dimension** vs other ones
+    - Yesterday vs today
+- `FULL OUTER JOIN` is your friend here (built on top of cumulative tables)
+    - You need to keep track of where there ISN'T data. That's another big difference between the aggregation-based pattern and this pattern.
+    - **No data is data!** → The fact that someone didn't DO something is something we want to keep track of in cumulation-based patterns (in contrast to aggregation-based ones).
+- Common for the following patterns:
+    - State change tracking
+    - Survival analysis (also called retention)
+
+### Growth accounting
+
+A special version of state transition tracking. This is where the cumulation part of it will make more sense.
+
+There are 5 states the user can be in:
+
+- New (didn't exist yesterday, active today)
+- Retained (active yesterday, active today)
+- Churned (active yesterday, inactive today)
+- Resurrected (inactive yesterday, active today)
+- Stale (inactive yesterday, inactive today).
+
+In some patterns you have a 6th state like "deleted" or "deactivated" (active/inactive yesterday, doesn't exist today).
+
+This pattern is very powerful. Take a look at this chart, which is a result of it (MAU stands for monthly active users).
+
+![image.png](images/d1le_image.png)
+
+You can calculate **growth** as `new + resurrected - churned`, which will give you the growth rate of your business → how many incremental people are coming in (people coming in minus people leaving).
+
+> This is not just specific to growth.
+>
+
+You can do this in many different areas. "Churned", "resurrected" and "new" are just labels.
+
+You can use this pattern more abstractly, on different things. For instance, Zach was using it when tracking fake accounts, so he would have these labels:
+
+- New fake account (account that was never fake, now is fake)
+- Resurrected → Reclassified fake account (fake account that was "approved" as a real person and then classified fake again)
+- Churned → Declassified (in this case a good thing, a fake account that is leaving)
+
+As you can see, it's the same set of states as for user growth.
+
+Another example at Netflix: they were labeling applications in their microservices as "Risky" or "Not risky", and they wanted to track the flow of this as well.
+
+One of the things that these types of patterns give you is very good monitoring, e.g. you can monitor the health of your ML models (e.g. fake accounts, or risky microservices etc…).
+
+Another example at Airbnb: another ML model, this one for hosts, based on the probability that they were gonna cancel on you last minute, which is obviously very bad, especially once the customer has already flown to the destination. They did the same kind of tracking on this model, labeling hosts as "risky" and "not-risky".
This analysis is good for 2 things:
+
+- You can look at these charts to track the health of the ML model
+- But also to track the effectiveness of whatever you're doing (do your actions have an impact on hosts' behavior?)
+
+### Survivorship analysis and bias
+
+![image.png](images/d1le_image%201.png)
+
+In WW2, planes would get into dogfights, then fly home, and they were like "we need to reinforce those areas with all the shots!". But then a really smart person was like "hold up, those are the planes that survived, so the areas that need to be bolstered are those with NO bullet holes!".
+
+The point here is that survivorship, and how long things survive, is an important measurement to have in our analytical patterns.
+
+**Survivor analysis J-curves**
+
+![image.png](images/d1le_image%202.png)
+
+If you think about retention, that is essentially "surviving".
+
+Look at the chart, there are 3 ways that things can go. See how at the top everything starts at 100%; the fundamental component of this analysis is that we're starting from a point (like a date) where everyone is on the same page, and then as time progresses, the state will change: some users will stick around, some will leave.
+
+If your app is like the grey line, it's kinda doomed, because as you get new users, over time they will keep going away forever; you lack the "stickiness".
+
+In the green and orange lines instead, you have found a successful app, as there's a certain % of users that stay for the long term.
+
+There are other applications of J-curves beyond user growth:
+
+![image.png](images/d1le_image%203.png)
+
+Basically, as you can see from the chart, there's some kind of "state" to be retained over time, which can be pretty much anything, and it is checked versus a reference date.
+
+## Window-based analyses
+
+- DoD / WoW / MoM / YoY (day, week, month, year)
+    - Zach likes to think of this like a derivative, a rate of change over time. Using window functions here is great.
+- Rolling Sum / Average
+    - This one is the opposite of the above, like an integral, i.e. the cumulation over a certain period of time.
+- Ranking
+    - This one doesn't need to be solved by window functions necessarily, but it depends on how complicated the ranking is.
+
+One of the keywords of these analyses is **rolling**. For **rolling**, the syntax is the same every single time:
+
+```sql
+function()
+    over(partition by keys order by sort rows between n preceding and current row)
+```
+
+The sort here is often by **date**, then you **partition by** whatever dimensional cut you're doing (user_id, country, etc…), and **"n"** is gonna be the number of rolling days.
+
+One thing that's interesting is that these 2 lines (week-over-week vs rolling sum / average) are in contrast to each other: the first is spiky, whereas the 2nd is smoothed out.
+
+![image.png](images/d1le_image%204.png)
+
+> Side note: make sure to partition on something if you're using big data, because if you don't, your windows are gonna be so huge and it's gonna cause OOM errors to happen.
+>
\ No newline at end of file
diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lab.md b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lab.md
new file mode 100644
index 00000000..9a9423c7
--- /dev/null
+++ b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lab.md
@@ -0,0 +1,315 @@
+# Day 2 - Lab
+
+For this lab, we will be working almost exclusively with the `events` table.
+ +```sql +select * from events; +``` + +The first thing we’ll do is to figure out for every person that goes to sign up page, how many of them actually sign up. + +There’s two URLs that will determine this. + +- `/signup` +- `/api/v1/users` (or `/api/v1/login`) + +[The data we have is different than Zach’s, so we will use `/api/v1/login` for the sake of the exercise — Ed.] + +```sql +select * from events +where url in ('/signup', '/api/v1/login'); +``` + +We want to understand what % of people who reached the signup page, eventually signed up. We’re gonna do this without window functions, we’re going to do with with self joins. + +Let’s get started by deduping and filtering the dataset. + +```sql +with deduped_events as ( + select + user_id, + url, + event_time, + date(event_time) as event_date + from events + where user_id is not null + and url in ('/signup', '/api/v1/login') + group by user_id, url, event_time, date(event_time) +) + +select * +from deduped_events; +``` + +What we want to do now is say *“did this user, who visited a sign up page, did they ever sign up after they visited the sign up page”?* + +We join the table on itself, so the visiting of the sign up page and the sign up event are on the same row. + +```sql +-- [..] + +select + d1.user_id, + d1.url, + d2.url as destination_url, + d1.event_time, + d2.event_time +from deduped_events d1 + join deduped_events d2 + on d1.user_id = d2.user_id + and d1.event_date = d2.event_date + and d2.event_time > d1.event_time +where d1.url = '/signup' +and d2.url = '/api/v1/login'; +``` + +[The results of the queries are janky because we don’t have the same data as Zach — Ed.] + +This shows all users who did signup before, and later logged in, in one single row. + +Let’s now expand the above query + +```sql +-- [..] + +selfjoined as ( + select + d1.user_id, + d1.url, + d2.url as destination_url, + d1.event_time, + d2.event_time + from deduped_events d1 + join deduped_events d2 + on d1.user_id = d2.user_id + and d1.event_date = d2.event_date + and d2.event_time > d1.event_time + where d1.url = '/signup' +) + +select + user_id, + MAX(case when destination_url = '/api/v1/login' then 1 else 0 end) as converted +from selfjoined +group by user_id; +``` + +This one will now show the users who visited the signup page and converted, as well as the users who visited the signup page **but didn’t convert.** + +Aggregating once more, we can see the totals of this, and the global conversion rate: + +```sql +-- [..] + +userlevel as ( + select + user_id, + MAX(case when destination_url = '/api/v1/login' then 1 else 0 end) as converted + from selfjoined + group by user_id +) + +select + count(1) as total_users, + sum(converted) total_converted, + round(sum(converted) * 1.0 / count(1), 2) as conversion_rate +from userlevel; +``` + +Let’s now finish the query by adding extra information, including all pages above a certain number of hits (to prevent nonsense), and removing the very specific filters on just `signup` and `api/v1/login` (note the commented out rows). 
+ +```sql +with deduped_events as ( + select + user_id, + url, + event_time, + date(event_time) as event_date + from events + where user_id is not null +-- and url in ('/signup', '/api/v1/login') + group by user_id, url, event_time, date(event_time) +), + +selfjoined as ( + select + d1.user_id, + d1.url, + d2.url as destination_url, + d1.event_time, + d2.event_time + from deduped_events d1 + join deduped_events d2 + on d1.user_id = d2.user_id + and d1.event_date = d2.event_date + and d2.event_time > d1.event_time +-- where d1.url = '/signup' +), + +userlevel as ( + select + url, + count(1) as number_of_hits, + user_id, + MAX(case when destination_url = '/api/v1/login' then 1 else 0 end) as converted + from selfjoined + group by user_id, url +) + +select + url, + sum(number_of_hits) as num_hits, + sum(converted) as num_converted, + round(sum(converted) * 1.0 / sum(number_of_hits), 2) as conversion_rate +from userlevel +group by url +having sum(number_of_hits) > 500; +``` + +One issue with this query, w.r.t. the previous one, is that whereas before the hits on a certain page per user where counted only once, now they’re double counted all the time. + +[The way Zach fixes this issue is dubious to me imho, but he was rushing a bit so I didn’t want to include it in the notes in detail. It’s not that much relevant anyway, as the goal here is to understand the analytical patterns rather than specific SQL gymnastics — Ed.] + +Anyway, this is the idea behind funnel analysis. You have two events and you want to see “this happens after that”. + +--- + +Let’s now work with another query to work with `grouping sets`: + +```sql +with events_augmented as ( + select + coalesce(d.os_type, 'unknown') as os_type, + coalesce(d.device_type, 'unknown') as device_type, + coalesce(d.browser_type, 'unknown') as browser_type, + url, + user_id + from events e + join devices d on e.device_id = d.device_id +) + +select * from events_augmented; +``` + +What we’re going to do now is looking at website events, and see what type of device is the one that’s most common. + +The naive way to do this, is to just run: + +```sql +-- [..] + +select + os_type, + device_type, + browser_type, + count(1) +from events_augmented +group by os_type, device_type, browser_type; +``` + +But what if we’re interested in different kind of slices of this data? + +This is where grouping sets come in. + +> Remember that you have to put all grouping columns in the grouping sets at least once, otherwise the query will fail. +> + +```sql +-- [..] + +select + os_type, + device_type, + browser_type, + count(1) +from events_augmented +group by grouping sets ( + (browser_type, device_type, os_type), + (browser_type), + (os_type), + (device_type) +); +``` + +Run the above query and sort by `count` descending. You see the nulls? That’s because of grouping sets, or in other words, it’s because they have been ignored in that specific grouping, for that particular row. + +To give it more meaning, you can do something like this: + +```sql +-- [..] + +select + coalesce(os_type, '(overall)') as os_type, + coalesce(device_type, '(overall)') as device_type, + coalesce(browser_type, '(overall)') as browser_type, + count(1) +-- [..] +``` + +So this will give you the indication that when the value says `(overall)`, it means that column for that row has all values included in that group. + +E.g. 
+ +| os_type | device_type | browser_type | count | +| --- | --- | --- | --- | +| Mac OS X | (overall) | (overall) | 3304 | + +means that all Mac OS X users, regardless of `device_type` and `browser_type`, in total accumulated 3304 events. + +Now try adding these lines + +```sql +select + grouping(os_type), + grouping(device_type), + grouping(browser_type), +-- [..] +``` + +You will see that the 1st 3 columns now are a bunch of 0s and 1s. What it means is that if the value is 0, the column is being grouped upon, and if the value is 1, it means the column is not being grouped by. + +Let’s now build a new column for this table, that uses these groupings, to understand which grouping set is being applied for each row. + +```sql +-- [..] + case + when grouping(os_type) = 0 + and grouping(device_type) = 0 + and grouping(browser_type) = 0 + then 'os_type__device_type__browser_type' + when grouping(os_type) = 0 then 'os_type' + when grouping(device_type) = 0 then 'device_type' + when grouping(browser_type) = 0 then 'browser_type' + end as aggregation_level, +-- [..] +``` + +A probably smarter way to do the above, especially in the case of several combinations, is to write it like this: + +```sql + array_to_string(array[ + case when grouping(os_type) = 0 then 'os_type' end, + case when grouping(device_type) = 0 then 'device_type' end, + case when grouping(browser_type) = 0 then 'browser_type' end + ], '__') as aggregation_level +``` + +This way, you can get any kind of `aggregation_level` automatically, without having to write them all out explicitly. + +Let’s now create a table out of all this query + +```sql +create table device_hits_dashboard as ( +-- [..] +); +``` + +Now you can run queries like + +```sql +select * +from device_hits_dashboard +where aggregation_level = 'device_type'; +``` + +which means in your dashboards you don’t have to query event data, as it’s pre-aggregated, and **it can be powered based on `WHERE` conditions rather than `GROUP BY` conditions**. You even have the largest aggregation level like `os_type__device_type__browser_type`, which shows the results as if we were grouping for all 3 columns at once. diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..d262ebeb --- /dev/null +++ b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,126 @@ +# Day 2 - Lecture + +# Intro + +In this lecture we will talk about data engineering design patterns used at Meta. + +Sometimes, you’ll see things in DE SQL interviews that you will almost never do on the job, like: + +- Rewrite the query without window functions +- Write a query that leverages recursive CTEs +- Using correlated subqueries in any capacity + +The long story short is that what you get asked in SQL interviews vs what you will have to do on the job often can different, and the job has a much more pragmatic approach. + +However, there are some things about DE interviews that are right to ask: + +- Care about the number of table scans + - `COUNT(CASE WHEN)` is a very powerful combo for interviews and on the job + - Cumulative table design minimizes table scans +- Write clean SQL code + - CTEs are your friend + - Use aliases + +> Also, if you understand how ASTs are generated with SQL, the odds of writing bad performing queries kind of go away. 
+> + +# Advanced SQL techniques to try out + +- GROUPING SETS / GROUP BY CUBE / GROUP BY ROLLUP + - This is essentially a way to do multiple aggregations in one query without having to do nasty unions. E.g. you can GROUP BY `gender` AND `country`, but then just `gender` and then just `country`, and then also `overall`. +- Self-joins + - In the table, we will use self-joins to create a funnel +- Window functions + - Lag, Lead, Rows clause + - Can calculate stuff like rolling averages and stuff +- CROSS JOIN UNNEST / LATERAL VIEW EXPLODE + - `UNNEST` is how you can turn an array column back into rows, essentially it explodes the array + - UNNEST is same as LATERAL VIEW EXPLODE, it depends on the query engine + +## Grouping sets + +```sql +FROM events_augmented +GROUP BY GROUPING SETS ( + (os_type, device_type, browser_type), + (os_type, device_type), + (os_type), + (browser_type) +) +``` + +Grouping sets are the most complicated. This is like doing 4 queries / aggregations at once. + +The way you would do this without grouping sets, you’d need to copy the query 4 times, and then need to put dummy stuff in the values not being grouped, and then UNION ALL everything. + +With `GROUPING SETS` you gain both in performance and readability. + +One thing that’s important to do when using grouping sets, is to **make sure** that these columns are **never NULL.** Thats because they get already “nullified” when they’re excluded from the group bys, and if you already have nulls, you don’t know which is which. + +> A best practice here is, before doing any of these grouping patterns, you want to **COALESCE** all the grouping dimensions to things like **“unknown”**. +> + +## Cube + +```sql +FROM events_augmented +GROUP BY CUBE(os_type, device_type, browser_type) +``` + +What `CUBE` does is it gives you all possible permutations here. In case of 3, it’s total of 8 possible combinations: with 3 cols, with 2 cols, with 1 col and with 0 cols. + +Don’t use cube with more than 3-4 total dimensions because that explodes into so many different combinations. + +Also, another problem with CUBE is that it does too much. It can even give you combinations that you don’t care about, and waste compute time on them. + +## Rollup + +```sql +FROM events_augmented +GROUP BY ROLLUP(os_type, device_type, browser_type) +``` + +You use `ROLLUP` for hierarchical data (imagine like country, state, then city). + +In `ROLLUP`, the number of dimensions is equal to the number of aggregations you get. In the country example, you’d get a group by country, then country and state, then country and state and city. + +# Window functions + +- Very important part of analytics. + +The function is composed of two pieces: + +- The function (usually RANK, SUM, AVG, DENSE_RANK, …) - *[Zach’s comment on `RANK`, never use it because it skips values. In most cases he either uses `DENSE_RANK` or `ROW_NUMBER`. — Ed.]* +- The window + - PARTITION BY → How you cut up your window (basically like GROUP BY) + - ORDER BY → This is how you sort the window (great for ranking or rolling sums etc…) + - ROWS → Determines how big the window can be (can be all rows, or less) + - `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` → this is the default. + - Window functions basically always look back. + +# Data modeling vs advanced SQL + +If your data analysts need to do SQL gymnastics to solve their analytics problems, you’re doing a bad job as a data engineer. 
Obviously this is not ALWAYS the case; sometimes they will do crazy queries because that’s what’s necessary at that specific moment.

The point is for you to remove as much complexity as you can.

Don’t make the assumption that your analysts are as proficient with SQL as you are. Don’t give them garbage!

Obviously, focusing on data modeling and data quality makes some of these problems disappear!

Spend time understanding what the analysts are doing, querying and presenting, so you can see where the bottlenecks are, how to remove them, and how to speed up their processes.

When you make analysts faster as a DE, you’re doing your job.

## Symptoms of bad data modeling

- Slow dashboards
    - This is where grouping sets can be really useful
    - If you’re using row or daily lvl data in your dashboards, without pre-aggregating, eventually it’s gonna get slow if your company scales to a big enough level → your dashboard is not gonna work
    - With pre-aggregation instead, it scales far better
    - What Zach did at Facebook, since providing row lvl data to Tableau would be impossible, was to pre-aggregate data to the dimensions that people cared about (e.g. country, device, app, etc…) → going from billions of rows to a few rows.
- Queries with a weird number of CTEs
    - If there are 10 CTEs in the queries, especially if you see the same queries over and over, then you probably want to implement a staging step. If it’s always the same 10 CTEs, the first 5 should probably be materialized somewhere in a table.
    - Storage is cheaper than compute, remember this about data modeling (and you get time back by storing things).
- Lots of CASE WHEN statements in the analytics queries
    - This means that your data model is not robust enough, or not conformed enough, i.e. you’re not conforming the values to what they need to be (a minimal sketch of what conforming looks like follows right after this list).
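To make that last point concrete, here’s a minimal sketch of conforming a messy dimension once in the data model, so analysts don’t have to repeat the same `CASE WHEN` in every query. The `raw_events` table, the `platform` column, and the specific mappings are all hypothetical, not something from the lecture itself:

```sql
-- Conform the raw platform strings once, in the modeled table,
-- instead of letting every analytics query re-derive them.
CREATE TABLE conformed_events AS
SELECT
    user_id,
    event_time,
    CASE
        WHEN lower(platform) IN ('ios', 'iphone', 'ipad')  THEN 'iOS'
        WHEN lower(platform) IN ('android', 'android_tv')  THEN 'Android'
        WHEN platform IS NULL OR platform = ''              THEN 'unknown'
        ELSE 'other'
    END AS platform    -- conformed, never NULL
FROM raw_events;
```

Downstream queries can then just `GROUP BY platform` with no gymnastics.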
diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1la_image.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1la_image.png new file mode 100644 index 00000000..f022d174 Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1la_image.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 1.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 1.png new file mode 100644 index 00000000..b1f8500e Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 1.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 2.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 2.png new file mode 100644 index 00000000..d2f63be5 Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 2.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 3.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 3.png new file mode 100644 index 00000000..44df8cd6 Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 3.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 4.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 4.png new file mode 100644 index 00000000..bef58ba1 Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image 4.png differ diff --git a/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image.png b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..0cc6287a Binary files /dev/null and b/bootcamp/materials/4-applying-analytical-patterns/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..ab95c9c9 --- /dev/null +++ b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,41 @@ +# Day 1 - Lab + +First of all, to setup, open an account on Statsig, get a free Server API key from the settings, and add it to your env variables. + +Then, install the dependencies for this module (kpis and experimentation), and finally run the flask server with `python src/server.py`, and visit `localhost:5000` to make sure everything is working properly. + +If you now go to `/tasks`, you’ll see you have received a specific color. + +Now open [statsig console](https://console.statsig.com), click “experiments” on the left, and click “Get started”. Call the experiment `button_color_v3` and in the hypothesis write “I think the red button is the best one”, then click “create”. + +Now you want to think about the metrics, let’s pick “dau” (daily active users). + +Then, at the bottom of the page, under “Groups and Parameters”, click “Add a parameter”, call it “Button Color”. Then put “Blue” in control and “Red” in test. You can add a couple more groups and give them value “Green” and “Orange”, respectively. 
Now we have 4 groups, all evenly split (25% each).

When you’re done, click “Save”, and at the top of the page click “Start”, then “Start” again in the modal that appears.

If you now go back to `localhost:5000/tasks` and add the query `?random=true`, and refresh the page a few times, you’ll see the color changes each time. This is to simulate different users.

So what we just did is simulate different daily active users, which tells us whether someone is daily active. But now we want to know: “OK, do they do more after visiting a page?”

If you look at the bottom of the `tasks` page, you’ll see there’s a little link to “signup”, which you can click.

In the code, under the `signup()` function, you will see that we’re calling `statsig.log_event()`. This is to communicate to Statsig that the user visited the signup page.

Now play with these links a bit to generate some data, then go back to Statsig → experiments → `button_color_v3` → Diagnostics, and you will see a bunch of the events we created.

Unfortunately, it can take a while, even a day, for Statsig to generate some reports, so this lab ain’t exactly comprehensive.

One thing that you might want to keep in mind is “when is stuff logged”. Right now, our events are logged server-side, but you can also do client-side logging, and they have different benefits and risks.

Server-side logging is easier to set up, whereas client-side logging is trickier. One of the reasons is that we don’t want random clients to just log whatever they want to our servers, so you have to deal with OAuth to authorize these requests.

On the flip side, you get better-fidelity data, because you can track when the user clicks the action, rather than when the server picked it up. So you get better, higher-quality events, as well as more variety, like scroll time, view time, etc…

In Statsig there are other fancy functionalities that you can control. For instance, you have “Feature gates”, which are like toggles, e.g. enable or disable a feature to see how the user behaves.

Another thing that you might find in Statsig is that you can split your experiments into groups, and notice that something that is statistically significant for one group might not be significant for another.

This is important to consider, because overall you might see (or not see) an effect, but if you break it down into smaller groups, you might notice something completely different.

One useful feature in Statsig is that you can tag your metrics. You can create new tags under settings, and then you can assign them to metrics however you please. This way, you could for instance tag certain metrics as **guardrail** (we saw in previous lectures that guardrail metrics are metrics that block a new feature from being released if they behave a certain way, e.g. if they go down).

\ No newline at end of file

diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..df15d625 --- /dev/null +++ b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,177 @@

# Day 1 - Lecture

# Intro

Thinking like a product manager is critical for being a good data engineer.
This looks like pipelines that actually impact the business: + +- Building good metrics that change business decision making +- Build metrics that are impacted by experiments + +In this lecture, we will look at metrics and Statsig. + +# Why do metrics matter at all? + +It really depends on the company and many other different things. In some companies, metrics matter less than in others. It depends on the culture of the company rather than the metrics themselves. E.g. difference between Airbnb and Facebook, Airbnb cared about metrics less than Facebook. + +But metrics are important, they provide visibility, they explain the world and your business, especially as you get more and more of them. The more clarity and visibility you have, the fewer “icebergs” you’re gonna crash into. + +--- + +Things we’ll cover today: + +- Metrics play a huge part in data modeling +- Making sure metrics can’t be “gamed” +- The inner workings of experimentation +- The difference between feature gates and experimentation +- How do you plug metrics into experimentation frameworks? + +# Complex metrics mean complex data models + +When you’re building the spec (as we’ve seen in previous lectures), this concept you should keep in mind at all times. + +If stakeholders are asking for something that’s more than simple aggregations, but instead some weird ultra specific metric, like average rolling percentile sum whatever, that’s more a data scientist job, most times you should push back on it. You should give them the raw aggregates, and let the analytics people figure it out afterwards. + +Don’t let the data scientists dictate your data models. + +# Types of metrics + +- Aggregates / Counts +- Ratios +- Percentiles (p10, p50, p90 values) + +Generally speaking, as a DE, you should be mostly supplying aggregates and counts. And what grain you supply those at is usually at the entity grain (daily metrics → user_id, metric_name, metric_value). + +**Aggregates / counts** + +The Swiss army knife and the most common type data engineers should work with. + +**Ratios** + +Ratio metrics are in general about indicating quality. Data engineers should supply numerators and denominators, **NOT THE RATIOS** between them + +Examples: + +- Conversion rate +- Purchase rate +- Cost to acquire a customer + +Keep in mind that when you cut by any dimension (e.g. operative system), then the numbers might not add up with ratios (additive vs non additive properties). + +**Percentile metrics** + +Useful for measuring the experience at the extremes + +Examples: + +- P99 latency → e.g. how fast is our worst experience (top 1% latency) when a website loads? +- P10 engagement of active users + +# Metrics can be “gamed” and need balance + +- Experiments can move metrics up short-term but down long-term + - E.g. Notifications at facebook: + - Send more → get more users in the short term, in the long term you lose that lever, because what happens is that people turn of settings, and then you can’t access those users anymore + - Create other metrics that can measure this stuff, like “reachability” (% of users who turned off settings). 
+ - In the case above, they would notice that spamming people with notifications would be detrimental +- Fiddle with the numerator or the denominator + - You can really get your metrics / experiments to tell you whatever you want + - Really have clear hypothesis when you start your experiments so you can test them + - “P-hacking” → avoid it +- Novelty effects of experiments + - When you introduce new things, people get excited about them, but then it fades out as the novelty of the new feature wears off + - Be aware for how long the experiment should run +- Increased user retention at extreme cost + - Netflix example: refreshing the feed in the background, when the app is not being used, increases retention + - But at what cost? If it’s millions of $ in AWS costs, maybe the feature isn’t worth it + - Make counter metrics → e.g. AWS cost for the experiment / we get this much increased retention at X cost. + - Also, diminishing returns play a big role + +# How does an experiment work? + +- Make a hypothesis! +- Group assignment (think test vs control) +- Collect data +- Look at the differences between the groups + +## Hypothesis testing + +Where you have the “null hypothesis” and the alternative hypothesis. + +- In the null hypothesis → there is no difference between **test** and **control** +- The alternative hypothesis → there is significant difference from the changes + +Remember: you never prove the alternative hypothesis → Instead, you fail to reject the null hypothesis! + +# Group testing + +Who is eligible? How we assign group members? + +- Are these users in a long-term holdout? + - A long-term holdout is a long running experiment over a group of people (e.g. not being sent notifications), so these shouldn’t be considered for other experiments +- What percentage of users do we want to experiment on? + - A lot of the time, your experiment groups are not 100% of your users. It can be some fraction of your users based on certain criteria. + - In big tech you get the luxury to experiment on a small % of users that you don’t get in smaller companies, because you don’t have a lot of users to begin with, and so you don’t have a lot of statistical power with small % of users. + +### Group assignment + +![image.png](images/d1le_image.png) + +Logged in-experiments are more powerful than logged-out ones, because you have a lot more information about those users. + +Remember that you need to track your events during the experiments. They can happen either on client or on the server, depending where you do your logging. Statsig offers APIs for both. + +A missing piece of this diagram: + +- What about other metrics that aren’t on Statsig? +- You can have some kind of ETL process that dums this data on it. + +## Collect data + +You collect data until you get a statistically significant result. Generally speaking, at least a month is a good duration. + +> The smaller the effect, the longer you’ll have to wait to get a statistically significant result +> + +Keeping in mind some effects may be so small you’ll never get a statistically significant result! + +Make sure to use stable identifiers (Statsig has it) + +- Hash IP addresses to minimize the collection of PII +- Leverage Stable ID in Statsig + +- Do not UNDERPOWER your experiments + - The more test cells you have, the more data you will need to collect! + - It’s unlikely that you have the same data collection as Google and can test 41 different shades of blue. 
+- Collect all the events and dimensions you want to measure differences, and make sure you’re logging them BEFORE you start the experiment. + +## Look at the data + +![image.png](images/d1le_image%201.png) + +In Statsig, if you have a bar that overlaps 0, so you have both positive and negative results, then it doesn’t count as statistically significant. + +In this specific experiment, only 1 bar is (the green one), however Zach choose a p-value of 0.2 (80% confidence interval). + +- P values are a method we use to determine if a coincidence is actually a coincidence or not + - If p-value < 0.05, then there’s a 95% chance that the effect you’re looking at is not due to coincidence, but some other factor. + - The lower the p-value, the higher certainty you have that the effect is not due to randomness +- P < 0.05 is the industry standard. Although depending on your situations you might want higher or lower. + +**Statistically significant results may still be worthless!** + +- Maybe a result is significant but it’s a tiny delta + - like a 1% lift if using a red button vs blue button +- Maybe there are multiple statistically significant results in opposite directions + +### Gotchas when looking for statistical significance + +Imagine you’re measuring notifications received, and in a group you have normal people, and in another you have Beyonce, if you look at the averages, the 2nd group will be much higher. You wanna be careful about extreme outliers. + +- **Winsorization** helps with this (clip the outlier to a lesser value, like 99.9 percentile) +- Or looking at user counts instead of event counts + +### Statsig can create metrics. How about adding your own? + +You can add your own user-level metrics to Statsig via batch ETLs. + +This is a common pattern in big tech for data engineers to own these. diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..420ac969 --- /dev/null +++ b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,166 @@ +# Day 2 - Lecture + +# Leading vs lagging metrics + +Ultimately, every metric that you write should be correlated, in some way, with money. And the metric can even be money itself. Revenue is its own metric. + +One of the problems of “revenue” as a metric, is that is quite slow moving, and there’s a lot of things that have to happen before revenue happens, therefore revenue is called **“lagging metric”**, whereas the alternative might be called **“leading metric”**. + +As Data Engineers, the data that we provide should be linked to revenue either directly, or indirectly by decreasing costs, which can happen in all sorts of ways: optimizing pipelines, making a process more efficient by providing better data etc… + +One exception to this rule is where you’re working in a non-profit, in that case probably your North Star metric is different, but even then, they still care about reducing spend. + +--- + +- Are we measuring inputs or outputs? + - Does our effort impact our output? +- Are inputs and outputs correlated? (i.e. conversion rate) + +Imagine a funnel like social media impression → website visits → signup → purchase. The further away you are from money, the more the metrics are “leading” or “input”. + +This idea of leading vs lagging metrics can also apply to your life. E.g. 
if you’re getting a job, you can say “I’m gonna spend 100 hours practicing SQL”, and that can be your “input” metric. + +The final funnel metric is going to be “jobs received”. You can have an extremely lagging metric, like “in 5 or 6 years I’m gonna make a bootcamp where I teach people how to get better jobs”. + +Anyway, when applying for a job you have a lot of different things to think about: number of applications sent, how many recruiters you talk to, how many interviews you pass, etc… There’s a lot of steps. + +The conversion rate here can be “hours spent applying divided by job interviews”. + +## The learning / job funnel + +![image.png](images/d1le_image.png) + +### The top of the funnel + +This is where you have your learning and networking. + +Common mistakes + +- Tutorial hell +- Going too broad +- Not networking +- Networking only on LinkedIn + +![image.png](images/d1le_image%201.png) + +### The middle of the job funnel + +Common mistakes + +- Applying only to jobs you’re 100% qualified for +- This makes your growth very slow +- Only filling out apps, not networking + +![image.png](images/d1le_image%202.png) + +### The bottom of the job funnel + +![image.png](images/d1le_image%203.png) + +Common mistakes + +- Not asking questions in interviews + - “If you could change 1 aspect of your job, what would it be?” + - “If I’m 2 weeks behind in a project and I don’t know which direction it’s gonna go, how would you react as a manager?” + - Don’t make the interview one-sided by being passive +- Not recognizing unsupportive managers + - An unsupportive manager is one of the most detrimental things you can have in your career +- Focusing only on getting better at code + +### The end of the job funnel + +Common mistakes + +- Staying in a toxic job because of the pay +- Not caring enough to help others + +![image.png](images/d1le_image%204.png) + +# Product analytics example of leading vs lagging + +- The most extreme leading metric is: ad spend on an impression +- The most extreme lagging metric is: a testimonial or a repeat purchase + +### The funnel that runs the world + +![image.png](images/d1le_image%205.png) + +### Top of the funnel + +![image.png](images/d1le_image%206.png) + +One of the reasons Zach was able to reach us with this course, is because for 2.5 years, Zach did tons and tons of impressions on LinkedIn → Organic growth. + +Each step of this funnel can be A/B tested. + +Also, one of the biggest things you need to do, when you have a bunch of different channels that are spilling into your business, you need to track where people are coming from. + +E.g., for Zach, his biggest source in terms of conversion rate was Instagram. Tiktok converted 0, even if at the time he had 31k followers, and LinkedIn converted the most in absolute numbers due to his large following. So you need to understand where to invest the most. + +Another thing that’s important is to avoid repeating the same steps of the funnel for the people that already went through them. One example is Zach collecting emails before he had a product to sell, so that when he did, he could contact his audience directly, without repeating the first 2 steps (impressions / mail acquisition). He did that by promising a newsletter, which also didn’t exist, and came like a year later or so. + +You also want to test responsiveness and page speed at this stage, and optimize for them. Remember that not everyone in the world runs on an iPhone 14 with the best specs. 
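As a hedged sketch of the “track where people are coming from” point above, channel-level conversion can be as simple as counting visitors and signups per source. The `site_events` table, the `utm_source` column, and the `'signup'` event name are all hypothetical, just to illustrate the shape of the query:

```sql
-- Visitors vs signups per acquisition channel, with a conversion rate.
SELECT
    COALESCE(utm_source, 'unknown')                        AS channel,
    COUNT(DISTINCT user_id)                                AS visitors,
    COUNT(DISTINCT CASE WHEN event_type = 'signup'
                        THEN user_id END)                  AS signups,
    ROUND(
        COUNT(DISTINCT CASE WHEN event_type = 'signup' THEN user_id END) * 1.0
        / NULLIF(COUNT(DISTINCT user_id), 0), 3)           AS conversion_rate
FROM site_events
GROUP BY COALESCE(utm_source, 'unknown');
```

This is the same numerator/denominator idea from the metrics lecture: supply the counts per channel, and the ratio falls out at query time.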
+ +### Middle of the funnel + +![image.png](images/d1le_image%207.png) + +Experiments and purchases are trickier because here you have more risks when fiddling with stuff. You risk upsetting current customers, and you wanna be careful about that. + +You also have to understand elasticity, or how people respond in price changes (a % change in price corresponds to how much % change in demand?). Ideally you want to optimize the pricing structure in order to maximize your gains. + +Pricing strategy is a big factor: imagine in Starbucks, you have 3 cups, small for $4, medium for $5.50, and large for $6, most people are gonna buy the large, because the medium is just a decoy. + +### Retention part of the funnel + +![image.png](images/d1le_image%208.png) + +A big part of engagement / retention is having a smooth onboarding process. Another one is schedule, admittedly this example is fit on Zach’s bootcamp itself. + +### Virtuous cycle of funnels + +![image.png](images/d1le_image%209.png) + +To have someone go from “engaged” to referral or testimonial, you need to impact their life in some way, more than just giving them knowledge. You need to give them something else, like a mentor, or network, or other stuff (again this example comes from Zach’s experience with the bootcamp). + +This is the next piece of the puzzle, to get people to be all the way down the funnel, and they become your brand champions so that you don’t have to market anymore, thanks to word of mouth. + +--- + +You’ll notice that in every single one of these steps have metrics that you can measure. Thinking like a product manager is how you should reason as a Data Engineer. If you’re doing product analytics, and you come up with a metric, you should be able to come up with a cohesive story about where that metric plays in the funnel. + +Example: + +In notifications at Facebook is part of the “retention” step, that’s how you get people engaged. Another one in that step is “friending”. + +If your metric doesn’t fit in this funnel anywhere, that’s probably a waste of time. There are some minor exceptions, but then you wouldn’t be a product focused DE anymore (e.g. if you were using on optimizing cloud costs). + +> Everything that you do in product analytics, always be thinking about funnels, ‘cause if you think about funnels, you’re going to impact the business way more, and it will help you put your work into context and help you build more robust data models. +> + +# How to think like a product manager? + +> At every step of the funnel, think “What is causing pain?” and “What is sparking joy?” +> + +**Examples from Zach’s bootcamp** + +For prospects + +- One of the thinks that spark joy in the prospective students was Zach’s content on social media about data engineering. +- People saw him as knowledgeable and signed up for the newsletter + +For new students + +- When buying the bootcamp, you get immediate access to a huge content library. students like that +- On the other side, the hiccups of getting people on Discord and Github, not being automated. + +For engaged students + +- Students are happy when they get put into groups so they feel less alone +- Students are happy when they get mentors as a reward for being engaged +- Students feel pain when they get overwhelmed by the intensity of the bootcamp + +For referral / testimonial + +- Students feel very happy when they land a better job from the skills they learned in the boot camp. They tell everybody they know about the boot camp. 
diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image 1.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image 1.png new file mode 100644 index 00000000..0c859860 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image 1.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..be1e2beb Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 1.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 1.png new file mode 100644 index 00000000..c21644ce Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 1.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 2.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 2.png new file mode 100644 index 00000000..eb4a276f Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 2.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 3.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 3.png new file mode 100644 index 00000000..a7928728 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 3.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 4.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 4.png new file mode 100644 index 00000000..e2416ff4 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 4.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 5.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 5.png new file mode 100644 index 00000000..a9d8b408 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 5.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 6.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 6.png new file mode 100644 index 00000000..54ee53f6 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 6.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 7.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 7.png new file mode 100644 index 00000000..a7450165 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 7.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 8.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 8.png new file mode 100644 index 00000000..7086584e Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 8.png differ diff --git 
a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 9.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 9.png new file mode 100644 index 00000000..92a0eb82 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image 9.png differ diff --git a/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image.png b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image.png new file mode 100644 index 00000000..b6647d70 Binary files /dev/null and b/bootcamp/materials/5-kpis-and-experimentation/markdown_notes/images/d2le_image.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lab.md b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lab.md new file mode 100644 index 00000000..c10a191b --- /dev/null +++ b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lab.md @@ -0,0 +1,34 @@ +# Day 1 - Lab + +In this lab, we create a pretend runbook for EcZachly Inc Growth Pipeline. + +--- + +# On call runbook for EcZachly Inc Growth Pipeline + +Primary Owner: Zach + +Secondary Owner: Lulu + +## Common Issues + +**Upstream Datasets** + +- Website events + - Common anomalies + - Sometimes referrer is NULL too much, this is fixed downstream but we are alerted about it because it messes with the metrics *[Non blocking DQ check — Ed]* +- User exports + - Export might fail to be extracted on a given day. When this happens just use yesterday’s export for today. + +**Downstream consumers** + +- Experimentation platform +- Dashboards + +## SLAs + +The data should land 4 hours after UTC midnight + +--- + +Obviously, this runbook is the ugliest thing ever, but the idea here is that now, if someone goes to this pipeline, they have a document they can refer to. \ No newline at end of file diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lecture.md b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lecture.md new file mode 100644 index 00000000..99ae873a --- /dev/null +++ b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 1 - Lecture.md @@ -0,0 +1,222 @@ +# Day 1 - Lecture + +# Intro + +Data pipeline maintenance is a inevitable part of DE. In this course we will cover: + +- How to do data migration? +- How to setup runbooks for oncall rotation? +- Ownership models (so that your DEs don’t become burned out) +- Common Ownership and Maintenance Problems + +--- + +We will also cover the following topics: + +- The difficult parts of the data engineering job +- Team models for data engineering + +As a DE, every time you build a pipeline, you then have to maintain it. + + + +Imagine every pipeline you have has a 10% chance of failing on a given day. + +- If you have 1 pipeline, that means if you’re on call, you’ll end up doing something 3 days a month. +- If you have 10 pipelines, that means being on call you’ll have to do something every single day. +- If you have 100, every day you end up doing 10 things to maintain them. + +That’s something to think about, as you have more and more pipelines, the burden becomes unsustainable. + +> DE isn’t just writing SQL and sipping martinis on a beach +> + +As a data engineer, you’re very subject to burnout, so you need to protect your peace and wellbeing. + +If you don’t, DE is gonna eat you alive. It’s not a forgiving field. 
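A small aside on the failure math above: assuming each pipeline fails independently with probability p on any given day (my assumption, not something stated in the lecture), the expected daily workload and the chance of a quiet day are:

$$
E[\text{failures per day}] = N p, \qquad P(\text{quiet day}) = (1-p)^N
$$

With p = 0.1: one pipeline means about 0.1 failures per day (≈ 3 per month), ten pipelines means about 1 failure per day with only a (0.9)^{10} ≈ 35% chance of a quiet day, and a hundred pipelines means roughly 10 things to fix every single day.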
+ +# The difficult parts of data engineering + +- High expectations that can lead to burnout +- Data quality issues +- Unclear priorities and ad-hoc requests + +These are 3 main areas that cause frustrations in these field, and also the main reason why people **leave** data engineering, ‘cause even at the best companies they still haven’t figured out how to manage them all correctly. + +## High expectations + +- Building pipelines is a **marathon** +- Analytics is a **sprint** +- **SAY NO TO SPRINTING THE MARATHON!** + +Every time you cut corners on a pipeline to get an answer faster, you will regret it later. You’re eventually going to feel pain in either of these 3 buckets: quality, completeness or maintainability. + +On the flip side, analytics is a sprint, so the faster you can provide answers, the better. + +So there is kind of a push and pull between analytics and engineering. But you never want to “sprint the marathon” and get burned out. In other words, don’t be naive see every issue as a life-and-death situation. Take your time to reach the solution, businesses don’t move very quickly, so even if things are 1 or 2 weeks delayed, it’s not that big of a deal. Keep an eye out on how much pressure you put on yourself. + +> We’re not saving lives in data engineering. +> + +(obviously you should have a healthy amount of care, lest you’re gonna get fired 😅). + +## Ad-hoc requests + +- Analytics need to solve **URGENT** problem + - But to them, everything is urgent all the time → if everything is urgent, nothing is urgent. + +**Strategy to address ad-hoc requests** + +Allocate 5-10% per quarter to ad-hoc requests + +→ that’s about 1/2 day per week. Or in a quarter is about a week or two at most. This is how much time you should spend on ad-hoc requests. The other 12-13 weeks should be spent on long term infrastructure building projects. + +When evaluating ad-hoc requests you should evaluate the ROI: + +- Means that if the request is ad-hoc and complex, you probably shouldn’t drop what you’re doing + - When things are complex, they should be put on the roadmap → Include it in the quarterly plan in a structured manner, rather than at random at any given time, so that as a team you can figure out the right ways to process this, and you don’t slap together a rushed solution. +- Low hanging fruit can be prioritized though! + +In other words, don’t be a simp to your analytics partners and say yes to everything they ask right away, or they will keep doing it nonstop and you will never get anything done. Make it so that they are forced to prioritize and stop shooting requests with a machine gun. Make them feel a little pain (e.g. delayed results) as a consequence of their approach, so they willingly work to change it for the better. + +- Most analytics questions are less urgent than they appear → we’re not saving lives +- Get analytics partners input for quarterly planning so ad-hoc requests become scalable models! + +## Ownership models + +Who owns what? + +- Data sets +- Data pipeline +- Data documentation +- Metrics +- Experiments + +DEs might end up owning too much stuff, which obviously in the long run becomes unsustainable. Let’s see some ownership models. + +**The most common ownership model** + +![image.png](images/d1le_image.png) + +In reality, the edges are never so clearcut, ownership near them is more blurry, also depending on the people available in each team. 
Zach prefers this pattern than the next one because in the next one the data engineer risks becoming like an island, having too much knowledge of everything in a single individual. + +**Another common pattern** + +![image.png](images/d1le_image%201.png) + +This pattern can also be kind of good: owning the “metrics” as well for the data / analytics engineer team can be convenient because it’s very easy to update them, since they already know the master data very well. + +The trade-off being that often the data engineer doesn’t have enough time to develop the business context that they need to get good metrics (as this requires many conversations with many different people). + +### Where do ownership problems arise? + +At the boundaries! + +- Data engineers owning logging! +- Data scientists writing pipelines! +- Data engineers owning metrics! + +Basically, people owning things they shouldn’t be owning because other roles are lacking somewhere. + +Try to avoid going for *“whatever, I’ll do it myself!",* because you’re just sowing the seeds of your own burnout and creating a team structure that’s not sustainable. + +### What happens when these crossed boundaries persist? + +- **BURNOUT!** → People quit, get angry, get frustrated +- **BAD TEAM DYNAMIC / THE BLAME GAME!** → People blaming this or that person, or thinking you stealing credit, or not giving them time, or taking their work and stealing the promotion, etc… +- **BAD CROSS-FUNCTIONAL SUPPORT!** → Don’t be the sole owner of an entire pipeline, and have some backup, so if you leave or go on vacation, the business can have some questions answered. Solve these problems at the organizational level, not at the technical level. In other words, talk to the managers so that they can get the people that need to do a certain job to actually do it. It’s better to do that than to be the hero. + +# Centralized vs Embedded teams + +![image.png](images/d1le_image%202.png) + +Zach prefers **centralized data teams**. The tradeoff is that the teams that you support have to come to you, and you have to be able to prioritize their asks vs other teams asks. + +Oncalls are a big part of data engineering, because some pipelines will fail, and some of them are quite sensitive as they need data to land within a specific timeframe. + +Although, Zach did a small experiment at Airbnb where he intentionally let pipelines fail for a couple of days, and showed that the ML model that was fed by them would only lose 0.5% effectiveness for each day the pipeline was failing. + +The point for him was to show that troubleshooting that pipeline at 3:00AM was not worth the hassle! + +> Understand what the impact of a failing pipeline is, and if it’s huge impact, like $1M, then by all means do wake up at 3:00AM to fix it, but if the impact is marginally small, like $1k, then it’s most likely not worth it. +> + +In other words, you want to make oncalls as easy as possible, and seriously evaluate on which situations it is worth to do oncalls. + +Because a $1k loss might seem not insignificant, but imagine an engineer waking up 10 times at 3:00AM within 2 months to fix some shit and save the company $10k. That engineer is gonna quit, and the cost of replacement is going to be A LOT higher than $10k. + +--- + +The other model is **embedded**. In this one you have essentially one DE per team. Zach doesn’t like this one as much, because here the DEs are dedicated to a whole business area, but more isolated w.r.t. other data engineers. 
Oncall here is even more complicated because who’s on call will be an heterogeneous group of people with all different skillsets, rather than all data engineers (where you have an expectations of what their skillsets are). + +# Common issues in pipelines + +- Skewed pipelines that OOM +- Missing data / schema change of upstream data +- Backfill needs to trigger downstream data sets +- Business questions about where and how to use data + +## How to fix skew + +Skew happens when in a GROUP BY or JOIN there’s a key with **a lot more data**. Imagine you did a viral post that receives 100M likes, and for every like you get also a notification. Whatever executor is gonna get that data in Spark, is probably going to choke and die. + +There’s 3 ways to fix this: + +- Best option: **upgrade to Spark 3** and enable adaptive query execution! +- 2nd best option: bump up the executor memory and hope it’s not more skewed later +- 3rd best option: update the job to include a [skew join salt](https://medium.com/curious-data-catalog/sparks-salting-a-step-towards-mitigating-skew-problem-5b2e66791620) + - In this technique, you use random numbers to make it so that if you have a skewed row, you split it up first and then aggregate, so that all that data doesn’t get shipped to 1 single executor. + +You can also get OOM not related to skew, although that’s rare, and usually it’s a symptom of somebody writing the job wrong (e.g. having a `.collect()` somewhere, which collects all data in the driver) + +## Missing data / schema change + +- Pre-check your upstream data! + - Especially if it’s data you don’t trust, like from a 3rd party API or data that can change anytime + - Have the DQ checks **BEFORE** the pipeline runs! → If the checks fail, the pipeline won’t even start +- Fixed in a collaborative manner: + - If it’s someone at your company → find the upstream owner and ask them to fix the problem + - If it’s 3rd party API → quite messy, most times you won’t be able to talk to them. In that case you have to change the code. + - They might also just turn off their API and in that case you’re kinda screwed + +You want to be aware if these situations are one-offs or repeat over time. It’s important to have runbooks and proper oncall handoffs, so you can identify pattern for failures and solve the root cause of these failures, rather than applying band-aids every time something breaks. + +## Backfill triggers downstream pipelines + +These are one of the most painful parts of DE. + +**For small migrations** + +- Do a parallel backfill into `table_backfill` +- If everything looks good, do the swap +- Rename `production` to `production_old` +- Rename `table_backfill` to `production` + +You want to do a parallel backfill into a separate table, like `table_backfill`, and then validate it and make sure that the data looks good. If it looks good, do the swap, where you rename the current production table to `production_old`, and `table_backfill` → `production`. + +However, if you’re working with **pipelines that have a lot of downstreams**, you can’t do it that way. Because you have to migrate hundreds of people. In this case, you want backfill into a table like `table_v2`, and then you build a parallel pipeline next to production that fills `v2`, and then you encourage everyone to move over to `v2`. After all the references to the prod table are migrated to `v2`, then you can drop `production`, and rename `table_v2` to `production` (and now everyone needs to update their reference again). 
It’s quite painful, but if you don’t do this, your table names are going to become more and more weird. + +In short: + +- Build a parallel pipeline that populates `table_v2` while `production` gets migrated +- After all references to `production` have been updated, drop it and rename `table_v2` (and all its references) to `production`. + +--- + +Here’s a flowchart to illustrate a common backfill pattern, mostly fitting the 1st scenario described above. + +![image.png](images/d1le_image%203.png) + +# Business questions + +Another thing that can happen, if you own data sets, people are gonna ask questions about them (how to query, info, etc…). You might have a channel where people post questions, most of the time you don’t wanna set people expectations (like I’m gonna answer within x minutes). + +- Set an SLA on when you’ll get back to people. Maybe it’s 2 hours maybe it’s a day. + - The point of SLA is to give you breathing room and not feeling you have to immediately respond when you’re on call, so you can still have a life while you’re on call. +- You’re gonna get the same questions over and over again → consolidate them and put them in the documents. +- Is the “business question on call” the same as the “pipeline oncall”? This comes back to “who is owning stuff”. In other words, if you’re oncall, are you actually supposed to be answering business questions too? Try to not do this when oncall, and instead loop in your analytics partners! diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 2 - Lecture.md b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 2 - Lecture.md new file mode 100644 index 00000000..dfe4b4d4 --- /dev/null +++ b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/Day 2 - Lecture.md @@ -0,0 +1,223 @@ +# Day 2 - Lecture + +# What are signals of tech debt in data engineering + +- Painful pipelines that break or are delayed +- Large cloud bills +- Multiple sources of truth +- Unclear data sets that aren’t documented + +Each of these has their fixes which we’re gonna go over. + +## Path forward for painful pipelines + +Imagine you’ve got a pipeline that’s breaking often and ruins your oncall time and makes your life miserable. + +- The only thing better than **optimized** **is****deprecated!** → Sometimes, deprecating is the right play! +- Is there anything that can be done? + - Migrate to new technology + - Like migrating from hive to Spark. Often sees an order of magnitude increase in efficiency. Especially for pipelines that use high-cardinality `GROUP BY` and/or `JOIN`. Remember that HIVE does everything on disk, so while Spark is more efficient, it’s less reliable/resilient and can OOM, whereas HIVE can’t. + - Moving to streaming: that should be thought about a lot. In some cases this might be more reliable as it processed data as it comes in, so the memory footprint of the job can be more consistent, and you don’t have big spikes in memory footprint (which are what make the job more likely to OOM). + - Better data modeling → fewer pipelines running + - Bucketing? + - Sampling? → A lot of time you just need directionality, not every single record. So with a 1% sample now you get a 100x reduction in the complexity of the pipeline. +- Deprecating without a replacement can be a hard sell! (Although, sometimes, that’s the only option left.) + - Deprecating is an option to consider especially if you inherit a pipeline. Question the value! Ask yourself “is this worth it?”. Maybe it’s not, maybe it shouldn’t even exist. 
+ +These solutions seem straightforward and plain, but they’re smart! If the pipeline is a burden and not all too necessary, they’re very worth to consider. + +### Sampling + +**When should you sample?** + +If directionality is what you care about, sampling is good enough. Make sure to **consult a data scientist** on this to guarantee you get the best sample of the data set → It’s very important to get a random sample. + +In Zach’s example, they were working with a pipeline that would process 100s of TBs per hour. It’s understandable that’s absolutely reasonable here to sample it. + +Another thing to consider is HOW you sample, e.g.: + +- A small % of all the requests across the board +- All the requests, but for a small % of users + +You’ll get the same reduction, but that makes a big different on the dataset itself and what you’re trying to solve for. + +When should you not sample? + +If you need the entire dataset for auditing purposes, sampling won’t work! + +### Bucketing + +Consider bucketing when + +- You have an expensive high-cardinality `JOIN` or `GROUP BY` → When you write the datasets, bucket them on the JOIN or GROUP BY key. + - The point is that past a certain size of data (like 10+ TB), shuffle operations just break, so you want to resort to alternatives + - Bucketing allows you to avoid the shuffle when doing those operations. + +However, this doesn’t mean everything should be bucketed. On small data, it’s probably not worth it. Remember that reading from many different files takes time. + +## Large cloud bills + +IO is usually the number one cloud cost! + +→ Moving data around from A to B is likely the most expensive part (e.g. imagine you have large master data and hundreds of pipelines reading from it). + +→ At the bottom of the tree, where a pipeline only feeds a dashboard for instance, IO is not the biggest cost anymore. + +Followed by compute and then storage! + +→ Compute is generally fixed (unless you use serverless…), cause you rent a certain capacity and that’s it. + +→ Storage is just really cheap in general + +Too much IO can happen because: + +- Duplicative data models +- Inefficient pipelines → use cumulative table design! + - Imagine you’re doing a monthly active users pipeline, and you’re reading the last 30 days of fact data instead of doing cumulative table design. You’re scanning 30 days where you could be scanning just one day! +- Excessive backfills + - You did a backfill too quickly without validating the data, and then it’s wrong data, and then you need to do it all over again + - Backfill can be very expensive, imagine having to run the query over years and years of data +- Not sampling + - Using the entire dataset when you can just use a small fraction of it +- Not subpartitioning correctly (predicate pushdown is your friend) + - Subpartitioning is great when you have other low cardinality column you can split the data on. It allows you to just completely ignore certain data altogether. 
+ + ![Screenshot 2025-02-01 232312.jpg](images/d2le_Screenshot_2025-02-01_232312.jpg) + +--- + +Large IO and compute costs are correlated by: + +- Scanning too much data (use cumulative tables plz) +- O(n^2) algorithms applied in UDFs (nested loops are usually bad) + +Large IO and storage costs are correlated by: + +- Not leveraging Parquet file format effectively +- Duplicative data models (having multiple definitions for the same thing) + +## Multiple sources of truth + +This is probably where you can have the most impact as a data engineer; and it makes you more efficient, and it makes your job easier, and it makes the maintenance better. + +- This is some of the hardest (but most impactful) work for data engineers + +Steps: + +- Document all the sources and the discrepancies +- Understand from stakeholders why they needed something different +- Build a spec that outlines a new path forward that can be agreed upon + +It’s very common, as an org gets bigger and bigger and the data needs increase, that due to the lack of data engineers, people are gonna solve their data problems by themselves, without caring too much about quality or if the dataset already exists somewhere else, and then they’ll keep doing that. + +This work is complicated because: + +- When people do these multiple sources and pipelines that produce the same thing, they might have different variations of the same definition → that means you have to convince somebody that their definition is wrong/not-correct. +- You wanna get all stakeholders in a room, and talk it out and see how to address it. +- Sometimes, all existing sources of truth might be wrong and you have to define a new one. + +Generally speaking, you don’t find all existing sources of truth via code search or grep, because people can name their data whatever they want, so it’s always suggested to talk to the relevant stakeholders. It can also can help you understand why they’re using this data, and the pains of using it, and all sorts of things that will give you a better view of the situation. Also if you take a pipeline out of someone’s plate, they’ll love you because it means they won’t have to manage it anymore. + +After you’ve done all of this, that’s when you want to build a new spec for the path forward, and have the stakeholders agree to it, BEFORE you build anything. + +## Document all the the sources and the discrepancies + +If your company is really ahead of the game they might have lineage, so that you can go up to the source data. + +- Talk with all relevant stakeholders! +- If you can code search for similar names, that works great +- Lineage of the shared source data is a good place to start too + +> Microsoft Fabric seemingly gives lineage out of the box +> + +### Understand from stakeholders why they needed something different + +- There’s usually a reason why there’s multiple datasets +- Could be organizational, ownership, technical, skills, etc… +- Work with stakeholders to find an ownership model that works + +A key point is understanding how we have gotten to this certain point. E.g. sometimes a team doesn’t trust what another team built, and so they rebuild it themselves. Another time is when a team doesn’t have the bandwidth to do a certain thing, so another team does it. 
+ +If you can solve the organizational problems, then you stop the bleeding, you stop this problem from proliferating, and get people to trust each other more, and more consolidated datasets + +# Build a pipeline spec + +- Capture all the needs from stakeholders +- Get all the multiple sources of truth owners to sign off (**this helps a lot with migrations later!!!**) + +Once they sign off, when it’s time for them to actually move and migrate their downstream pipeline, then you can kinda bully them into doing it because you can say “yo, you signed that you were gonna do it”. Migrations are so painful, and having a document like this can be very helpful in having people actually do them. + +# Models for getting ahead of tech debt + +- Fix “as you go” + - This sounds a bit too good to be true, it’s kinda like saying “as you’re flying, fix the plane!” +- Allocate a portion of time each quarter **(often called tech excellence week)** + - Zach’s favorite + - Also the heaviest, but probably the only one taken seriously + - One of the problems with this is that you don’t have the “fix as you go at all”, so at the end of the quarter there’s a lot more tech debt in the codebase + - It’s like saying “I’m gonna save up all the teeth brushing, and then do it once for 90 minutes at the end of the quarter” + - During that week, people don’t have time for anything else → it amounts to 1 month a year +- Have the on-call person focus on tech debt during their shift + - The cool thing about that is that they’re very aware of what the most urgent things are + - Contrary to the “dedicated week”, unfortunately it’s not everybody participating, and sometimes the on-call person just doesn’t do it + +![image.png](images/d2le_image.png) + +# Data migration models + +- The extremely cautious approach + - Be careful not to break anything + - Parallel pipelines for months → expensive cause you’re paying 2x for the pipelines for the duration of the migration +- The bull in a china shop approach + - Efficiency wins, minimize risk by migrating high-impact pipelines first + - Kill the legacy pipeline as soon as you can + +When Zach was at Airbnb, they used to use the first approach, but it was lasting too long, so leadership came out and said “you’re all taking too long”, as people were too cautious about breaking other people stuff. + +What happened later is that instead, people were given a month to migrate, and if they don’t migrate in a month, you start breaking stuff, so that there weren’t these crazy long periods of time where migrations would take forever, and people would deprioritize moving the tables and doing backfills etc… + +This is the most boring part of DE, so people don’t wanna do it unless they have to, and the way to force them to do it is by deleting the old pipeline, and you make it so that their pipeline can’t run anymore, because their upstream data doesn’t exist, and the only way to make their pipeline run again is by migrating. + +Both approaches have benefits and cons, obviously, although Zach prefers the 2nd style, as migration is not usually a lot of work to do, and people tend to slack off. + +# Proper oncall responsibilities + +- Set proper expectations with your stakeholders! 
+
+# Proper oncall responsibilities
+
+- Set proper expectations with your stakeholders! → **The most important part of on call**
+    - If you get this right, on-call can be a breeze
+    - When Zach was at Airbnb, initially the expectation was that any bug would be troubleshot within 4 hours; after 6 months Zach was like “fuck this, we’re doing 24 hours”, as the 4-hour turnaround had no impact downstream but made DEs’ lives miserable.
+    - It’s within your power to change the expectations of on-call rotations that you inherit
+- **DOCUMENT EVERY FAILURE AND BUG** (it’s a huge pain short-term but a great relief long-term!)
+- Oncall handoff
+    - Usually oncall lasts a week, and then you pass it to another person
+    - There should be a 20-30 minute sync to pass context from one oncall to the next. Context means things like anything that broke, or is currently broken, etc…
+
+# Runbooks
+
+- Complex pipelines need runbooks (which should be linked in the spec too)
+    - Not all pipelines need runbooks.
+    - You only need them if you have lots of inputs, or outputs used by a lot of teams, or complicated logic, or complex DQ checks, etc…
+
+**What’s the difference between a runbook and a spec?** Roughly: the spec is the design you agree on with stakeholders before building; the runbook is the operational doc you reach for when the pipeline breaks.
+
+Most important pieces of a runbook:
+
+- Primary and secondary owners → if you’re oncall and something breaks and you have no idea how to fix it, who do you call?
+- Upstream owners → **teams, not individuals**
+    - This is in case DQ checks on upstream data fail, so you need to call them and tell them that the data sucks
+- Common issues (if any)
+    - And how to troubleshoot them (assuming they can’t be fixed long-term yet)
+- Critical downstream owners → **teams, not individuals**
+    - People you need to notify when something happens
+    - Doesn’t need to be all of them, just the most important ones
+- SLAs and agreements
+    - Usually these are expressed as a number of hours/days by which the data is expected to arrive
+    - This is an agreement between you and your stakeholders that says “the data is not late until X hours after midnight”, which avoids a lot of unnecessary questions (a small sketch of such a check follows at the end of this section)
+
+**Upstream and downstream owners**
+
+![image.png](images/d2le_image%201.png)
+
+You should have a regular (monthly / quarterly) 1-to-1 meeting with both upstream and downstream owners, just to get everyone on the same page, understand where they’re trying to go, and see if these datasets can be improved. This is good because you develop a better connection with them and the business, and a nicer relationship with them, which makes your life easier over time.
+
+You definitely wanna do relationship building with these people.
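+
+For the SLA bullet above, here’s a minimal sketch of what such a check could look like in SQL. It assumes a hypothetical audit table `pipeline_runs(pipeline_name, ds, landed_at)` that records when each daily partition landed; none of these names come from the lecture.
+
+```sql
+-- flag partitions that landed later than the agreed "X hours after midnight"
+SELECT pipeline_name, ds, landed_at
+FROM pipeline_runs
+WHERE pipeline_name = 'fct_daily_events'      -- hypothetical pipeline
+  AND landed_at > ds + INTERVAL '9 hours'     -- SLA: not late until 9am UTC
+ORDER BY ds DESC;
+```
+
+A check like this can run on a schedule and notify the on-call person only when the agreement is actually breached, which is exactly the “avoid unnecessary questions” benefit mentioned above.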
diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 1.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 1.png new file mode 100644 index 00000000..c290fd1e Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 1.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 2.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 2.png new file mode 100644 index 00000000..cfad4a4d Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 2.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 3.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 3.png new file mode 100644 index 00000000..76f511f4 Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image 3.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image.png new file mode 100644 index 00000000..1121dd6a Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d1le_image.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_Screenshot_2025-02-01_232312.jpg b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_Screenshot_2025-02-01_232312.jpg new file mode 100644 index 00000000..5233553b Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_Screenshot_2025-02-01_232312.jpg differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image 1.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image 1.png new file mode 100644 index 00000000..eaa49d8e Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image 1.png differ diff --git a/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image.png b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image.png new file mode 100644 index 00000000..365eb732 Binary files /dev/null and b/bootcamp/materials/6-data-pipeline-maintenance/markdown_notes/images/d2le_image.png differ