Goal: Practice data warehousing
- You'll write methods to load continuously updated data into a database.
- You'll set up scripts to perform each of the methods of data loading into DuckDB.
- You'll pair in your Lab group.
- Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
- We will be using Consumer Price Index data from the Philadelphia Federal Reserve.
- We have monthly observations (rows) and monthly vintages (columns)
| DATE | PCPI04M1 | PCPI04M2 | PCPI04M3 |
|---|---|---|---|
| 2003:09 | 185.0 | 185.1 | 185.1 |
| 2003:10 | 185.0 | 184.9 | 184.9 |
| 2003:11 | 184.6 | 184.6 | 184.6 |
| 2003:12 | 185.0 | 184.9 | 184.9 |
| 2004:01 | #N/A | 185.8 | 185.8 |
| 2004:02 | #N/A | #N/A | 186.3 |
- A revision of past data is released in February of each year.
- A revision released in year *t* can update the values in years *t*−5 to *t*−1.
Suppose your organization wants to maintain a database of CPI data
- Write a `get_latest_data` function that accepts a `pull_date` and returns the latest data available up to that date
  - For example, if the `pull_date` is 2004-01-15, the function should return the data from vintage `PCPI04M1`
- Write code that pulls the latest data at a given `pull_date` and loads it into a DuckDB database
  - You will implement each of the methods `append`, `trunc`, and `incremental`
- Loop over a range of `pull_date`s to simulate running the scripts on a daily basis
- Compare the performance of each method (consistency and speed)
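The `get_latest_data` function might be sketched as follows. This is a minimal illustration, not the required implementation: the file name `pcpi_vintages.csv` and the two-digit-year parsing of the `PCPIyyMm` column names are assumptions to adapt to the actual source file.

```python
from datetime import date

import pandas as pd

VINTAGE_CSV = "pcpi_vintages.csv"  # hypothetical file name for the vintage table


def get_latest_data(pull_date: date) -> pd.DataFrame:
    """Return the freshest vintage available on pull_date as columns (dates, cpi)."""
    raw = pd.read_csv(VINTAGE_CSV)  # pandas treats #N/A as missing by default
    usable = []
    for col in raw.columns:
        if not col.startswith("PCPI"):
            continue
        # Vintage columns look like PCPI04M1 = vintage of year 2004, month 1
        yy, month = col[4:].split("M")
        year = 2000 + int(yy) if int(yy) < 50 else 1900 + int(yy)
        vintage = date(year, int(month), 1)
        if vintage <= pull_date:
            usable.append((vintage, col))
    if not usable:
        raise ValueError(f"no vintage available on {pull_date}")
    latest = max(usable)[1]
    out = raw[["DATE", latest]].rename(columns={"DATE": "dates", latest: "cpi"})
    return out.dropna(subset=["cpi"]).reset_index(drop=True)
```

For example, `get_latest_data(date(2004, 1, 15))` would select the `PCPI04M1` column and drop the months that are still `#N/A` in that vintage.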
- Write out the usage and manual testing instructions as Markdown.
- We're doing this as documentation-driven development.
- What should the user expect to see in the table after running each script?
- Write the `get_latest_data` function.
  - This function should return only two columns: e.g. `dates` and `cpi`
  - All other code should interact with the source data only through this function
- Work through each method of data loading.
  - Include the type in the script and table names to keep them separate — something like `_append`, `_trunc`, `_inc`
  - Your code should accept a `pull_date` parameter and load the data up to that date
  - The script should be able to run multiple times without duplicating data
  - For incremental: a Python script may be easier than a SQL one
- On a notebook: simulate your organization running the scripts on a daily basis.
- Start from empty tables
- Loop over a range of `pull_date`s (e.g. 2000-01-01 to 2025-02-28) to simulate running the scripts on a daily basis.
  - If the loop takes way too long, use a shorter range
- Compare the performance of each method (data consistency and speed)
- Submit links to the pull request(s) via CourseWorks.
This is a simpler version of the previous task. You can submit work fulfilling either version of the task.
- Write out the usage and manual testing instructions as Markdown.
- We're doing this as documentation-driven development.
- What should the user expect to see in the table after running each script?
- Use the data in the file PCPI24M1.csv to initialize your database. This file contains inflation information as available in January 2024.
- Create a persistent database containing three tables: `cpi_append`, `cpi_trunc`, and `cpi_inc`.
  - For now these three tables look the same.
- Load additional inflation data contained in the file PCPI25M2.csv into your database. This file contains inflation data as available in February 2025. It contains additional observations and historical revisions with respect to the previous file.
- Use each method of data loading to update the tables in your database.
- Now your tables might look different.
- Discuss how the three methods differ.
- Submit links to the pull request(s) via CourseWorks.