Merged
2 changes: 1 addition & 1 deletion README.md
@@ -130,7 +130,7 @@ You can find the rubric under the [Assignment](https://courseworks2.columbia.edu
| 5 | 2/19 | [Automated testing](lectures/lecture_05.md) | [Readings](readings/week_05.md), [Project Part 2](docs/project.md#part-2) | [Data profiling/quality](labs/lab_05.md) | [Lab 4](labs/lab_04.md) |
| 6 | 2/26 | [Organizing code](lectures/lecture_06.md) | [Readings](readings/week_06.md), [Project Part 3](docs/project.md#part-3) | [Continuous integration](labs/lab_06.md) | [Lab 5](labs/lab_05.md) |
| 7 | 3/5 | [Databases](lectures/lecture_07.md) | [Readings](readings/week_07.md) | [Databases](labs/lab_07.md) | [Lab 6](labs/lab_06.md) |
| 8 | 3/12 | [Guest speaker; data warehousing](lectures/lecture_08.md) | [Project Part 4](docs/project.md#part-4) | [Data loading](labs/lab_08.md) | [Lab 7](labs/lab_07.md) |
| 8 | 3/12 | [Data warehousing](lectures/lecture_08.md) | [Project Part 4](docs/project.md#part-4) | [Data loading](labs/lab_08.md) | [Lab 7](labs/lab_07.md) |
| 9 | 3/19 | none ([Spring Recess][recess]) | none | none ([Spring Recess][recess]) | none |
| 10 | 3/26 | [Data engineering (ETL)](lectures/lecture_10.md) | [Project Part 5](docs/project.md#part-5) | [Data loading, continued](labs/lab_10.md) | [Lab 8](labs/lab_08.md) |
| 11 | 4/2 | [Data engineering, continued (pipelines)](lectures/lecture_11.md) | [Readings](readings/week_11.md), [Project check-in](docs/project.md#check-in) | [Process mapping](labs/lab_11.md) | [Lab 10](labs/lab_10.md) |
23 changes: 1 addition & 22 deletions labs/lab_08.md
@@ -1,27 +1,6 @@
# Lab 8

**Goal:** Understand different methods of loading data

---

## Data loading

- Append load
- Trunc(ate) and load
- Incremental load

---

Let's say you were given access to a random table that uses one of the three data loading methods above. How would you tell which it was?

---

### Incremental load

The trick is avoiding duplicates. Your script might then need to say something like:

1. What's the latest timestamp in the database?
1. Pull data from the API that's more recent than that.
**Goal:** Practice data warehousing

---

49 changes: 21 additions & 28 deletions lectures/lecture_08.md
@@ -2,34 +2,6 @@

---

## _gestures at everything_

---

## Feedback

- Getting a lot of new information
- Don't understand where we're going

---

Next lecture, we'll zoom out.

---

## Guest speaker

> [John Paul Farmer](https://www.linkedin.com/in/johnpaulfarmer) served as the 3rd Chief Technology Officer of New York City, taking point on everything from broadband to digital services to AI. Prior to that, he spent a handful of years at Microsoft, building connections with cities and the civic tech community. Previously, he was Senior Advisor for Innovation in the White House Office of Science and Technology Policy under President Obama, where he co-founded and led the Presidential Innovation Fellows. He has also served as an adjunct associate professor at Columbia and a Fellow of the University of Pennsylvania’s Institute for Urban Research. Most recently, he served as President of a next-gen broadband technology company and is now the President of Smart City Expo USA.

---

## Intros

- Name
- What you're passionate about

---

## [Retro](../docs/project.md#retro)

Anything you'd like to share?
@@ -107,6 +79,27 @@ COMMIT;

---

## Data loading

- Append load
- Trunc(ate) and load
- Incremental load
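
The first two methods can be contrasted with a minimal sketch. It assumes a SQLite database and a hypothetical `readings` table with `ts` and `value` columns; neither comes from the course materials, and the function names are illustrative only.

```python
import sqlite3


def append_load(conn, rows):
    """Append load: insert every incoming row and keep what is already there."""
    with conn:
        conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)


def truncate_and_load(conn, rows):
    """Truncate and load: empty the table, then insert a full fresh copy."""
    with conn:  # one transaction: both steps commit together or not at all
        conn.execute("DELETE FROM readings")  # SQLite has no TRUNCATE statement
        conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)


# Hypothetical table and batch, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, value REAL)")
batch = [("2024-03-01", 1.0), ("2024-03-02", 2.0)]

# Loading the same batch twice with append load duplicates the rows.
append_load(conn, batch)
append_load(conn, batch)
append_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

# Truncate and load leaves exactly one copy of the batch.
truncate_and_load(conn, batch)
truncate_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Append load is the simplest to write but relies on the source never resending rows; truncate and load is always consistent with the source but rewrites the whole table on every run.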

---

Let's say you were given access to a random table that uses one of the three data loading methods above. How would you tell which it was?

---

### Incremental load

The trick is avoiding duplicates. Your script might then need to say something like:

1. What's the latest timestamp in the database?
1. Pull data from the API that's more recent than that.
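
The two steps above could be sketched as follows. This is one possible shape, not a definitive implementation: the `readings` table, its columns, and `fetch_from_api` are hypothetical stand-ins, and timestamps are assumed to be ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3


def fetch_from_api(since):
    """Hypothetical stand-in for an API call: return rows newer than `since`."""
    source = [
        ("2024-03-01T00:00:00", 1.0),
        ("2024-03-02T00:00:00", 2.0),
        ("2024-03-03T00:00:00", 3.0),
    ]
    return [row for row in source if since is None or row[0] > since]


def incremental_load(conn):
    # Step 1: what's the latest timestamp in the database? (None if empty)
    latest = conn.execute("SELECT MAX(ts) FROM readings").fetchone()[0]
    # Step 2: pull only data that is more recent than that.
    new_rows = fetch_from_api(latest)
    with conn:  # commit the new rows as a single transaction
        conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", new_rows)
    return len(new_rows)


# Hypothetical table with one row already loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, value REAL)")
conn.execute("INSERT INTO readings VALUES ('2024-03-01T00:00:00', 1.0)")
conn.commit()

first_run = incremental_load(conn)   # loads only the two newer rows
second_run = incremental_load(conn)  # nothing new, so nothing is inserted
total = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Running the load twice does not create duplicates because the second run finds nothing newer than the stored maximum. The sketch would miss late-arriving rows that carry an older timestamp, so a real pipeline might also need a unique key on the table.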

---

## [Project Part 5](../docs/project.md#part-5)

---