🎬 Netflix Pandas Mini-Project

Week 2 — Joining & Combining Data with Pandas

(Single-Source Dataset With Generated Auxiliary Tables)

This project demonstrates core data-joining techniques in Pandas using only a single source file: netflix_titles.csv.
To simulate real multi-table data engineering workflows, we generate two additional datasets inside the notebook:

credits — extracted from director and cast columns
ratings — synthetic user rating events for each title

This allows us to practice a wide range of Pandas merge operations without needing external files.

📂 Project Structure

netflix-pandas-project/
│
├── data/
│   └── netflix_titles.csv              # original dataset
│
├── notebook/
│   └── netflix_pandas_project.ipynb    # full project notebook
│
└── README.md

📘 Objectives

This project focuses on mastering:

✔ `merge()` — one-to-one & one-to-many joins

✔ `concat()` — vertical stacking

✔ `merge_asof()` — time-aware joins

✔ `merge_ordered()` — ordered merges

✔ synthetic feature creation

✔ aggregation after merging

These skills are fundamental for real-world data engineering, EDA, and ML pipelines, where datasets rarely live in one place.

🛠️ Steps Performed

1. Load & inspect `netflix_titles.csv`

Reviewed structure
Parsed dates
Cleaned key columns
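
The three sub-steps above could look roughly like the sketch below. It is not the notebook's actual code: the in-memory sample rows are hypothetical stand-ins for `data/netflix_titles.csv`, and the `date_added` column with its "Month day, year" format is an assumption about the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/netflix_titles.csv.
SAMPLE_CSV = io.StringIO(
    "show_id,type,title,director,cast,date_added\n"
    's1,Movie,Title A,Director One,"Actor X, Actor Y","September 25, 2021"\n'
    's2,TV Show,Title B,,"Actor Z","September 24, 2021"\n'
    "s3,Movie, Title C ,,,\n"
)

# In the project this would read the real file: pd.read_csv("data/netflix_titles.csv")
titles = pd.read_csv(SAMPLE_CSV)

# Review structure: dimensions, dtypes, and missing values per column.
print(titles.shape)
print(titles.dtypes)
print(titles.isna().sum())

# Parse dates: strip stray whitespace, coerce unparseable or missing values to NaT.
titles["date_added"] = pd.to_datetime(
    titles["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Clean key columns: trim the text keys and keep one row per show_id.
titles["show_id"] = titles["show_id"].str.strip()
titles["title"] = titles["title"].str.strip()
titles = titles.drop_duplicates(subset="show_id").reset_index(drop=True)
```

Using `errors="coerce"` keeps rows with a missing or malformed date instead of raising, which is usually what you want before a join.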

2. Generated a synthetic `credits` table

Derived from:

show_id
title
director
cast

Rows with neither a director nor a cast were removed, so some titles have no matching row in `credits`. This simulates missing metadata.
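
A minimal sketch of this derivation, assuming a small hypothetical `titles` frame in place of the real dataset. The exact shape of the notebook's `credits` table is not shown here, so treat the column layout as an illustration.

```python
import pandas as pd

# Hypothetical sample of the titles table; real values come from netflix_titles.csv.
titles = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3"],
        "title": ["Title A", "Title B", "Title C"],
        "director": ["Director One", None, None],
        "cast": ["Actor X, Actor Y", "Actor Z", None],
    }
)

# Keep only the columns the credits table is derived from.
credits = titles[["show_id", "title", "director", "cast"]].copy()

# Drop rows where BOTH director and cast are missing (how="all"),
# so those titles end up with no credits row at all.
credits = credits.dropna(subset=["director", "cast"], how="all").reset_index(drop=True)

print(credits)
```

Here `s2` survives because it still has a cast, while `s3` is dropped and will surface as an unmatched title in a later left join.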

3. Generated a synthetic `ratings` table

Created using random sampling:

user rating scores from 1 to 5 stars
rating counts
a random `rating_date` between 2015 and 2021

One title may appear multiple times (one-to-many relationship).
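
One way to generate such a table is sketched below. The column names `user_rating` and `rating_count`, the number of events per title, the count range, and the seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the synthetic data is reproducible

# Hypothetical title keys; in the project these would come from titles["show_id"].
show_ids = pd.Series(["s1", "s2", "s3"], name="show_id")

# Draw between 1 and 4 rating events per title, so a show_id can repeat.
events_per_title = rng.integers(1, 5, size=len(show_ids))
n_events = int(events_per_title.sum())

# Random day offsets covering 2015-01-01 through 2021-12-31 inclusive.
start = pd.Timestamp("2015-01-01")
n_days = (pd.Timestamp("2021-12-31") - start).days + 1
offsets = rng.integers(0, n_days, size=n_events)

ratings = pd.DataFrame(
    {
        "show_id": np.repeat(show_ids.to_numpy(), events_per_title),
        "user_rating": rng.integers(1, 6, size=n_events),     # 1 to 5 stars
        "rating_count": rng.integers(1, 1001, size=n_events),  # assumed range
        "rating_date": start + pd.to_timedelta(offsets, unit="D"),
    }
)

print(ratings.head())
```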

4. Performed joins

Examples:

Left join: titles + credits
One-to-many join: titles + ratings
Aggregation: mean rating per title
Concatenation: movies + TV shows
Time-based merge: ratings + marketing signals
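
The first four examples in this list can be sketched on miniature tables as follows. The frames and their values are hypothetical, and the `validate` and `indicator` arguments are optional safety checks rather than something the project is known to use.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
titles = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3"],
        "type": ["Movie", "TV Show", "Movie"],
        "title": ["Title A", "Title B", "Title C"],
    }
)
credits = pd.DataFrame({"show_id": ["s1", "s2"], "director": ["Director One", None]})
ratings = pd.DataFrame({"show_id": ["s1", "s1", "s2"], "user_rating": [4, 2, 5]})

# Left join: every title is kept; s3 has no credits row, so its director is missing.
titles_credits = titles.merge(
    credits, on="show_id", how="left", validate="one_to_one", indicator=True
)

# One-to-many join: each title repeats once per rating event.
titles_ratings = titles.merge(ratings, on="show_id", how="left", validate="one_to_many")

# Aggregation: collapse back to one row per title.
mean_rating = (
    titles_ratings.groupby("show_id", as_index=False)["user_rating"]
    .mean()
    .rename(columns={"user_rating": "mean_rating"})
)

# Concatenation: split by type, then stack the pieces vertically.
movies = titles[titles["type"] == "Movie"]
tv_shows = titles[titles["type"] == "TV Show"]
stacked = pd.concat([movies, tv_shows], ignore_index=True)
```

The `_merge` column added by `indicator=True` marks `s3` as `left_only`, which makes unmatched titles easy to count. `validate` raises a `MergeError` if the key relationship is not what you declared.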

📊 Key Insights

Joins with missing values illustrate real challenges in metadata pipelines: unmatched titles come back with empty credit fields.
One-to-many joins require aggregation to return to a single-row-per-title structure.
`merge_asof()` joins time-series data on the nearest key (by default the most recent earlier one) instead of requiring exact timestamp matches.
`concat()` stacks datasets that share a schema, such as the movies and TV shows subsets.
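
To make the time-aware joins concrete, here is a small sketch of `merge_asof()` next to `merge_ordered()`. The `campaigns` frame is a hypothetical stand-in for the marketing signals, and all column names and dates are assumptions.

```python
import pandas as pd

# Hypothetical rating events for one title.
ratings = pd.DataFrame(
    {
        "rating_date": pd.to_datetime(["2020-01-10", "2020-02-20", "2020-03-05"]),
        "user_rating": [4, 3, 5],
    }
)

# Hypothetical marketing signals; the column names are assumptions.
campaigns = pd.DataFrame(
    {
        "campaign_date": pd.to_datetime(["2020-01-01", "2020-03-01"]),
        "campaign": ["winter_promo", "spring_promo"],
    }
)

# merge_asof requires both frames to be sorted by their join key.
ratings = ratings.sort_values("rating_date")
campaigns = campaigns.sort_values("campaign_date")

# For each rating, attach the most recent campaign on or before rating_date.
asof = pd.merge_asof(
    ratings,
    campaigns,
    left_on="rating_date",
    right_on="campaign_date",
    direction="backward",
)

# merge_ordered does an ordered outer join on a shared key and can forward-fill gaps.
timeline = pd.merge_ordered(
    ratings.rename(columns={"rating_date": "date"}),
    campaigns.rename(columns={"campaign_date": "date"}),
    on="date",
    fill_method="ffill",
)
```

`merge_asof()` keeps one row per rating, while `merge_ordered()` returns the combined timeline of both tables in date order.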

💡 Why This Project Matters

In real analytics and ML work, data is usually spread across multiple tables, and how cleanly you join and transform those tables directly affects:

data quality
model accuracy
feature engineering efficiency
project maintainability

This project builds the foundation for:

✔ Feature engineering in Week 3
✔ ML preprocessing (Week 4–5)
✔ End-to-end ML pipelines (Week 6–12)

🚀 Next Steps

After this project, you are prepared to:

Aggregate large datasets
Build ML-ready feature tables
Implement time-series merges
Clean and combine multi-source datasets

These skills directly connect to Week 3 EDA, Week 4 Supervised ML, and Week 5–6 Pipelines & Feature Engineering.

📝 Author

Nicolas
Data Science & Machine Learning Engineer in training
