0xNic11/netflix-pandas-project

🎬 Netflix Pandas Mini-Project

Week 2 — Joining & Combining Data with Pandas

(Single-Source Dataset With Generated Auxiliary Tables)

This project demonstrates core data-joining techniques in Pandas using only a single source file: netflix_titles.csv.
To simulate real multi-table data engineering workflows, we generate two additional datasets inside the notebook:

  • credits — extracted from director and cast columns
  • ratings — synthetic user rating events for each title

This allows us to practice a wide range of Pandas merge operations without needing external files.


📂 Project Structure

netflix-pandas-project/
│
├── data/
│   └── netflix_titles.csv                 # original dataset
│
├── notebook/
│   └── netflix_pandas_project.ipynb       # full project notebook
│
└── README.md

📘 Objectives

This project focuses on mastering:

  • merge(): one-to-one & one-to-many joins
  • concat(): vertical stacking
  • merge_asof(): time-aware joins
  • merge_ordered(): ordered merges
  • synthetic feature creation
  • aggregation after merging

These skills are fundamental for real-world data engineering, EDA, and ML pipelines, where datasets rarely live in one place.


🛠️ Steps Performed

1. Loaded & inspected netflix_titles.csv

  • Reviewed structure
  • Parsed dates
  • Cleaned key columns
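
The loading step above could look roughly like the sketch below. It is not the notebook's actual code: the inline sample rows are invented so the snippet runs without the CSV, and the `date_added` and `type` column names and the "Month day, year" date format are assumptions about the dataset.

```python
import io
import pandas as pd

# Tiny stand-in for data/netflix_titles.csv (rows are made up for illustration).
sample_csv = io.StringIO(
    "show_id,type,title,director,cast,date_added\n"
    's1,Movie,Title A,Dir One,"Actor X, Actor Y","September 25, 2021"\n'
    's2,TV Show,Title B,,"Actor Z"," August 1, 2020"\n'
    "s3,Movie, Title C ,Dir Two,,\n"
)
titles = pd.read_csv(sample_csv)

# Review structure: row/column counts and inferred dtypes.
print(titles.shape)
print(titles.dtypes)

# Parse dates: strip stray whitespace, then coerce anything unparseable to NaT.
# The format string is an assumption about how date_added is written.
titles["date_added"] = pd.to_datetime(
    titles["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Clean key columns: trim whitespace and confirm the join key is unique.
titles["show_id"] = titles["show_id"].str.strip()
titles["title"] = titles["title"].str.strip()
assert titles["show_id"].is_unique
```

With the real file, `pd.read_csv("data/netflix_titles.csv")` would replace the in-memory sample.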

2. Generated a synthetic credits table

Derived from:

  • show_id
  • title
  • director
  • cast

Rows with no director and no cast were removed, so some titles have no match in credits. This simulates missing metadata.
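
One way to derive such a table is sketched below. The long layout (one row per person, with hypothetical `role` and `person` columns) is an assumption, since the README does not say how credits is shaped, and the sample rows are invented.

```python
import pandas as pd

# Made-up stand-in for the titles table; s4 has neither director nor cast.
titles = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "title": ["Title A", "Title B", "Title C", "Title D"],
    "director": ["Dir One", None, "Dir Two", None],
    "cast": ["Actor X, Actor Y", "Actor Z", None, None],
})

# Remove rows with no director and no cast.
has_people = titles["director"].notna() | titles["cast"].notna()
credits_wide = titles.loc[has_people, ["show_id", "title", "director", "cast"]]

# Reshape to long form: one row per (show_id, role, raw name string).
credits = credits_wide.melt(
    id_vars=["show_id", "title"],
    value_vars=["director", "cast"],
    var_name="role",        # hypothetical column name
    value_name="person",    # hypothetical column name
).dropna(subset=["person"])

# Split comma-separated names and give each person their own row.
credits["person"] = credits["person"].str.split(",")
credits = credits.explode("person")
credits["person"] = credits["person"].str.strip()
credits = credits.sort_values(["show_id", "role"]).reset_index(drop=True)
```

Because s4 never enters credits, a later left join from titles leaves its credit columns empty, which is the missing-metadata case the step describes.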

3. Generated a synthetic ratings table

Created using random sampling:

  • 1–5 star user rating scores
  • rating counts
  • random rating_date between 2015–2021

One title may appear multiple times (one-to-many relationship).
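
A minimal sketch of generating such a table with NumPy's random generator follows. Only `rating_date` is named in this README; the `user_rating` and `rating_count` column names, the count range, the number of events, and the seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the table is reproducible

show_ids = np.array(["s1", "s2", "s3"])  # stand-in for titles["show_id"]
n_events = 10                            # hypothetical number of rating events

# Every calendar day from 2015 through 2021 (inclusive end is an assumption).
days = pd.date_range("2015-01-01", "2021-12-31", freq="D")

ratings = pd.DataFrame({
    # Sampling with replacement lets one title appear several times.
    "show_id": rng.choice(show_ids, size=n_events),
    # integers() excludes the upper bound, so this yields 1..5.
    "user_rating": rng.integers(1, 6, size=n_events),
    # Hypothetical range for how many users contributed to the event.
    "rating_count": rng.integers(10, 1000, size=n_events),
    # Pick a random day by position.
    "rating_date": days[rng.integers(0, len(days), size=n_events)],
})
```

Sampling `show_id` with replacement is what produces the one-to-many relationship between titles and ratings.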

4. Performed joins

Examples:

  • Left join: titles + credits
  • One-to-many join: titles + ratings
  • Aggregation: mean rating per title
  • Concatenation: movies + TV shows
  • Time-based merge: ratings + marketing signals
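
The first four examples can be sketched on toy tables as below. The data and the `type`, `user_rating`, and `mean_rating` column names are assumptions, and the `validate` and `indicator` arguments are optional safeguards rather than something the notebook is known to use.

```python
import pandas as pd

# Made-up tables; s3 has no credits and no ratings.
titles = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Title A", "Title B", "Title C"],
    "type": ["Movie", "TV Show", "Movie"],
})
credits = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "director": ["Dir One", "Dir Two"],
})
ratings = pd.DataFrame({
    "show_id": ["s1", "s1", "s2"],
    "user_rating": [4, 2, 5],
})

# Left join: every title is kept; unmatched titles get NaN in credit columns.
# indicator=True adds a _merge column showing where each row came from.
titles_credits = titles.merge(
    credits, on="show_id", how="left", validate="one_to_one", indicator=True
)

# One-to-many join: a title repeats once per rating event.
titles_ratings = titles.merge(
    ratings, on="show_id", how="left", validate="one_to_many"
)

# Aggregation: collapse ratings to one row per title, then join back.
mean_rating = (
    ratings.groupby("show_id", as_index=False)["user_rating"]
    .mean()
    .rename(columns={"user_rating": "mean_rating"})
)
titles_scored = titles.merge(
    mean_rating, on="show_id", how="left", validate="one_to_one"
)

# Concatenation: split by type, then stack the pieces vertically.
movies = titles[titles["type"] == "Movie"]
tv_shows = titles[titles["type"] == "TV Show"]
stacked = pd.concat([movies, tv_shows], ignore_index=True)
```

Passing `validate` makes pandas raise a `MergeError` if the keys are not as unique as expected, which catches accidental row duplication early.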

📊 Key Insights

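  • A left join against an incomplete credits table leaves NaN in the credit columns of unmatched titles, the kind of gap real metadata pipelines must handle.
  • One-to-many joins require aggregation to return to a single-row-per-title structure.
  • merge_asof() matches each row to the nearest key (by default, the most recent earlier one), so time-series tables can be combined without exact timestamp matches. Both inputs must be sorted on the key.
  • concat() stacks datasets row-wise (axis=0) and can also place feature columns side by side (axis=1).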
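
To make the time-aware merges concrete, here is a small sketch. The marketing table, its `campaign_date` and `campaign` columns, the `user_rating` column, and the 30-day tolerance are all hypothetical, as the README names the marketing signals but not their layout.

```python
import pandas as pd

# Made-up rating events and marketing signals.
ratings = pd.DataFrame({
    "rating_date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-01"]),
    "user_rating": [4, 3, 5],
})
marketing = pd.DataFrame({
    "campaign_date": pd.to_datetime(["2020-01-01", "2020-01-15"]),
    "campaign": ["launch", "promo"],
})

# merge_asof requires both inputs to be sorted on their keys.
ratings = ratings.sort_values("rating_date")
marketing = marketing.sort_values("campaign_date")

# For each rating, take the most recent campaign at or before the rating date,
# but only if it started within the previous 30 days.
with_campaign = pd.merge_asof(
    ratings,
    marketing,
    left_on="rating_date",
    right_on="campaign_date",
    direction="backward",
    tolerance=pd.Timedelta(days=30),
)

# merge_ordered interleaves both tables on a shared, sorted key (outer join by
# default) and can forward-fill the gaps that the interleaving creates.
timeline = pd.merge_ordered(
    ratings.rename(columns={"rating_date": "date"}),
    marketing.rename(columns={"campaign_date": "date"}),
    on="date",
    fill_method="ffill",
)
```

The third rating falls more than 30 days after the last campaign, so its `campaign` value stays missing instead of being matched to a stale signal.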

💡 Why This Project Matters

In real analytics and ML work, most datasets come in multiple tables, and your ability to cleanly join and transform them determines:

  • data quality
  • model accuracy
  • feature engineering efficiency
  • project maintainability

This project builds the foundation for:

✔ Feature engineering in Week 3
✔ ML preprocessing (Week 4–5)
✔ End-to-end ML pipelines (Week 6–12)


🚀 Next Steps

After this project, you are prepared to:

  • Aggregate large datasets
  • Build ML-ready feature tables
  • Implement time-series merges
  • Clean and combine multi-source datasets

These skills directly connect to Week 3 EDA, Week 4 Supervised ML, and Week 5–6 Pipelines & Feature Engineering.


📝 Author

Nicolas
Data Science & Machine Learning Engineer in training

About

Pandas mini-project using Netflix titles. Only one dataset is provided (`netflix_titles.csv`), so auxiliary tables (`credits`, `ratings`) are generated programmatically to simulate real multi-table data engineering workflows. Includes examples of merge(), concat(), merge_asof(), and merge_ordered().
