(Single-Source Dataset With Generated Auxiliary Tables)
This project demonstrates core data-joining techniques in Pandas using only a
single source file: netflix_titles.csv.
To simulate real multi-table data engineering workflows, we generate two
additional datasets inside the notebook:
- credits: extracted from the director and cast columns
- ratings: synthetic user rating events for each title
This allows us to practice a wide range of Pandas merge operations without needing external files.
netflix-pandas-project/
│
├── data/
│ └── netflix_titles.csv # original dataset
│
├── notebook/
│ └── netflix_pandas_project.ipynb # full project notebook
│
└── README.md
This project focuses on mastering Pandas joins (left and one-to-many), aggregation, concatenation, and time-based merges.
These skills are fundamental for real-world data engineering, EDA, and ML pipelines, where datasets rarely live in one place.
The titles dataset (netflix_titles.csv) was loaded and prepared:
- Reviewed structure
- Parsed dates
- Cleaned key columns
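The preparation steps above could be sketched roughly as follows. A small inline frame stands in for data/netflix_titles.csv so the snippet runs on its own. The column names (show_id, title, date_added and so on) and the "Month day, year" date format are assumptions about the CSV, and the notebook's actual cleaning rules may differ.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("data/netflix_titles.csv");
# column names and date format are assumed, not taken from the notebook.
titles = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s3"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": [" Alpha ", "Beta", "Gamma", "Gamma"],
    "director": ["Dir A", None, None, None],
    "cast": ["Actor X, Actor Y", "Actor Z", None, None],
    "date_added": ["September 25, 2021", " August 1, 2019", None, None],
})

# 1. Review structure: dtypes, non-null counts, missing values per column.
titles.info()
print(titles.isna().sum())

# 2. Parse dates: strip stray whitespace, then convert; unparseable values become NaT.
titles["date_added"] = pd.to_datetime(
    titles["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# 3. Clean key columns: trim whitespace and keep one row per show_id.
titles["show_id"] = titles["show_id"].str.strip()
titles["title"] = titles["title"].str.strip()
titles = titles.drop_duplicates(subset="show_id").reset_index(drop=True)

print(titles)
```

Keeping show_id unique at this stage matters later, because every join in the project uses it as the key.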
The credits table is derived from these columns of the titles dataset:
- show_id
- title
- director
- cast
Rows with no director and no cast were removed to simulate missing metadata.
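One way to build such a credits table is shown below as a minimal sketch. The toy titles frame is hypothetical, and the filter assumes that "no director and no cast" means both values are missing on the same row.

```python
import pandas as pd

# Hypothetical toy version of the cleaned titles table.
titles = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "director": ["Dir A", None, None],
    "cast": ["Actor X, Actor Y", "Actor Z", None],
})

# Keep the credit-related columns, then drop rows where BOTH director and cast
# are missing (how="all"); rows with only one of them missing are kept.
credits = (
    titles[["show_id", "title", "director", "cast"]]
    .dropna(subset=["director", "cast"], how="all")
    .reset_index(drop=True)
)

print(credits)
```

Because some titles now have no credits row, a later left join from titles to credits produces missing values for them, which is the situation the project wants to practice.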
The ratings table is created using random sampling:
- 1–5 star user rating scores
- rating counts
- random rating_date between 2015–2021
One title may appear multiple times (one-to-many relationship).
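A synthetic ratings table of this shape might be generated as in the sketch below. The column names rating_score and rating_count, the 1–1000 range for counts, the number of events, and the seed are all illustrative assumptions; only rating_date and the 1–5 score range come from the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible

# Hypothetical title keys; in the notebook these would come from the titles table.
show_ids = np.array(["s1", "s2", "s3"])
n_events = 10

start = pd.Timestamp("2015-01-01")
end = pd.Timestamp("2021-12-31")
n_days = (end - start).days + 1  # number of calendar days in the window

ratings = pd.DataFrame({
    # Sampling with replacement lets one title appear many times (one-to-many).
    "show_id": rng.choice(show_ids, size=n_events),
    # integers() excludes the upper bound, so this yields scores 1..5.
    "rating_score": rng.integers(1, 6, size=n_events),
    "rating_count": rng.integers(1, 1001, size=n_events),
    # Random day offset from the start of the window.
    "rating_date": start + pd.to_timedelta(rng.integers(0, n_days, size=n_events), unit="D"),
})

print(ratings.sort_values("rating_date"))
```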
Examples:
- Left join: titles + credits
- One-to-many join: titles + ratings
- Aggregation: mean rating per title
- Concatenation: movies + TV shows
- Time-based merge: ratings + marketing signals
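The first three examples in the list can be sketched on toy data as follows. The frames and the rating_score column are hypothetical stand-ins for the notebook's tables; the validate and indicator arguments are optional safety checks added here, not necessarily what the notebook uses.

```python
import pandas as pd

# Hypothetical toy tables keyed on show_id.
titles = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Beta", "Gamma"],
})
credits = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "director": ["Dir A", None],
    "cast": ["Actor X, Actor Y", "Actor Z"],
})
ratings = pd.DataFrame({
    "show_id": ["s1", "s1", "s2"],
    "rating_score": [4, 2, 5],
})

# Left join: keep every title; titles without credits get NaN.
# indicator=True adds a _merge column showing where each row came from.
titles_credits = titles.merge(
    credits, on="show_id", how="left", validate="one_to_one", indicator=True
)

# One-to-many join: each title row repeats once per rating event.
titles_ratings = titles.merge(
    ratings, on="show_id", how="left", validate="one_to_many"
)

# Aggregation: collapse rating events back to one row per title, then join.
mean_rating = ratings.groupby("show_id", as_index=False).agg(
    mean_rating=("rating_score", "mean"),
    n_events=("rating_score", "size"),
)
titles_scored = titles.merge(
    mean_rating, on="show_id", how="left", validate="one_to_one"
)

print(titles_credits)
print(titles_ratings)
print(titles_scored)
```

The validate argument makes the merge raise an error if the key relationship is not what was expected, which is a cheap guard against accidental row duplication.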
- Joins with missing values illustrate real challenges in metadata pipelines.
- One-to-many joins require aggregation to return to a single-row-per-title structure.
- merge_asof() efficiently combines time-series data without exact matches.
- concat() is essential for dataset stacking and feature-level engineering.
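The time-based merge and the concatenation mentioned above can be illustrated with a short sketch. The marketing table, its campaign_date and campaign columns, and the toy values are hypothetical; the project only states that ratings are merged with marketing signals.

```python
import pandas as pd

# Hypothetical rating events and marketing signals with non-matching dates.
ratings = pd.DataFrame({
    "show_id": ["s1", "s1", "s2"],
    "rating_score": [4, 2, 5],
    "rating_date": pd.to_datetime(["2019-03-10", "2020-07-01", "2019-01-05"]),
})
marketing = pd.DataFrame({
    "campaign_date": pd.to_datetime(["2019-03-01", "2020-06-15"]),
    "campaign": ["spring_promo", "summer_promo"],
})

# merge_asof requires both inputs to be sorted on their time keys.
# direction="backward" attaches the most recent campaign at or before each rating.
ratings_marketing = pd.merge_asof(
    ratings.sort_values("rating_date"),
    marketing.sort_values("campaign_date"),
    left_on="rating_date",
    right_on="campaign_date",
    direction="backward",
)
print(ratings_marketing)

# concat stacks frames with the same columns row-wise.
movies = pd.DataFrame({"show_id": ["s1", "s3"], "type": "Movie", "title": ["Alpha", "Gamma"]})
tv_shows = pd.DataFrame({"show_id": ["s2"], "type": "TV Show", "title": ["Beta"]})
all_titles = pd.concat([movies, tv_shows], ignore_index=True)
print(all_titles)
```

A rating dated before the first campaign has nothing to match, so its campaign columns stay empty; a tolerance argument can additionally cap how far back a match may reach.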
In real analytics and ML work, most datasets come in multiple tables, and your ability to cleanly join and transform them determines:
- data quality
- model accuracy
- feature engineering efficiency
- project maintainability
This project builds the foundation for:
✔ Feature engineering in Week 3
✔ ML preprocessing (Week 4–5)
✔ End-to-end ML pipelines (Week 6–12)
After this project, you are prepared to:
- Aggregate large datasets
- Build ML-ready feature tables
- Implement time-series merges
- Clean and combine multi-source datasets
These skills directly connect to Week 3 EDA, Week 4 Supervised ML, and Week 5–6 Pipelines & Feature Engineering.
Nicolas
Data Science & Machine Learning Engineer in training