
A personalized movie recommendation system 🎥 powered by PySpark ⚡ using collaborative filtering 🤝 to deliver spot-on suggestions based on user behavior 📊. Built for scale. Made for binge-watchers. 🍿


gnevercodes/PySparkFlicks_MovieRecommender


🎬 Personalized Movie Recommendation System using PySpark & Collaborative Filtering

📌 Project 1 of 6 | Pushed as part of my academic + real-world ML portfolio 🚀

🧠 Overview

In the ever-growing jungle of streaming content, users often get lost in endless scrolling and mediocre suggestions. This project tackles that problem with a personalized movie recommendation system built on collaborative filtering and Apache Spark, designed to process large datasets and suggest movies based on user behavior.

📈 Key Features

  • 💡 Personalized suggestions based on user-item interaction
  • ⚡ Built with PySpark on Apache Spark for large-scale performance
  • 🧪 Evaluated using RMSE, precision, and recall
  • 🤝 Scalable, fast, and adaptable to various streaming platforms
  • 🔒 Acknowledges bias and privacy issues in recommender systems
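The three evaluation metrics named above can be sketched in plain Python. This is a minimal illustration, not the repo's evaluation code: the function names `rmse` and `precision_recall_at_k`, and the choice of a top-k cutoff for precision and recall, are assumptions made for this example.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids, best first
    relevant:    set of item ids the user actually liked
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

RMSE measures how far predicted ratings are from the true ones, while precision and recall measure how many of the top-k recommended movies the user actually liked, so the two kinds of metric answer different questions about the same model.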

🛠️ Tech Stack

  • Language: Python
  • Frameworks: PySpark, Apache Hadoop (HDFS)
  • Tools: MLlib, Jupyter, VS Code
  • Algorithm: User-based Collaborative Filtering
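To make the idea of user-based collaborative filtering concrete, here is a small pure-Python sketch. It assumes cosine similarity over co-rated movies and a similarity-weighted average for prediction; the repo may use a different similarity measure or neighborhood size, and the names `cosine_similarity` and `predict_rating` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    ratings: dict mapping user id -> {item id: rating}
    Returns None when no other user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None
```

Users whose past ratings look like the target user's get more say in the prediction. On a real dataset this all-pairs comparison is the expensive part, which is where a distributed engine such as Spark becomes useful.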

📂 Dataset

  • Contains over 8,000 user interactions and movie ratings
  • Publicly sourced, includes diverse genres, languages, and release years
  • Preprocessing steps include handling nulls, normalization, and outlier removal
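The preprocessing steps listed above could look roughly like the sketch below. The README does not say which strategies were used, so the choices here are assumptions: rows with missing values are dropped, ratings outside a valid range are treated as outliers, and the remaining ratings are min-max scaled to [0, 1]. The default range of 0.5 to 5.0 matches the MovieLens rating scale; the function name `clean_ratings` is hypothetical.

```python
def clean_ratings(rows, min_rating=0.5, max_rating=5.0):
    """Drop incomplete or out-of-range rows, then min-max scale the ratings.

    rows: iterable of (user_id, movie_id, rating) tuples; any field may be None
    Returns a list of (user_id, movie_id, scaled_rating) with ratings in [0, 1].
    """
    cleaned = []
    for user_id, movie_id, rating in rows:
        # Handling nulls: skip rows with any missing field (assumed strategy).
        if user_id is None or movie_id is None or rating is None:
            continue
        # Outlier removal: skip ratings outside the valid scale (assumed rule).
        if not (min_rating <= rating <= max_rating):
            continue
        # Normalization: min-max scale onto [0, 1].
        scaled = (rating - min_rating) / (max_rating - min_rating)
        cleaned.append((user_id, movie_id, scaled))
    return cleaned
```

In PySpark the same logic would typically be expressed with DataFrame operations rather than a Python loop, but the order of the steps stays the same.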

📊 Dataset

This project uses the MovieLens 20M dataset (https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset), which contains millions of user-movie interactions, ratings, and metadata.

For quick testing, a sample dataset (netflix_titles.csv) is included in the /data folder.

To use the full dataset:

  1. Sign in to Kaggle
  2. Download the dataset from the link above
  3. Place it in the root directory or update the path in the code accordingly
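Once the file is in place, loading it might look like the sketch below. The file name `rating.csv` and the column names `userId`, `movieId`, and `rating` are assumptions based on the usual layout of the MovieLens ratings file; check them against the download and adjust `RATINGS_PATH` to wherever the file was placed.

```python
import csv

# Assumed location and file name; update to match your setup.
RATINGS_PATH = "rating.csv"


def load_ratings(lines):
    """Parse CSV lines (header first) into (user_id, movie_id, rating) tuples.

    lines: any iterable of strings, such as an open file object
    """
    reader = csv.DictReader(lines)
    return [
        (int(row["userId"]), int(row["movieId"]), float(row["rating"]))
        for row in reader
    ]


def load_ratings_file(path=RATINGS_PATH):
    """Open `path` and parse it with load_ratings."""
    with open(path, newline="") as handle:
        return load_ratings(handle)
```

Reading the full file into a Python list is only practical for small samples; for the complete dataset the file would normally be read with Spark instead.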

📊 Results

  • Achieved RMSE = 3.7725 on our baseline implementation
  • Compared against a benchmark paper that reports RMSE = 1.0742
  • Insights into how parameter tuning (lambda, iterations, rank) affects performance
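The parameters named above (lambda, iterations, rank) match those of the ALS estimator in Spark MLlib, so a tuning loop could be sketched as follows. This is an assumption: the README lists user-based collaborative filtering as the algorithm and does not confirm that ALS was used. The helper names `param_grid` and `evaluate_als`, and the column names `userId`, `movieId`, and `rating`, are hypothetical.

```python
from itertools import product


def param_grid(ranks, reg_params, max_iters):
    """Every (rank, regParam, maxIter) combination as a keyword dict."""
    return [
        {"rank": rank, "regParam": lam, "maxIter": iters}
        for rank, lam, iters in product(ranks, reg_params, max_iters)
    ]


def evaluate_als(train_df, test_df, params):
    """Fit MLlib ALS with `params` and return RMSE on `test_df`.

    Requires pyspark; both arguments are Spark DataFrames with the
    assumed columns userId, movieId, and rating.
    """
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    als = ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop",  # drop NaN predictions for unseen users/movies
        **params,
    )
    model = als.fit(train_df)
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    return evaluator.evaluate(predictions)
```

A typical use would be to call `evaluate_als` once per entry of `param_grid([5, 10, 20], [0.01, 0.1], [10, 20])` and keep the combination with the lowest RMSE. Here lambda corresponds to `regParam`, iterations to `maxIter`, and rank to the number of latent factors.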

🔍 Research & References

We've drawn inspiration and technical strategies from several key works. For the full IEEE-style paper and its references, check the documentation folder in this repo :)

🧠 Authors & Credits

Built with ❤️ by a team of graduate students as part of our coursework under the guidance of our incredible supervisor (see acknowledgments in paper). Shoutout to all contributors and cited researchers!

📌 Future Work

  • 🧠 Incorporating hybrid models (content + collaborative)
  • 🔒 Introducing privacy-preserving mechanisms
  • 🎯 Deploying the system on a cloud platform for live inference

📎 License

Feel free to fork, star, and remix with credit!

📁 Project Structure

📦 PySparkFlicks_MovieRecommender/

├── 🧠 code/                  → PySpark code and scripts
├── 📒 notebooks/             → Jupyter Notebooks for exploration
├── 📊 data/                  → Sample Netflix dataset
├── 📄 documentation/         → IEEE paper, diagrams, and references
├── ⚙️ .github/workflows/     → CI/CD workflows (Python)
├── 📦 requirements.txt       → Python dependencies
├── 🛠️ setup.py               → Installable package setup (optional)
├── 📘 README.md              → This very file
└── 🧾 LICENSE                → Open-source license
