Project 1 of 6 | Pushed as part of my academic + real-world ML portfolio
In the ever-growing jungle of streaming content, users often get lost in endless scrolls and mediocre suggestions. Our project dives into solving this problem by building a personalized movie recommendation system powered by collaborative filtering and Apache Spark, capable of processing massive datasets and giving spot-on suggestions based on user behavior.
- Personalized suggestions based on user-item interaction
- Built with PySpark on Apache Spark for large-scale performance
- Evaluated using RMSE, precision, and recall
- Scalable, fast, and adaptable to various streaming platforms
- Acknowledges bias and privacy issues in recommender systems
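To make the evaluation metrics concrete, here is a minimal pure-Python sketch of RMSE and precision/recall at k. The function names, the toy ratings, and the choice of "top-k against a set of relevant movies" are illustrative assumptions, not the project's actual evaluation code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against a set of relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (hypothetical): three held-out ratings and one user's ranked recommendations
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
recommended = ["m1", "m2", "m3", "m4"]
relevant = {"m2", "m4", "m9"}

print(rmse(actual, predicted))
print(precision_recall_at_k(recommended, relevant, k=3))
```

RMSE measures how far predicted ratings are from the true ones, while precision and recall measure how well the ranked list surfaces movies the user actually liked, so the two kinds of metric can disagree.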
- Language: Python
- Frameworks: PySpark, Apache Hadoop (HDFS)
- Tools: MLlib, Jupyter, VS Code
- Algorithm: User-based Collaborative Filtering
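As a rough illustration of user-based collaborative filtering, the sketch below predicts a rating as a similarity-weighted average over the most similar users. It is plain Python rather than the project's PySpark code, and the cosine similarity over co-rated movies, the neighbourhood size `k`, and the toy ratings are all assumptions made for the example.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)
    norm_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, movie, k=2):
    """Similarity-weighted average of the k most similar users' ratings for `movie`."""
    neighbours = []
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim > 0:
            neighbours.append((sim, other_ratings[movie]))
    top = sorted(neighbours, reverse=True)[:k]
    if not top:
        return None  # no similar user has rated this movie
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Toy ratings (hypothetical): alice agrees with bob and disagrees with carol
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0},
    "bob": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m3": 2.0},
}
prediction = predict_rating(ratings, "alice", "m3")
print(prediction)
```

Because bob's tastes match alice's more closely than carol's do, the prediction for `m3` lands nearer to bob's rating of 4.0 than to carol's 2.0.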
- Contains over 8,000 user interactions and movie ratings
- Publicly sourced, includes diverse genres, languages, and release years
- Preprocessing steps include handling nulls, normalization, and outlier removal
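The preprocessing steps listed above could look roughly like this. The README does not say which normalization or outlier rule was used, so the min-max scaling, the 0.5 to 5.0 valid rating range, and the `userId` / `movieId` / `rating` field names are assumptions for this sketch.

```python
def preprocess(rows, min_rating=0.5, max_rating=5.0):
    """Drop incomplete rows, drop out-of-range ratings, then min-max scale ratings to [0, 1]."""
    cleaned = []
    for row in rows:
        user = row.get("userId")
        movie = row.get("movieId")
        rating = row.get("rating")
        # Handle nulls: skip any row missing a key field
        if user is None or movie is None or rating is None:
            continue
        rating = float(rating)
        # Outlier removal (assumed rule): discard ratings outside the valid scale
        if not (min_rating <= rating <= max_rating):
            continue
        # Normalization (assumed rule): min-max scale onto [0, 1]
        scaled = (rating - min_rating) / (max_rating - min_rating)
        cleaned.append({"userId": user, "movieId": movie, "rating": scaled})
    return cleaned


# Toy rows (hypothetical): one null movieId and one impossible rating get dropped
rows = [
    {"userId": 1, "movieId": 10, "rating": 5.0},
    {"userId": 1, "movieId": None, "rating": 3.0},
    {"userId": 2, "movieId": 11, "rating": 9.0},
    {"userId": 3, "movieId": 12, "rating": 0.5},
]
cleaned = preprocess(rows)
print(cleaned)
```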
This project uses the MovieLens 20M dataset (https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset), which contains millions of user-movie interactions, ratings, and metadata.
For quick testing, a sample dataset (netflix_titles.csv) is included in the /data folder.
To use the full dataset:
- Sign in to Kaggle
- Download the dataset from the link above
- Place it in the root directory or update the path in the code accordingly
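One way to keep the dataset path easy to change is to resolve it in a single helper. This is only a sketch: the `MOVIELENS_RATINGS` environment variable and the `rating.csv` default file name are hypothetical choices, not names taken from the project's code.

```python
import os
from pathlib import Path


def resolve_ratings_path(default="rating.csv", env_var="MOVIELENS_RATINGS"):
    """Return the ratings file path, preferring an environment variable over the default."""
    path = Path(os.environ.get(env_var, default))
    if not path.is_file():
        raise FileNotFoundError(
            f"Ratings file not found at '{path}'. Download the dataset and place it "
            f"there, or set {env_var} to its location."
        )
    return path
```

Calling `resolve_ratings_path()` from the project root then works either with the file placed there or with the variable pointing elsewhere.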
- Achieved RMSE = 3.7725 on our baseline implementation
- Compared against a benchmark paper that reports RMSE = 1.0742
- Insights into how parameter tuning (lambda, iterations, rank) affects performance
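The parameters named above (lambda, iterations, rank) line up with the `regParam`, `maxIter`, and `rank` arguments of MLlib's ALS estimator. The tuning code itself is not shown here, so the following is only a sketch that assumes ALS is the model being tuned, that the DataFrames use MovieLens-style `userId`, `movieId`, and `rating` columns, and that a simple grid search is acceptable. The helper names are hypothetical.

```python
from itertools import product


def build_param_grid(ranks, reg_params, max_iters):
    """Every combination of rank, regularization (lambda), and iteration count."""
    return [
        {"rank": r, "regParam": lam, "maxIter": it}
        for r, lam, it in product(ranks, reg_params, max_iters)
    ]


def fit_and_score(train_df, test_df, params):
    """Train ALS with one parameter set and return the RMSE on the test split."""
    # Imported lazily so the grid helper above works without a Spark install
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    als = ALS(
        userCol="userId",          # assumed column names
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop",  # drop NaN predictions for unseen users/movies
        **params,
    )
    model = als.fit(train_df)
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    return evaluator.evaluate(predictions)


def tune(train_df, test_df, grid):
    """Return (best_rmse, best_params) over the grid; lower RMSE is better."""
    scored = [(fit_and_score(train_df, test_df, p), p) for p in grid]
    return min(scored, key=lambda pair: pair[0])


grid = build_param_grid(ranks=[5, 10], reg_params=[0.01, 0.1], max_iters=[10])
print(len(grid), "parameter combinations")
```

With Spark DataFrames in hand, `tune(train_df, test_df, grid)` would report which combination gives the lowest RMSE, which is one way to see how each parameter affects performance.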
We've drawn inspiration and technical strategies from key works including:
For the full IEEE-style paper, check the documentation folder in this repo :)
Built with ❤️ by a team of graduate students as part of our coursework under the guidance of our incredible supervisor (see acknowledgments in paper). Shoutout to all contributors and cited researchers!
- Incorporating hybrid models (content + collaborative)
- Introducing privacy-preserving mechanisms
- Deploying the system on a cloud platform for live inference
Feel free to fork, star, and remix with credit!
PySparkFlicks_MovieRecommender/
├── code/ - PySpark code and scripts
├── notebooks/ - Jupyter Notebooks for exploration
├── data/ - Sample Netflix dataset
├── documentation/ - IEEE paper, diagrams, and references
├── .github/workflows/ - CI/CD workflows (Python)
├── requirements.txt - Python dependencies
├── setup.py - Installable package setup (optional)
├── README.md - This very file
└── LICENSE - Open-source license