This semester project analyzes large datasets using data processing techniques built on Apache Hadoop (version >= 3.0) and Apache Spark (version >= 3.5). Its objectives are:
- Become familiar with installing and managing distributed Apache Spark and Apache Hadoop systems.
- Apply modern techniques through Spark APIs for big data analysis.
- Understand the capabilities and limitations of these tools in relation to available resources and configurations.
The project was hosted in a specially configured environment in the AWS cloud. The code was developed and tested in Amazon SageMaker AI notebooks, using S3 buckets for storage.
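As a rough illustration of this setup, the sketch below shows one way a notebook might load a CSV file from S3 into a Spark DataFrame. It is not taken from the project code: the helper names `s3a_uri` and `read_csv_from_s3`, the application name, and the choice of the `s3a://` scheme are all assumptions, and the S3 connector and credentials are assumed to be already configured in the environment.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI from a bucket name and an object key (hypothetical helper)."""
    return f"s3a://{bucket.strip('/')}/{key.lstrip('/')}"


def read_csv_from_s3(bucket: str, key: str):
    """Load a CSV object from S3 into a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so that s3a_uri can be used without Spark installed.
    from pyspark.sql import SparkSession

    # Reuse the active session if the notebook already has one.
    spark = SparkSession.builder.appName("s3-read-sketch").getOrCreate()

    # header=True takes column names from the first row;
    # inferSchema=True costs an extra pass over the data to guess column types.
    return spark.read.csv(s3a_uri(bucket, key), header=True, inferSchema=True)
```

A call such as `read_csv_from_s3("example-bucket", "data/sample.csv")` would return a DataFrame; the bucket and key here are placeholders, not the project's actual storage locations.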
The assignment presentation can be seen here: project_eng_2024.pdf
The final report can be seen here: Report.pdf
For more details and code examples, please refer to the Jupyter Notebook: advanced_db_project_2024.ipynb