gaetanoantonicchio/Distributed-Data-Analysis-and-Mining

Project in Distributed Data Analysis & Mining

U.S. Air Pollution - Data Analysis in Apache Spark

                     

The goal of the project is to analyze, cluster, and classify surveys of U.S. air pollution levels recorded from 2000 to 2016, working in a distributed, parallel environment.

The data analysis and all classification and regression tasks were performed with Apache Spark on a dataset of 1.7 million records.

Regression was first applied to the data to extract and construct an engineered dataset, which was then used for classification with Random Forest and for clustering with K-Means.
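To give an idea of why K-Means suits a distributed environment, here is a minimal sketch in plain Python. It is not the project's code, and in practice Spark's own K-Means implementation would do this work. The function names (`partial_sums`, `merge`, `kmeans`) and the sample points are hypothetical, chosen only for illustration. Each partition computes per-cluster sums and counts on its own (the map step), and the partial results are then combined to update the centroids (the reduce step).

```python
from functools import reduce
from math import dist


def partial_sums(partition, centroids):
    """Map step: per-cluster (coordinate sums, counts) for one data partition."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        # Index of the nearest centroid by Euclidean distance.
        k = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        counts[k] += 1
        for d in range(dims):
            sums[k][d] += point[d]
    return sums, counts


def merge(a, b):
    """Reduce step: combine the partial results of two partitions."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans(partitions, centroids, iterations=10):
    """Lloyd's algorithm over data that is already split into partitions."""
    for _ in range(iterations):
        sums, counts = reduce(merge, (partial_sums(p, centroids) for p in partitions))
        # Keep the previous centroid when a cluster received no points.
        centroids = [
            tuple(s / n for s in total) if n else old
            for total, n, old in zip(sums, counts, centroids)
        ]
    return centroids


# Hypothetical two-feature points, split into two partitions.
partitions = [
    [(1.0, 1.0), (1.0, 2.0)],
    [(9.0, 9.0), (10.0, 9.0)],
]
result = kmeans(partitions, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)  # [(1.0, 1.5), (9.5, 9.0)]
```

Only the small per-cluster sums and counts travel between partitions, never the raw points, which is what lets the approach scale to millions of records.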
