This repository contains the project for the Big Data course at La Sapienza.
The project consists of two tasks: the first is a binary classification task that decides whether an asteroid is potentially hazardous, while the second is a regression task that predicts the asteroids' diameter. For the classification problem I used two different datasets: the first contains all the asteroids, while the second contains only the asteroids that are Near Earth Objects.
The dataset I used is available on Kaggle at this link; you can also query it on the JPL website.
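To give an idea of the setup, here is a minimal PySpark loading sketch. The file name and the `pha`/`neo` flag columns are assumptions based on the JPL/Kaggle schema, not a verbatim excerpt from the notebooks:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("asteroids").getOrCreate()

# Load the Kaggle CSV (hypothetical file name).
df = spark.read.csv("asteroids.csv", header=True, inferSchema=True)

# Binary target: 1.0 if the asteroid is flagged as potentially hazardous.
df = df.withColumn("label", when(col("pha") == "Y", 1.0).otherwise(0.0))

# Second dataset: only Near Earth Objects (assuming a `neo` flag column).
df_nea = df.filter(col("neo") == "Y")
```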
For the classification task I used three different models (a minimal training sketch follows the list):
- logistic regression;
- decision tree;
- random forest.
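The actual runs live in the notebooks; as an illustration, a Spark MLlib pipeline along these lines would train the three models. The feature columns are hypothetical, and the `StandardScaler` stage mirrors the `std`/`no_std` variants in the result names:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import (
    LogisticRegression, DecisionTreeClassifier, RandomForestClassifier)

# Hypothetical feature subset from the orbital-element columns.
assembler = VectorAssembler(
    inputCols=["H", "e", "a", "i", "moid"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

models = {
    "logistic_regression": LogisticRegression(featuresCol="features"),
    "decision_tree": DecisionTreeClassifier(featuresCol="features"),
    "random_forest": RandomForestClassifier(featuresCol="features"),
}

train, test = df.randomSplit([0.8, 0.2], seed=42)
fitted = {
    name: Pipeline(stages=[assembler, scaler, clf]).fit(train)
    for name, clf in models.items()
}
```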
I report only the best result for each model; all the results are available in this folder.

Results on the full dataset (all the asteroids):

| Model | Best model name | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | logistic_regression_no_std_all | 0.99 | 0.70 | 0.58 | 0.64 |
| Decision Tree | decision_tree_oversampled_no_std_all | 0.99 | 0.61 | 0.99 | 0.75 |
| Random Forest | random_forest_oversampled_no_std_all | 0.97 | 0.54 | 0.98 | 0.70 |
Results on the dataset with only the Near Earth Objects:

| Model | Best model name | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | logistic_regression_std_nea_nea | 0.95 | 0.85 | 0.86 | 0.86 |
| Decision Tree | dec_tree_no_std_nea_nea | 0.99 | 0.98 | 0.98 | 0.98 |
| Random Forest | random_forest_oversampled_std_nea_nea | 0.91 | 0.74 | 0.90 | 0.81 |
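The `oversampled` runs point to minority-class rebalancing before training. A naive sketch of that step, plus the metric computation, could look like this (per-class precision/recall via `metricLabel` requires Spark 3+; whether the reported numbers are per-class or weighted is my assumption):

```python
from pyspark.sql.functions import col
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Naive oversampling: duplicate minority-class rows until classes balance.
majority = train.filter(col("label") == 0.0)
minority = train.filter(col("label") == 1.0)
ratio = majority.count() / minority.count()
train_oversampled = majority.unionAll(
    minority.sample(withReplacement=True, fraction=ratio, seed=42))

# Evaluate a fitted model on the held-out test set.
preds = fitted["decision_tree"].transform(test)
evaluator = MulticlassClassificationEvaluator(metricLabel=1.0)
for metric in ["accuracy", "precisionByLabel", "recallByLabel", "f1"]:
    print(metric, round(evaluator.setMetricName(metric).evaluate(preds), 2))
```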
For the regression task I also tried three different approaches:
- linear regression;
- random forest regressor;
- gradient-boosted tree regressor.
For this task I built the dataset from all the asteroids whose diameter field is not null; a minimal PySpark sketch of the setup follows.
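The sketch below shows the dataset construction and the three regressors, under the same assumptions as above (the feature columns are hypothetical):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import (
    LinearRegression, RandomForestRegressor, GBTRegressor)
from pyspark.sql.functions import col

# Keep only the asteroids with a known diameter.
df_diam = df.filter(col("diameter").isNotNull())

assembler = VectorAssembler(
    inputCols=["H", "albedo", "e", "a"], outputCol="features")

regressors = {
    "linear_regression": LinearRegression(labelCol="diameter"),
    "random_forest": RandomForestRegressor(labelCol="diameter"),
    "gradient_boosted_tree": GBTRegressor(labelCol="diameter"),
}

train_r, test_r = df_diam.randomSplit([0.8, 0.2], seed=42)
fitted_r = {name: Pipeline(stages=[assembler, reg]).fit(train_r)
            for name, reg in regressors.items()}
```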
| Model | Training RMSE | Training R2 | Training R2 Adj. | Test RMSE | Test R2 | Test R2 Adj. |
|---|---|---|---|---|---|---|
| Linear Regression | 5.832 | 0.614 | 0.613 | 6.589 | 0.559 | 0.556 |
| Random Forest regressor | 3.329 | 0.874 | 0.874 | 6.303 | 0.597 | 0.593 |
| Gradient boosted tree | 3.198 | 0.884 | 0.884 | 7.006 | 0.502 | 0.497 |
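RMSE and R2 come straight from `RegressionEvaluator`; adjusted R2 is not built into Spark, so a manual computation along these lines is presumably what produced the adjusted columns, with n the number of test rows and p the number of predictors:

```python
from pyspark.ml.evaluation import RegressionEvaluator

preds_r = fitted_r["linear_regression"].transform(test_r)
evaluator = RegressionEvaluator(labelCol="diameter")

rmse = evaluator.setMetricName("rmse").evaluate(preds_r)
r2 = evaluator.setMetricName("r2").evaluate(preds_r)

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
n = preds_r.count()
p = len(assembler.getInputCols())
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"RMSE={rmse:.3f}  R2={r2:.3f}  R2_adj={r2_adj:.3f}")
```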
Both tasks were implemented with PySpark on Databricks (I used the free Community Edition).