aedoardo/asteroids-prediction

Potentially Hazardous Asteroids Classification and Diameter prediction

This repository contains the project for the Big Data course at La Sapienza.

The project consists of two tasks. The first is a binary classification that decides whether an asteroid is potentially hazardous. The second is a regression that predicts an asteroid's diameter. For the classification task, I used two different datasets: one containing all the asteroids, and one containing only the asteroids that are Near Earth Objects.

Dataset

The dataset I used is available on Kaggle. The same data can also be queried on the JPL website.

Potentially Hazardous Asteroids Classification

For this task I used three different models:

  • logistic regression;
  • decision tree;
  • random forest.
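
The three models listed above could be set up in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the repository's actual code: the helper names `CLASSIFIER_NAMES` and `build_classifier_pipeline`, the `label` column, and the optional standardization step (suggested only by the `std`/`no_std` suffixes in the model names) are all hypothetical.

```python
# Hypothetical sketch: helper names and column names are assumptions,
# not taken from the repository.

# Maps a short key to the corresponding class name in pyspark.ml.classification.
CLASSIFIER_NAMES = {
    "logistic_regression": "LogisticRegression",
    "decision_tree": "DecisionTreeClassifier",
    "random_forest": "RandomForestClassifier",
}


def build_classifier_pipeline(model_key, feature_cols, standardize=False,
                              label_col="label"):
    """Return an unfitted Pipeline: assemble features, optionally scale, classify.

    Must be called with an active SparkSession (as on Databricks).
    """
    # Imported here so this module can be loaded without PySpark installed.
    import pyspark.ml.classification as classification
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    # Collect the numeric input columns into a single vector column.
    stages = [VectorAssembler(inputCols=feature_cols, outputCol="raw_features")]
    features_col = "raw_features"

    if standardize:
        # Zero mean and unit variance per feature.
        stages.append(StandardScaler(inputCol="raw_features",
                                     outputCol="scaled_features",
                                     withMean=True, withStd=True))
        features_col = "scaled_features"

    estimator_cls = getattr(classification, CLASSIFIER_NAMES[model_key])
    stages.append(estimator_cls(featuresCol=features_col, labelCol=label_col))
    return Pipeline(stages=stages)
```

With a training DataFrame at hand, something like `build_classifier_pipeline("random_forest", feature_cols, standardize=True).fit(train_df)` would produce a fitted model; `feature_cols` and `train_df` are placeholders here.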

Results

Only the best result for each model is reported here. The complete set of results is available in the repository.

Dataset with all asteroids

| Model | Best model name | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | logistic_regression_no_std_all | 0.99 | 0.70 | 0.58 | 0.64 |
| Decision Tree | decision_tree_oversampled_no_std_all | 0.99 | 0.61 | 0.99 | 0.75 |
| Random Forest | random_forest_oversampled_no_std_all | 0.97 | 0.54 | 0.98 | 0.70 |
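
The gap between accuracy and precision in the table above is what one would expect if hazardous asteroids are a small minority of the full dataset (an assumption, since the class balance is not stated here). The sketch below computes the four reported metrics from confusion-matrix counts using the standard definitions. The counts are invented purely for illustration and are not the project's data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only (NOT the project's results):
# 100 positive samples among 10,000 in total.
metrics = classification_metrics(tp=80, fp=40, fn=20, tn=9860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because true negatives dominate the total, accuracy stays close to 1 even though one in three positive predictions is wrong, which is why precision, recall and F1 are the more informative columns for the imbalanced case.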

Dataset with only Near Earth Objects

| Model | Best model name | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | logistic_regression_std_nea_nea | 0.95 | 0.85 | 0.86 | 0.86 |
| Decision Tree | dec_tree_no_std_nea_nea | 0.99 | 0.98 | 0.98 | 0.98 |
| Random Forest | random_forest_oversampled_std_nea_nea | 0.91 | 0.74 | 0.90 | 0.81 |

Diameter prediction

For this task I also used three different models:

  • linear regression;
  • random forest regressor;
  • gradient-boosted tree.

For this task, I built a dataset containing all the asteroids whose diameter field is not null.
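
In PySpark, this filtering step and the three regressors might look like the sketch below. It is an illustration under stated assumptions rather than the repository's code: the column names `diameter` and `features`, and the helpers `REGRESSOR_NAMES`, `keep_known_diameters` and `build_regressor`, are hypothetical.

```python
# Hypothetical sketch: helper names and column names are assumptions,
# not taken from the repository.

# Maps a short key to the corresponding class name in pyspark.ml.regression.
REGRESSOR_NAMES = {
    "linear_regression": "LinearRegression",
    "random_forest": "RandomForestRegressor",
    "gradient_boosted_tree": "GBTRegressor",
}


def keep_known_diameters(df, diameter_col="diameter"):
    """Return only the rows of a Spark DataFrame whose diameter is not null."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col(diameter_col).isNotNull())


def build_regressor(model_key, features_col="features", label_col="diameter"):
    """Return an unfitted regressor; requires an active SparkSession."""
    import pyspark.ml.regression as regression

    estimator_cls = getattr(regression, REGRESSOR_NAMES[model_key])
    return estimator_cls(featuresCol=features_col, labelCol=label_col)
```

The regressors expect a single vector column of features, so a `VectorAssembler` stage would normally precede them in a pipeline.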

| Model | Training RMSE | Training R² | Training adj. R² | Test RMSE | Test R² | Test adj. R² |
|---|---|---|---|---|---|---|
| Linear Regression | 5.832 | 0.614 | 0.613 | 6.589 | 0.559 | 0.556 |
| Random Forest Regressor | 3.329 | 0.874 | 0.874 | 6.303 | 0.597 | 0.593 |
| Gradient-Boosted Tree | 3.198 | 0.884 | 0.884 | 7.006 | 0.502 | 0.497 |
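
For reference, the three regression metrics in the table can be computed as in the sketch below. It uses the standard textbook definitions; how the adjusted R² was actually computed in the project is not stated here, so the formula and the function names are assumptions. The sample values are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2, which penalizes R2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative values only (NOT the project's data).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
r2 = r2_score(y_true, y_pred)
print(rmse(y_true, y_pred), r2, adjusted_r2(r2, n_samples=4, n_features=1))
```

With many samples and few features, the adjusted R² stays very close to the plain R², which is consistent with the nearly identical columns in the table.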

Environment

All tasks were implemented with PySpark on Databricks (using the community plan).

About

Project for Big Data course 2020-2021
