| 1 | +# ***************************** |
| 2 | +# ** Data Science - Road Map ** |
| 3 | +# ***************************** |
| 4 | +# |
| 5 | +# - Model Project: https://www.kaggle.com/code/dsfelix/spaceship-titanic-competition |
| 6 | +# |
| 7 | + |
| 8 | +---- 0 - Documentation ---- |
| 9 | + |
| 10 | +\ Problem Description (context and main goal) |
| 11 | +\ Files Descriptions (train/validation/test datasets, submissions, how the datasets have been extracted, ...) |
| 12 | +\ Variables (name and description) |
| 13 | +\ Target Variable (name, description and values examples) |
| 14 | +\ Model Evaluation Metric (description, goal, equation and example) |
| 15 | +\ Dataset Limitations |
| 16 | +\ Goals |
| 17 | +\ Setup (tools, packages and commands to install the packages) |
| 18 | +\ Acknowledgments |
| 19 | + |
| 20 | + |
| 21 | + |
| 22 | +---- 1 - Descriptive Analysis ---- |
| 23 | + |
| 24 | +\ Import and Set up Libraries, Create Constants |
| 25 | +\ Read Dataset, Parse Dates and Set the Character Encoding if Needed |
| 26 | +\ Check out Dataset Shape |
| 27 | +\ Split Dataset into Features and Target |
| 28 | +\ Check out Missing Values (numbers and plots) |
| 29 | +\ Check out Data Types and Make Conversions if Needed |
| 30 | +\ Check out String Features (convert them to lower case, treat duplicated sibling values and fix typos) |
| 31 | +\ Convert String Features into Categorical Features |
| 32 | +\ Check out Data Leakage (Target Leakage, Train-Test Contamination and Stratification) and Drop Features |
| 33 | +\ Treat Time Series Features |
| 34 | +\ Treat GeoSpatial Features |
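
The first steps of this section can be sketched as below. This is a minimal illustration, not the model project's code: the toy DataFrame, the column names and the `TARGET` and `STRING_COLS` constants are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real training file.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "earth ", "Mars", None, "Europa"],
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0],
    "Transported": [False, True, False, True, True],
})

TARGET = "Transported"        # assumed target column name
STRING_COLS = ["HomePlanet"]  # assumed list of string features

# Check out the dataset shape: (rows, columns).
print(df.shape)

# Split the dataset into features and target.
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Missing values per feature, as counts and as a share of rows.
missing = X.isna().sum().to_frame("n_missing")
missing["share"] = missing["n_missing"] / len(X)
print(missing)

# Trim and lower-case string features so "Earth" and "earth " become
# one value, then store them as categorical features.
for col in STRING_COLS:
    X[col] = X[col].str.strip().str.lower().astype("category")

print(X.dtypes)
```

With a real file, the read step would typically be `pd.read_csv` with its `parse_dates` and `encoding` parameters in place of the inline DataFrame.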
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +---- 2 - Statistical Analysis ---- |
| 39 | + |
| 40 | +\ Use Describe() Function for Numerical and Categorical Features (add new Descriptive Statistics into it, like Standard Error, Variance, Median, Median Absolute Deviation, Skewness, Kurtosis and Range) |
| 41 | + |
| 42 | +\ Check out the Numerical Features Histogram to see which Distribution they have |
| 43 | + |
| 44 | +\ Check out the Categorical Features Value_Counts to see which Distribution they have |
| 45 | + |
| 46 | +\ Create Frequency and Cross Tables for Categorical Features |
| 47 | + |
| 48 | +\ Check out the Correlation between the Numerical Features (use corr() function in pandas and create HeatMaps Plots; check out the need to exclude Na/Missing Values) |
| 49 | + |
| 50 | +\ Run Hypothesis Tests at a 95% Confidence Level (T-Test, Z-Test, ANOVA and Chi-Squared Test) |
| 51 | + |
| 52 | +\ Use Regression Analysis to model the relationship between numerical |
| 53 | +columns and a categorical or numerical target column (Linear Regression, Logistic Regression and K-Means Clusters) |
| 54 | + |
| 55 | +\ Generate Summary Reports (Autoviz and Pandas Report Libraries) |
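
Two of the steps in this section, the extended describe() table and a chi-squared test on a cross table, might look like the sketch below. The data and column names are made up for illustration, and the 0.05 significance level is simply the complement of the 95% confidence level mentioned in the list.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data; column names are assumptions.
df = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0, 26.0, 28.0],
    "VIP": ["no", "no", "yes", "no", "no", "yes", "no", "yes"],
    "Transported": [False, True, False, False, True, True, True, False],
})

# Extend describe() with extra descriptive statistics.
num = df[["Age"]]
stats = num.describe().T
stats["variance"] = num.var()
stats["std_error"] = num.sem()
stats["median"] = num.median()
stats["mad"] = (num - num.median()).abs().median()  # median absolute deviation
stats["skewness"] = num.skew()
stats["kurtosis"] = num.kurt()
stats["range"] = num.max() - num.min()
print(stats.T)

# Cross table of two categorical columns, then a chi-squared test
# of independence between them.
table = pd.crosstab(df["VIP"], df["Transported"])
chi2, p_value, dof, expected = chi2_contingency(table)

ALPHA = 0.05  # 95% confidence level
reject_null = p_value < ALPHA
print(table)
print(f"chi2={chi2:.3f} p={p_value:.3f} dof={dof} reject={reject_null}")
```

A sample this small is only good for showing the calls; the chi-squared approximation is unreliable when expected cell counts are low.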
| 56 | + |
| 57 | + |
| 58 | + |
| 59 | +---- 3 - Data Transformations ---- |
| 60 | + |
| 61 | +\ Split Dataset into Training and Validation |
| 62 | +\ Filter Good and Bad Labels |
| 63 | +\ Check the Need to use Imputers, Encoders, Label Encoders and Standardizations |
| 64 | +\ Check for Outliers |
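
A minimal sketch of the split and the outlier check, assuming a stratified 80/20 split and the 1.5 x IQR rule. Neither choice comes from the roadmap itself, and the `Spend` column and the planted outlier are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature with one planted outlier.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Spend": rng.normal(100, 10, size=100)})
X.loc[0, "Spend"] = 1000.0
y = pd.Series(np.arange(100) % 2, name="target")

# Split into training and validation, keeping the class balance.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# IQR rule: compute the bounds on the training split only, so the
# validation rows do not influence the thresholds.
q1, q3 = X_train["Spend"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (X["Spend"] < lower) | (X["Spend"] > upper)
print(f"bounds=({lower:.1f}, {upper:.1f}) flagged={int(is_outlier.sum())}")
```

Whether flagged rows are dropped, capped or kept depends on the problem; the sketch only marks them.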
| 65 | + |
| 66 | + |
| 67 | + |
| 68 | +---- 4 - Feature Engineering ---- |
| 69 | + |
| 70 | +\ Mutual Information (MI) |
| 71 | +\ K-Means Clustering and Elbow Method (biggest drop in the plot) |
| 72 | +\ Principal Component Analysis (PCA) |
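
The three techniques above can be tried on synthetic data as in this sketch. It uses scikit-learn, which is an assumption about tooling, and the dataset from `make_classification` is a placeholder for real features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix and binary target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# Mutual Information between each feature and the target.
mi_scores = mutual_info_classif(X_scaled, y, random_state=0)
print("MI:", mi_scores.round(3))

# Elbow method: fit K-Means for several k and record the inertia.
# The "elbow" is the k after which the inertia stops dropping sharply.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# PCA: share of variance explained by each component.
pca = PCA(n_components=3).fit(X_scaled)
explained = pca.explained_variance_ratio_
print("explained variance ratio:", explained.round(3))
```

Cluster labels (`KMeans.predict`) and component scores (`pca.transform`) can then be appended to the dataset as new features.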
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +---- 5 - Base Models ---- |
| 77 | + |
| 78 | +\ Pipelines (Imputers, Encoders, Label Encoders and Standardizations) |
| 79 | +\ Create Simple Models |
| 80 | +\ Create XGBoost Models |
| 81 | +\ Create Deep Learning Models |
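
A minimal pipeline sketch, assuming scikit-learn. The columns and values are hypothetical, and logistic regression stands in for the "simple model"; a gradient-boosting or neural-network estimator would replace the final step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types.
X = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0, 44.0, 26.0, 28.0],
    "HomePlanet": ["earth", "mars", "earth", np.nan,
                   "europa", "earth", "mars", "europa"],
})
y = pd.Series([0, 1, 0, 0, 1, 1, 1, 0])

# Numerical columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Age"]),
    ("cat", categorical, ["HomePlanet"]),
])

# Preprocessing and model in one object, so the same steps run at
# fit time and at prediction time.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
preds = model.predict(X)
print(preds)
```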
| 82 | + |
| 83 | + |
| 84 | + |
| 85 | +---- 6 - Evaluating Models ---- |
| 86 | + |
| 87 | +\ Cross-Validation |
| 88 | +\ Evaluation Metric |
| 89 | +\ Overfitting and Underfitting |
| 90 | +\ Choose the Best Model |
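
One way to combine these steps is to cross-validate each candidate, compare training and validation scores, and keep the best validation score. The sketch below assumes accuracy as the metric and two arbitrary candidate models on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidates; any estimators could be listed here.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    cv = cross_validate(
        estimator, X, y, cv=5, scoring="accuracy", return_train_score=True
    )
    results[name] = {
        "train": cv["train_score"].mean(),
        "valid": cv["test_score"].mean(),
    }
    # A large train/validation gap suggests overfitting;
    # low scores on both suggest underfitting.
    gap = results[name]["train"] - results[name]["valid"]
    print(f"{name}: train={results[name]['train']:.3f} "
          f"valid={results[name]['valid']:.3f} gap={gap:.3f}")

best_name = max(results, key=lambda n: results[n]["valid"])
print("best:", best_name)
```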
| 91 | + |
| 92 | + |
| 93 | + |
| 94 | +---- 7 - Model Explainability ---- |
| 95 | + |
| 96 | +\ Permutation Importance |
| 97 | +\ Summary Plots |
| 98 | +\ Partial Plots |
| 99 | +\ Contribution/Dependence Plots |
| 100 | + |
| 101 | + |
| 102 | + |
| 103 | +---- 8 - Making Predictions ---- |
| 104 | + |
| 105 | +\ Make Predictions with Test Dataset |
| 106 | +\ Export Model in Pickle Format |
| 107 | +\ Load Model in Pickle Format |
| 108 | +\ Create a Simple Kernel to use the model and make Predictions |
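
The export and load steps can be sketched with the standard `pickle` module. The model, the data and the `model.pkl` path are placeholders chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the train and test datasets.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X_train, y_train, X_test = X[:80], y[:80], X[80:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

# Export the fitted model in pickle format (hypothetical file name).
model_path = Path(tempfile.gettempdir()) / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts the same labels.
with model_path.open("rb") as f:
    loaded = pickle.load(f)
loaded_preds = loaded.predict(X_test)
print((preds == loaded_preds).all())
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that wrote them.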
| 109 | + |
| 110 | + |
| 111 | + |
| 112 | +---- 9 - Reach Me Section ---- |
| 113 | + |
| 114 | +\ E-mail |
| 115 | +\ LinkedIn |
| 116 | +\ Portfolio |
| 117 | +\ GitHub |
| 118 | +\ Kaggle |