Commit 5748d80: Initial Commit
1 parent abde30f commit 5748d80

File tree: 58 files changed, +3099 −0 lines changed
(binary file added, 284 KB)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
"""
Overfitting and Underfitting

- Overfitting:

/ Training Step: great results;
/ Validation Step: poor results;

- Underfitting:

/ Training Step: poor results;
/ Validation Step: poor results;

PS.: for more information, see the image '0 - Overfittting and Underfitting.png'
"""
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
"""
- Exploratory Analysis Steps:

/ read the data with pandas
/ get the shape
/ get the describe method
/ split the data into target (y) and features (X)
/ split the target and feature variables into train and validation sets
/ treat missing values
/ check the column types and convert them if needed
/ convert categorical variables to numeric ones
/ check for relationships and correlations (plots with matplotlib and seaborn)

"""

import pandas as pd

home_data = pd.read_csv('filepath')

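# A rough sketch of the checklist steps that are not shown elsewhere in
# this file (shape, column types, categorical encoding, correlation plot).
# Assumes 'home_data' was loaded above; 'home_data_encoded' is an
# illustrative name, and the plot needs matplotlib and seaborn installed.
import matplotlib.pyplot as plt
import seaborn as sns

print(home_data.shape)     # (rows, columns)
print(home_data.dtypes)    # check which columns need conversion

# convert categorical (object) columns into numeric dummy columns
home_data_encoded = pd.get_dummies(home_data)

# correlation heatmap between the numeric columns
sns.heatmap(home_data_encoded.corr(), cmap='coolwarm')
plt.show()
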
# \ count >> number of rows with NON-MISSING values
# \ mean >> average value of the column
# \ std >> standard deviation (how spread out the data is)
# \ min, 25%, 50%, 75% and max >> percentiles
#
# \ low outlier <= Q1 - (1.5 * IQR)
# \ high outlier >= Q3 + (1.5 * IQR)

home_data.describe()

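# A minimal sketch of the outlier rule above, assuming 'Price' is a
# numeric column of home_data (it is used as the target later in this file).
q1 = home_data['Price'].quantile(0.25)
q3 = home_data['Price'].quantile(0.75)
iqr = q3 - q1

low_bound = q1 - 1.5 * iqr
high_bound = q3 + 1.5 * iqr

outliers = home_data[(home_data['Price'] <= low_bound) | (home_data['Price'] >= high_bound)]
print(outliers.shape[0], 'possible outliers in Price')
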
############

"""
- Scikit Learn Steps:

/ Define: What type of model will it be? A decision tree? Some other type of model? Any other parameters of the model type are specified here too.

/ Fit: Capture patterns from the provided data. This is the heart of modeling.

/ Predict: Just what it sounds like.

/ Evaluate: Determine how accurate the model's predictions are.

Just a reminder: 'mean_absolute_error' is a loss function and it
measures how good the model's predictions are.
"""

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Defining the target (y) and the feature (X) variables
y = home_data.Price

feature_names = ['Lot', 'Build Year', 'Rooms']
X = home_data[feature_names]

# Splitting up the variables into train and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.3)

# Creating and training the model
home_model = DecisionTreeRegressor(random_state=1)
home_model.fit(train_X, train_y)

# Predicting and evaluating the model
predicted_values = home_model.predict(val_X)

print('First Five Predicted Values: ', predicted_values[0:5])
print('First Five Real Values: ', val_y.head().tolist())
print('\n')
print('Mean Absolute Error (MAE): ', mean_absolute_error(val_y, predicted_values))
print('Mean Squared Error (MSE): ', mean_squared_error(val_y, predicted_values))

########

"""
Overfitting VS Underfitting

- Overfitting: the model makes good predictions in the training phase,
but poor predictions in the validation phase.
When using DecisionTreeRegressor, the model has a bunch of
leaves and a high depth in most cases.

- Underfitting: the model makes poor predictions in both the training
and validation phases.
When using DecisionTreeRegressor, the model has few leaves
and a low depth in most cases.

To avoid both, we can use the 'max_leaf_nodes' parameter of the
DecisionTreeRegressor constructor and run some tests with different
values for this parameter.
"""

def get_mae(max_leaf, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf, random_state=1)

    model.fit(train_X, train_y)
    predicted_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, predicted_val)

    return mae

for i in [5, 50, 500, 5000]:
    print(i)
    print(get_mae(i, train_X, val_X, train_y, val_y))
    print('\n')

# After running the test above, you will know which tree size
# (max_leaf_nodes) gives the lowest MAE.
#
# Once that value is chosen, you no longer need to split the data
# into training and validation sets, so you can retrain the model
# using all the data as training data.

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)
model.fit(X, y)

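# A minimal sketch of picking the best tree size programmatically instead
# of hard-coding it; reuses get_mae and the candidate list above, while
# 'best_tree_size' and 'final_model' are illustrative names.
candidate_sizes = [5, 50, 500, 5000]
scores = {size: get_mae(size, train_X, val_X, train_y, val_y) for size in candidate_sizes}
best_tree_size = min(scores, key=scores.get)

final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)
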
#############

"""
A good alternative to the Decision Tree is the Random Forest
Regressor. This algorithm builds a bunch of Decision Trees on
different random samples of the data and returns the average of
their predictions.

This technique helps a lot to minimize overfitting and
underfitting when compared to a single Decision Tree.
"""

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

predicted_values = rf_model.predict(val_X)
mae = mean_absolute_error(val_y, predicted_values)

print("Random Forest Regressor's MAE: ", mae)

###########

# Random Forest Regressor accepts some parameters
# that can help you to get better results
#
# n_estimators >> number of Decision Trees
# min_samples_split >> minimum number of samples required to split an internal node
# max_depth >> maximum depth (number of levels) of each Decision Tree
# criterion >> lets you specify the loss function used to measure split quality
# random_state >> always get the same result when running the model again

model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

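# A minimal sketch of comparing the five models above on the validation
# split and keeping the one with the lowest MAE; 'best_model' and
# 'best_mae' are illustrative names, not from the original file.
best_model, best_mae = None, float('inf')

for i, candidate in enumerate(models, start=1):
    candidate.fit(train_X, train_y)
    current_mae = mean_absolute_error(val_y, candidate.predict(val_X))
    print('Model', i, 'MAE:', current_mae)
    if current_mae < best_mae:
        best_model, best_mae = candidate, current_mae
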
##########

# Saving the Model

import pickle

model_filename = 'best_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(model_1, file)

# Loading the Saved Model
with open(model_filename, 'rb') as file:
    pickle_model = pickle.load(file)
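
# A quick sanity check that the reloaded object is usable (a sketch; it
# assumes model_1 was fitted before being pickled above).
print('Loaded model MAE:', mean_absolute_error(val_y, pickle_model.predict(val_X)))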
(binary file added, 81.2 KB)
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
"""
* Always split the data into train and validation sets before
doing the preprocessing (a sketch of this split is shown right
below this docstring) *

- Treating Missing Values:

1 - Drop the Columns or Rows with missing values

/ This technique is usually not a good choice because the dataset
loses information that could be very useful for training
the model.
"""

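# A minimal sketch of how df_train and df_val (used throughout this file)
# could be created; 'df' and the train_test_split parameters are
# illustrative assumptions, not part of the original file.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('filepath')
df_train, df_val = train_test_split(df, random_state=1, test_size=0.3)
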
# Dropping Columns #

cols_missing_values = [col for col in df_train.columns
                       if df_train[col].isnull().any()]

df_train.drop(cols_missing_values, axis=1, inplace=True)
df_val.drop(cols_missing_values, axis=1, inplace=True)

# Dropping Rows #

df_train.dropna(inplace=True)
df_val.dropna(inplace=True)

"""
2 - Imputation

/ This method is one of the best because the missing values
are replaced by the mean of the column (SimpleImputer's
default strategy).

/ Notice that, because we use the mean, this method is only valid
for numeric variables or categorical variables that have already
been encoded as numbers.
"""

from sklearn.impute import SimpleImputer

imputer = SimpleImputer()

imputed_df_train = pd.DataFrame(imputer.fit_transform(df_train))
imputed_df_val = pd.DataFrame(imputer.transform(df_val))

# Imputation removes the column names, so we have to get
# them back

imputed_df_train.columns = df_train.columns
imputed_df_val.columns = df_val.columns

"""
3 - Extended Imputation

/ the missing values are replaced by the mean of the column,
and a new column is added to the dataframe for each column
containing missing values.

/ these new columns store boolean values informing whether
the row has been imputed (True) or hasn't (False)
"""

# Copying the datasets
df_train_plus = df_train.copy()
df_val_plus = df_val.copy()

# Getting columns with missing values
cols_missing_values = [col for col in df_train.columns
                       if df_train[col].isnull().any()]

# Adding new columns to indicate imputation
# and setting the values
for col in cols_missing_values:
    df_train_plus[col + '_was_missing'] = df_train_plus[col].isnull()
    df_val_plus[col + '_was_missing'] = df_val_plus[col].isnull()

# Making the imputation
imputer = SimpleImputer()

# save the column names first, because the imputation drops them
# (the '_plus' frames also contain the new '_was_missing' columns)
plus_columns = df_train_plus.columns

df_train_plus = pd.DataFrame(imputer.fit_transform(df_train_plus))
df_val_plus = pd.DataFrame(imputer.transform(df_val_plus))

# Getting the column names back
df_train_plus.columns = plus_columns
df_val_plus.columns = plus_columns
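
# A minimal sketch of comparing the imputation approaches above. It
# assumes the frames contain only numeric feature columns and that
# train_y / val_y target Series exist; those names, 'score_dataset' and
# the Random Forest settings are illustrative, not from the original file.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_val, y_train, y_val):
    # fit a small Random Forest and return the validation MAE
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

print('Imputation MAE:', score_dataset(imputed_df_train, imputed_df_val, train_y, val_y))
print('Extended Imputation MAE:', score_dataset(df_train_plus, df_val_plus, train_y, val_y))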
