1+ """
2+ - Exploratory Analysis Steps:
3+
4+ / read datas with pandas
5+ / get the shape
6+ / get the describe method
7+ / split the datas into target (y) and features (X)
8+ / split the target and features variables into train and validation
9+ / treat missing values
10+ / check out the columns types and convert them if needed
11+ / convert categorical variables to numeric ones
12+ / check out for relationships and correlations (plots with matplotlib and seaborn)
13+
14+ """

import pandas as pd

home_data = pd.read_csv('filepath')
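
# The steps above mention converting column types, treating missing
# values and encoding categoricals, but they are not shown later in the
# script. A minimal sketch, assuming 'Price' is the target and that
# 'Build Year' may have been read in as text:

# convert a column stored as text to a numeric dtype
home_data['Build Year'] = pd.to_numeric(home_data['Build Year'], errors='coerce')

# drop rows missing the target, then fill numeric gaps with the median
home_data = home_data.dropna(subset=['Price'])
numeric_cols = home_data.select_dtypes(include='number').columns
home_data[numeric_cols] = home_data[numeric_cols].fillna(home_data[numeric_cols].median())

# one-hot encode the remaining categorical (object) columns
home_data = pd.get_dummies(home_data)

# correlation heatmap for a quick look at relationships
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(home_data.corr(numeric_only=True))
plt.show()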


# \ count >> number of rows with NON-MISSING values
# \ mean >> average value of the column
# \ std >> standard deviation (how spread out the data are)
# \ min, 25%, 50%, 75% and max >> minimum, quartiles and maximum
#
# \ low outlier <= Q1 - (1.5 * IQR), where IQR = Q3 - Q1
# \ high outlier >= Q3 + (1.5 * IQR)

home_data.describe()
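
# A quick sketch of the outlier rule above, applied to the 'Price' column:
q1 = home_data['Price'].quantile(0.25)
q3 = home_data['Price'].quantile(0.75)
iqr = q3 - q1
low_bound = q1 - 1.5 * iqr
high_bound = q3 + 1.5 * iqr
outliers = home_data[(home_data['Price'] <= low_bound) | (home_data['Price'] >= high_bound)]
print('Number of Price outliers:', len(outliers))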

############

32+ """
33+ - Scikit Learn Steps:
34+
35+ / Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
36+
37+ / Fit: Capture patterns from provided data. This is the heart of modeling.
38+
39+ / Predict: Just what it sounds like
40+
41+ / Evaluate: Determine how accurate the model's predictions are.
42+
43+ Just a reminder: 'mean_absolute_error' is a loss function and it
44+ measures how good the model is
45+ """

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Defining the target (y) and the feature (X) variables
y = home_data.Price

feature_names = ['Lot', 'Build Year', 'Rooms']
X = home_data[feature_names]

# Splitting up the variables into train and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.3)

# Creating and training the model
home_model = DecisionTreeRegressor(random_state=1)
home_model.fit(train_X, train_y)

# Predicting and evaluating the model
predicted_values = home_model.predict(val_X)

print('First Five Predicted Values:', predicted_values[0:5])
print('First Five Real Values:', val_y.head().tolist())
print('\n')
print('Mean Absolute Error (MAE):', mean_absolute_error(val_y, predicted_values))
print('Mean Squared Error (MSE):', mean_squared_error(val_y, predicted_values))

########

75+ """
76+ Overfitting VS Underfitting
77+
78+ - Overfitting: the model makes good predictions in the training phase,
79+ but makes poor predictions in the validation phase.
80+ When using DecisionTreeRegressor, the model has a bunch of
81+ leaves and a high depth in most cases.
82+
83+ - Underfitting: the model makes poor prediction in both training
84+ and validation phases.
85+ When using DecisionTreeRegressor, the model has a few leaves
86+ and a low depth in most cases.
87+
88+ To avoid this, we can use the 'max_leaf_nodes' parameter in the
89+ Decision Tree constructor and get some tests with different values
90+ for this parameter.
91+ """

def get_mae(max_leaf, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf, random_state=1)

    model.fit(train_X, train_y)
    predicted_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, predicted_val)

    return mae

for i in [5, 50, 500, 5000]:
    print(i)
    print(get_mae(i, train_X, val_X, train_y, val_y))
    print('\n')
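
# Instead of eyeballing the printout above, the best size can be picked
# programmatically; a minimal sketch:
candidate_sizes = [5, 50, 500, 5000]
best_size = min(candidate_sizes, key=lambda n: get_mae(n, train_X, val_X, train_y, val_y))
print('Best max_leaf_nodes:', best_size)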

# After running the test above, you will know which tree size
# (max_leaf_nodes) gives the lowest MAE.
#
# So there is no need to split the data into training and
# validation anymore, which means you can drop the validation
# set and treat all the data as training data.

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)
model.fit(X, y)


#############

123+ """
124+ A good alternative for the Decision Tree is the Random Forest
125+ Regressor. This algorithm creates a bunch of Decision Tree, testing
126+ with different numbers of leaves and returning the average result
127+ between the trees.
128+
129+ This techinique helps a lot to minimize the overfitting and the
130+ underfitting when compared to the Decision Tree
131+ """

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

predicted_values = rf_model.predict(val_X)
mae = mean_absolute_error(val_y, predicted_values)

print("Random Forest Regressor's MAE:", mae)

###########

# Random Forest Regressor accepts some parameters
# that can help you get better results
#
# n_estimators >> number of Decision Trees in the forest
# min_samples_split >> minimum number of samples required to split a node
# max_depth >> maximum depth of each Decision Tree
# criterion >> the loss function used to measure the quality of a split
# random_state >> always get the same result when running the model again
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]
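
# A minimal sketch for comparing the five configurations: fit each one
# on the training set and check its MAE on the validation set.
for i, m in enumerate(models, start=1):
    m.fit(train_X, train_y)
    print(f'model_{i} MAE:', mean_absolute_error(val_y, m.predict(val_X)))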

##########

# Saving the Model

import pickle

model_filename = 'best_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(model_1, file)


# Loading the Saved Model
with open(model_filename, 'rb') as file:
    pickle_model = pickle.load(file)
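
# The loaded model behaves exactly like the saved one; a quick sanity
# check on the validation set (assumes model_1 was fit before saving):
print('Loaded model MAE:', mean_absolute_error(val_y, pickle_model.predict(val_X)))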