# -*- coding: utf-8 -*-
"""Predicting-product-tier.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1j7WuWTKZHB8EnaX1yLyRUKzeqi5CdJ8W
**This example demonstrates the prediction of the product tier of cars sold on a website from the information contained in the columns of 'Items_Cars_Data.csv'.
The file 'Data_Description.csv' describes the columns.**
**The notebook was created by Randa Natras: randa.natras@hotmail.com**
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
"""# **Load data**
The data will be loaded into a pandas DataFrame, which is a convenient structure for manipulating and displaying the data. Date columns are parsed as datetime.
"""
df = pd.read_csv('Items_Cars_Data.csv', delimiter=';', parse_dates=['created_date', 'deleted_date'], dayfirst=True)
description = pd.read_csv('Data_Description.csv', delimiter=';')
description
"""# **Examine data**
It's good practice to inspect the data first to get a better understanding of it.
"""
df
#Check the data type of each column
df.dtypes
# See the column data types and non-missing values
df.info()
"""From this first check, the data frame has 11 columns and 38862 rows.
Three columns are of object type: product_tier, make_name and ctr, where ctr was calculated from two numeric columns, as described in the file "Data_Description.csv".
Seven columns are numerical (int64 and float64), and two are of type datetime64.
"""
#check if there are missing values
df.isna().sum()
"""Columns 'search_views', 'detail_views' and 'ctr' contain missing values. The missing data need to be handled, as many machine learning algorithms do not support missing values. Let's first check where the missing values are."""
np.where(df['search_views'].isnull() == True)[0]
np.where(df['detail_views'].isnull() == True)[0]
np.where(df['ctr'].isnull() == True)[0]
"""There are 10 missing values in search_views and detail_views at the same locations, and 24 missing values in ctr, 10 of which overlap with the locations of the missing values in the previously mentioned columns, i.e. locations [10151, 21423, 27830, 47864, 57347, 60752, 63571, 65121, 66492, 72388]. Let's now check the first 10 rows with missing values in 'ctr'."""
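"""A sketch of the overlap check above on a toy frame (hypothetical values, not the real CSV): `np.where` gives the NaN row positions per column, and `np.intersect1d` finds the shared locations."""

```python
import numpy as np
import pandas as pd

# Toy frame: NaNs overlap in rows 1 and 3, row 2 is missing only in ctr
toy = pd.DataFrame({
    "search_views": [1.0, np.nan, 3.0, np.nan],
    "ctr":          [0.5, np.nan, np.nan, np.nan],
})

search_nan = np.where(toy["search_views"].isna())[0]
ctr_nan = np.where(toy["ctr"].isna())[0]
overlap = np.intersect1d(search_nan, ctr_nan)
print(overlap)  # rows missing in both columns: [1 3]
```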
display(df[6738:6739])
display(df[10151:10152])
display(df[19983:19984])
display(df[21423:21424])
display(df[26122:26123])
display(df[27830:27831])
display(df[28823:28824])
display(df[38923:38924])
display(df[43222:43223])
display(df[47864:47865])
"""There is a clear overlap of missing values across all 3 columns (search_views, detail_views, ctr), but the ctr column has additional missing values. Because the missing values in search_views and detail_views overlap with those in ctr, or because search_views and detail_views are 0, we cannot use search_views and detail_views to impute ctr here, or vice versa. Some rows have 0 search_views and 0 detail_views where stock_days is 0 or negative. This may mean the article was deleted shortly after it was created. I assume these rows are not important or meaningful for prediction, since the article was quickly deleted without being viewed, and there are only a few such cases.
Imputing values for search_views and detail_views is challenging, as they depend on the number of users, searches and clicks. Therefore, imputation methods such as forward or backward filling are not suitable here. Another possible approach would be to predict these values from the other columns/features; however, those predicted values would only be a proxy for the true values.
The dataset has 78270 rows in total. Deleting 24 rows should not have much effect on the result compared to the length of the complete dataset. Therefore, the rows with missing values are removed in the next step.
"""
#deleting rows with missing values
df.dropna(inplace = True)
df.reset_index(drop=True, inplace=True)
df.isna().sum()
"""Moreover, when checking the data, an erroneous value in ctr is found in row 849. Therefore it is better to recalculate the column."""
# 27.624.309.392.265.100 is an erroneous value; it should be 0.27624309392265100
df[849:850]
# Because of such problems, recalculate the 'ctr' column as the quotient of detail_views over search_views
df['ctr']=df['detail_views'] / df['search_views']
df
"""Looking at the previous df table, it can be seen that some values in the ctr column differ slightly, e.g. rows 1, 78291, 78294. So it was good to recalculate the ctr column.
## **Deriving new features**
We can create new features from time information, for example, the age of cars.
"""
# Extract the creation year and derive the car age
df['created_year'] = pd.DatetimeIndex(df['created_date']).year
df['age'] = df['created_year'] - df['first_registration_year']
df.head()
"""# **Data exploratory analysis**
Now I will perform exploratory data analysis in order to better understand the data, discover patterns, spot anomalies and check assumptions with the help of statistical summaries and graphical representations.
Statistical summary
"""
#Descriptive statistics
df.describe()
"""Numerical data: the distributions of search_views and detail_views look skewed, as mean and median (50%) are not close. Also, the maximum value of first_registration_year and the minimum value of age look anomalous, which will be checked later with plots."""
categ = ['product_tier','make_name']
df[categ].describe()
"""Categorical data: The data in product_tier has 3 unique values with the most common value being "Basic". The data in make_name has 91 unique values with the most common value being "Volkswagen".
Let's plot column product_tier.
"""
sns.countplot(x="product_tier", data = df)
plt.rcParams ['figure.figsize'] = [10,4]
plt.rcParams.update({'font.size': 14})
"""The majority of products are Basic tier, while Premium and Plus tiers are in the minority. A strong imbalance between the classes can be observed. This can pose a difficulty for a learning algorithm, leading to a model biased towards the majority class. With imbalanced data, a machine learning model can achieve high accuracy by predicting only the majority class while not capturing the minority classes. Therefore, it is important to consider how to deal with an imbalanced dataset. Possible solutions include:
1. selecting appropriate learning algorithms (e.g., tree-based algorithms and boosting may be more suitable),
2. training on undersampled or oversampled datasets,
3. generating synthetic data for the minority classes,
4. using cost-sensitive solutions that adjust the penalty according to the importance assigned to the minority classes, penalizing errors on them more heavily.
It is also important to choose an appropriate performance metric for evaluating a model on an imbalanced dataset (e.g., confusion matrix, precision, recall, F1 score).
"""
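"""A minimal sketch of one of the sampling options mentioned above, random oversampling of a minority class, using `sklearn.utils.resample` on toy labels (the frame and counts are hypothetical):"""

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 Basic vs 2 Plus rows
toy = pd.DataFrame({"x": range(10), "tier": ["Basic"] * 8 + ["Plus"] * 2})
majority = toy[toy["tier"] == "Basic"]
minority = toy[toy["tier"] == "Plus"]

# Upsample the minority class to the majority size (sampling with replacement)
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["tier"].value_counts())  # both classes now have 8 rows
```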
sns.countplot(x="make_name", data = df)
plt.rcParams ['figure.figsize'] = [25.0, 5]
plt.xticks(rotation=90)
plt.show()
"""The above diagram shows which cars are included in the data set. It can be seen that most of the cars are Volkswagen, Renault, Peugeot, Opel, Ford, Mercedes-Benz and BMW. On the right side of the diagram there are many other brands with a small number of products listed."""
sns.countplot(data = df, x="first_registration_year")
plt.xticks(rotation=90)
plt.show()
"""We see registration years from 1924 onwards, with a significant increase in the amount of data after 2000. Most items have a registration year between 2010 and 2018. There is also a first_registration_year value of "2106", which is an anomaly. I suspect it should be 2016, but it might be 2006; this would need to be double-checked. Since I cannot verify it, I will remove this single row. First, let's look at the problematic row."""
anomaly_index = df[df['first_registration_year'] == 2106].index
print(anomaly_index)
df[36295:36296]
# Remove the row with the outlier
df.drop([36295], inplace=True)
df.reset_index(drop=True, inplace=True)
sns.countplot(data = df, x="age")
plt.rcParams ['figure.figsize'] = [25.0, 5]
plt.xticks(rotation=90)
plt.show()
"""The age goes up to 94, but the majority of cars are between 0 and 5 years old, which accounts for the bulk of the cars sold on the website. Also, there are values of -1 and -2, which should be checked."""
age1_index = df[df['age'] == -1].index
age2_index = df[df['age'] == -2].index
print(age1_index)
print(age2_index)
display(df[12717:12718])
display(df[35352:35353])
display(df[64665:64666])
df[77639:77640]
"""There are cars that were registered only after they were removed from the website, i.e. after they were sold."""
sns.countplot(data = df, x='first_zip_digit')
plt.rcParams ['figure.figsize'] = [20.0, 5]
"""Regarding the first digit of the zip code of the region in which a product is offered, most items are offered in regions 3 and 5, while the fewest are offered in region 9."""
numeric=['price', 'first_zip_digit', 'first_registration_year', 'age', 'search_views', 'detail_views', 'stock_days', 'ctr']
features = numeric
plt.figure(figsize=(18, 4))
for i in range(0, len(features)):
plt.subplot(1, 8, i+1)
sns.boxplot(y=df[features[i]], color='purple',orient='v')
plt.tight_layout();
"""The boxplots show that the price, first_registration_year, age, search_views, detail_views and ctr features have many outliers. A logarithmic (log) transformation can address the skewed data effectively.
Several data columns are right-skewed, except first_registration_year, which is left-skewed.
"""
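"""The effect of a log transform on skew can be sketched with synthetic right-skewed data (a lognormal sample standing in for view counts):"""

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
views = rng.lognormal(mean=3, sigma=1, size=10_000)  # right-skewed, like view counts

# log10(x + 1) compresses the long right tail, pulling the sample skew towards 0
print(round(skew(views), 2))
print(round(skew(np.log10(views + 1)), 2))
```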
#View data distribution
data_num = df[numeric]
k = len(data_num.columns)
n = 3
m = (k - 1) // n + 1
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
for i, (name, col) in enumerate(data_num.items()):  # iteritems() was removed in pandas 2.0
r, c = i // n, i % n
ax = axes[r, c]
col.hist(ax=ax, color='red')
ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name, color='black')
ax2.set_ylim(0)
fig.tight_layout();
# Since detail_views and ctr have minimum values of 0, log10(x+1) avoids -inf after transformation
# Since 'age' has a minimum value of -2, log10(x+3) avoids -inf after transformation
df_log=df.copy()
df_log['price'] = np.log10(df['price'])
df_log['first_registration_year'] = np.log10(df['first_registration_year'])
df_log['age'] = np.log10(df['age']+3)
df_log['search_views'] = np.log10(df['search_views']+1)
df_log['detail_views'] = np.log10(df['detail_views']+1)
df_log['ctr'] = np.log10(df['ctr']+1)
# Features after logarithmic transformation
features = numeric
plt.figure(figsize=(18, 4))
for i in range(0, len(features)):
plt.subplot(1, 8, i+1)
sns.boxplot(y=df_log[features[i]], color='purple',orient='v')
plt.tight_layout();
"""After the log transformation, the distributions of price, age, search_views and detail_views correspond better to a normal distribution, while ctr and first_registration_year still have many outliers."""
#View data distribution with log transformation
data_num = df_log[numeric]
k = len(data_num.columns)
n = 3
m = (k - 1) // n + 1
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
for i, (name, col) in enumerate(data_num.items()):  # iteritems() was removed in pandas 2.0
r, c = i // n, i % n
ax = axes[r, c]
col.hist(ax=ax, color='red')
ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name, color='black')
ax2.set_ylim(0)
fig.tight_layout();
"""Now we can create some scatter plots to see the relationships between the numeric variables in relation to the product tier categories."""
#create a pairplot graph from each numeric data
plt.figure(figsize=(12,5))
sns.pairplot(data=df,
x_vars=['price','first_zip_digit','first_registration_year', 'age', 'detail_views','ctr', 'stock_days'],
y_vars=['search_views'],
hue ='product_tier',
height=5, aspect=0.5);
fig.tight_layout();
#create a pairplot graph from each numeric data
plt.figure(figsize=(12,5))
sns.pairplot(data=df_log,
x_vars=['price','first_zip_digit','first_registration_year', 'age', 'detail_views','ctr', 'stock_days'],
y_vars=['search_views'],
hue ='product_tier',
height=5, aspect=0.5);
fig.tight_layout();
"""The Basic product level has more searches, regardless of price or days in stock. """
df_plot = df.drop(['article_id', 'make_name', 'created_date', 'deleted_date', 'created_year'], axis=1)
df_log_plot = df_log.drop(['article_id', 'make_name', 'created_date', 'deleted_date', 'created_year'], axis=1)
sns.pairplot(df_plot, hue ='product_tier', palette="tab10")
sns.pairplot(df_log_plot, hue ='product_tier', palette="tab10")
"""Patterns between detail_views and first_zip_digit can be more clearly visible in the graph below."""
sns.catplot(data=df, y='detail_views', x='first_zip_digit', hue='product_tier')
sns.catplot(data=df, y='search_views', x='first_zip_digit', hue='product_tier')
sns.catplot(data=df, y='detail_views', x='first_registration_year', hue='product_tier', height=5, aspect=4)
plt.xticks(rotation=90)
plt.show()
sns.catplot(data=df, y='detail_views', x='age', hue='product_tier', height=5, aspect=4)
plt.xticks(rotation=90)
plt.show()
"""The graph above clearly shows that more detail_views go to articles with a registration_year from 2000 onwards, peaking in 2014 and 2015, i.e. cars registered 5-6 years before the last article offered in 2020."""
sns.catplot(data=df, y='product_tier', x='first_registration_year')
"""Basic articles have registration years from 1925, while Premium and Plus articles have newer registration years, with the majority from 2000 onwards."""
sns.catplot( data = df, x='make_name', hue='product_tier', kind='count', height=7, aspect=3)
plt.xticks(rotation=90)
plt.show()
sns.catplot( data = df, x='make_name', y='price', hue='product_tier', kind='box', height=7, aspect=4)
plt.xticks(rotation=90)
plt.show()
"""The graph above shows that Premium and Plus articles sometimes have a higher price than Basic articles, for example Tesla, Land Rover, Porsche. The graph below shows that overall prices do not differ much between product tiers for the majority of articles. All three product tiers have outliers, the largest for the Basic products and the smallest for the Plus articles."""
sns.catplot( data = df, x='price', y='product_tier', kind='box', height=5, aspect=2)
plt.xticks(rotation=90)
plt.show()
sns.catplot( data = df, x='make_name', y='detail_views', hue='product_tier', kind='box', height=7, aspect=4)
plt.xticks(rotation=90)
plt.show()
"""The graph above shows that Premium articles mostly have more detail_views, while Basic articles have larger outliers. This is also confirmed by the plot below, which shows larger median and maximum detail_views for the Premium and Plus articles."""
sns.catplot( data = df, x='detail_views', y='product_tier', kind='box', height=5, aspect=4)
plt.xticks(rotation=90)
plt.show()
sns.catplot( data = df, x='make_name', hue='product_tier', y='stock_days', kind='box', height=7, aspect=3)
plt.xticks(rotation=90)
plt.show()
"""The graph above shows that Plus articles generally have larger stock_days than Basic articles. However, this also depends on make_name; the Basic products on the right side mostly have larger stock_days. A similar pattern can be seen in the box plot below, with a median of about 20 stock days for Premium and about 30 for Plus articles."""
sns.catplot( data = df, x='stock_days', y='product_tier', kind='box', height=5, aspect=2)
plt.xticks(rotation=90)
plt.show()
"""Based on the exploratory data analysis, the following conclusions can be drawn:
1. Most of the data is skewed. A log transformation can be useful to address the skew effectively.
2. product_tier is highly imbalanced, with Basic as the majority class and Premium and Plus as minority classes. It is important to consider how to deal with the imbalance, for example: appropriate learning algorithms, data sampling, cost-sensitive solutions, appropriate performance metrics.
3. In the feature make_name, the majority class is Volkswagen.
4. Prediction of product_tier: the analysis shows some relations between the target product_tier and features such as first_registration_year / age, detail_views, stock_days and price.
# **Predicting Product tier**
## **Preparing data for training**
"""
# Encoding categorical features. I chose label encoding because the number of categories in make_name is quite large and the categories in product_tier might be ordinal
from sklearn.preprocessing import LabelEncoder
# Create an instance of LabelEncoder.
label_encoder = LabelEncoder()
# Encode labels in columns 'product_tier' and 'make_name'.
df['product_tier']= label_encoder.fit_transform(df['product_tier'])
df['make_name']= label_encoder.fit_transform(df['make_name'])
df_log['product_tier']= label_encoder.fit_transform(df_log['product_tier'])
df_log['make_name']= label_encoder.fit_transform(df_log['make_name'])
"""The features created_date, deleted_date and created_year will be dropped, because they are not useful for the prediction task."""
df = df.drop(['created_year', 'created_date', 'deleted_date'], axis=1)
df.head()
df_log.head()
c = np.round(df.corr(), 2)
plt.figure(figsize=(12,12))
sns.heatmap(c, annot=True, vmin=-1, vmax=1, cmap='coolwarm', square=True)
plt.rcParams.update({'font.size':12})
#containing log features
c = np.round(df_log.corr(), 2)
plt.figure(figsize=(12,12))
sns.heatmap(c, annot=True, vmin=-1, vmax=1, cmap='coolwarm', square=True)
plt.rcParams.update({'font.size':12})
"""The features with the highest correlation to the target product_tier are search_views and detail_views, at about 0.2-0.3, indicating weak positive linear relationships. The other features show no linear relationships. However, the exploratory analysis revealed patterns for other features (first_registration_year, stock_days, price) that could reflect nonlinear relationships.
The features first_registration_year and age are collinear, as age is derived from the registration year, so only one of them should be used. Since age is a more general characteristic and better fits a normal distribution after the log transformation, first_registration_year is omitted in the next step. The feature 'article_id' will also be omitted.
"""
#split data into X and y
X = df.loc[:, ['make_name', 'price', 'first_zip_digit', 'age','search_views', 'detail_views', 'ctr', 'stock_days']]
y = df['product_tier']
X_log = df_log.loc[:, ['make_name', 'price', 'first_zip_digit', 'age','search_views', 'detail_views', 'ctr', 'stock_days']]
y_log = df_log['product_tier']
# storing column names in cols
cols = X.columns
X.head()
X_log.head()
# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size = 0.3, random_state=100)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, train_size=0.7, test_size=0.3, random_state=100)
from collections import Counter
Counter(y_train)
Counter(y_train_log)
Counter(y_test)
Counter(y_test_log)
"""## **Model Building, Training and Evaluation**
### **Dummy baseline**
I will first create a dummy baseline, representing a classifier that always predicts the majority class Basic. For scoring, I use accuracy, balanced accuracy (defined as the average of the accuracies obtained on each class individually) and F1 scores for multi-class classification. Macro F1 computes the F1 score for each class individually and then averages them, which is well suited to imbalanced cases. Micro F1 computes the F1 score from the aggregated contributions of all classes.
"""
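"""The difference between the two averages can be sketched with toy labels (hypothetical, mimicking the product_tier imbalance): for single-label multi-class data, micro F1 equals plain accuracy, while macro F1 punishes missing the minority classes."""

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]  # imbalanced, like product_tier
y_pred = [0] * 10                        # majority-class-only predictions

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
print(acc, micro, macro)  # micro equals accuracy; macro is much lower
```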
from sklearn.dummy import DummyClassifier
# define model
dummy_clf = DummyClassifier(strategy='most_frequent')
# define scoring
scoring = ['accuracy', 'balanced_accuracy', 'f1_micro', 'f1_macro']
# evaluate model
cv_result = cross_validate(dummy_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score of a dummy classifier: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score of a dummy classifier: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score of a dummy classifier: {cv_result['test_f1_macro'].mean():.3f}")
index = []
scores = {"Accuracy": [], "Balanced accuracy": [], "F1 macro score": []}
index += ['Dummy classifier']
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())
scores["F1 macro score"].append(cv_result["test_f1_macro"].mean())
df_scores = pd.DataFrame(scores, index=index)
df_scores
"""The dummy classifier achieves a high accuracy of 0.96, because the majority of the data belongs to the Basic class. However, the dummy classifier never predicts the other classes. This shows that plain accuracy is misleading on an imbalanced dataset. In contrast, the balanced accuracy and F1 macro score show poor performance at just 0.33.
### **Logistic Regression**
No log-transformed features
"""
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# define model
lr_clf = make_pipeline(StandardScaler(), LogisticRegression())
# evaluate model
cv_result = cross_validate(lr_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Logistic regression"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])  # DataFrame.append was removed in pandas 2.0
df_scores
"""Logistic regression achieves a slight improvement over the dummy baseline of ~0.04 in balanced accuracy and ~0.07 in F1 macro score.
Let's now try logistic regression with balanced class weights. Here, the weights of the majority and minority classes are modified so that the classes are assigned weights inversely proportional to their respective frequencies.
"""
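"""The 'balanced' weighting follows n_samples / (n_classes * class_count), which is what `class_weight='balanced'` computes in scikit-learn; a sketch with hypothetical class counts standing in for Basic/Plus/Premium:"""

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 96 + [1] * 3 + [2] * 1)  # hypothetical counts
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1, 2]), y=y_toy)
# Rare classes receive much larger weights: 100/(3*96), 100/(3*3), 100/(3*1)
print(np.round(weights, 3))
```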
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
# define model
lrbw_clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced'))
# evaluate model
cv_result = cross_validate(lrbw_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Logistic regression, balanced class weights"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])
df_scores
"""Logistic regression with balanced weights results in a more substantial improvement in balanced accuracy, with a score of 0.56. The F1 macro score, which accounts for both precision and recall, improves less, reaching 0.42.
### **Logistic Regression with log-features**
with log-transformed features
"""
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# define model
lr_clf = make_pipeline(StandardScaler(), LogisticRegression())
# evaluate model
cv_result = cross_validate(lr_clf, X_train_log, y_train_log, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Logistic regression log"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])
df_scores
"""Logistic regression with log-transformed features improves the balanced accuracy and F1 macro score by about 0.07 over logistic regression without them.
Let's now try logistic regression with balanced class weights again.
"""
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
# define model
lrbw_clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced'))
# evaluate model
cv_result = cross_validate(lrbw_clf, X_train_log, y_train_log, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Logistic regression log, balanced class weights"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])
df_scores
"""Logistic regression with balanced weights and log features performed worse in accuracy and F1 macro score than the version with balanced weights and non-log features, while the balanced accuracy improved slightly (~0.01).
### **Random Forest**
Next, I will try a tree-ensemble classifier. With this type of classifier, scaling the numerical data is not needed. Random forest is a proven, powerful classification algorithm: it is computationally efficient, fast to tune, and supports multi-class classification. Also, with a tree-based algorithm, the contribution of each feature to the prediction can easily be estimated, providing insight into what the model has learned and which predictors are important. The dataset is not large enough to require deep learning, so the problem can be addressed with classical machine learning algorithms.
"""
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
# evaluate model
cv_result = cross_validate(rf_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Random forest"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])
df_scores
"""Random Forest improves the scores compared to logistic regression and logistic regression with balanced class weights. However, the logistic regression with balanced class weights still has higher balanced accuracy. We will see if Random Forest with balanced class weights can provide further improvements."""
rf_clf_2 = RandomForestClassifier(class_weight='balanced')
# evaluate model
cv_result = cross_validate(rf_clf_2, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Random forest, balanced class weights"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])  # DataFrame.append was removed in pandas 2.0
df_scores
"""Results show that Random forest with balanced class weights caused a slight degradation of results compared to the "normal" Random forest model. Let's now try Balanced Random forest classifier, which randomly under-samples each boostrap sample to balance it."""
# Random forest applying random under-sampling to balance each bootstrap sample
from imblearn.ensemble import BalancedRandomForestClassifier
rfb_clf = BalancedRandomForestClassifier()
# evaluate model
cv_result = cross_validate(rfb_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Balanced Random forest"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])  # DataFrame.append was removed in pandas 2.0
df_scores
"""Balanced Random Forest achieves the highest balanced accuracy of ~0.7 and a F1 score of 0.5. The overall accuracy has also decreased to ~0.8. However, compared to the logistic regression with balanced class weights, there is an improvement in balanced accuracy, but the overall accuracy and F1 macro score are slightly lower.
Let's now try the Balanced Random forest with log features.
"""
# Random forest applying random under-sampling to balance each bootstrap sample
from imblearn.ensemble import BalancedRandomForestClassifier
rfb_clf_log = BalancedRandomForestClassifier()
# evaluate model
cv_result = cross_validate(rfb_clf_log, X_train_log, y_train_log, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Balanced Random forest log"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])  # DataFrame.append was removed in pandas 2.0
df_scores
"""There is not much difference between the last two models. The Balanced Random Forest performs superior than other algortithm taking into account all balanced accuracy and F1 macro score. Therefore, the balanced Random forest with non-logarithmic features will be used for fine-tunning.
### Hyperparameters tunning
Now I will do some simple fine-tuning for the Balanced Random forest to try to further improve its performance.
"""
#parameters tuning
max_depth = [5, 10, 15, 20, None]
for count in max_depth:
cv_result = cross_validate(BalancedRandomForestClassifier(max_depth = count), X_train, y_train, cv=5, scoring=scoring)
print(f'For max depth: {count}')
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
#parameters tuning
estimators = [100, 200, 300, 400, 500, 750, 1000]
for count in estimators:
cv_result = cross_validate(BalancedRandomForestClassifier(n_estimators = count), X_train, y_train, cv=5, scoring=scoring)
print(f'For estimators: {count}')
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
#parameters tuning
estimators = [1500, 2000, 2500, 3000, 5000]
for count in estimators:
cv_result = cross_validate(BalancedRandomForestClassifier(n_estimators = count), X_train, y_train, cv=5, scoring=scoring)
print(f'For estimators: {count}')
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
# Random forest applying random under-sampling to balance each bootstrap sample
from imblearn.ensemble import BalancedRandomForestClassifier
rfb_clf = BalancedRandomForestClassifier(n_estimators=1000)
# evaluate model
cv_result = cross_validate(rfb_clf, X_train, y_train, cv=5, scoring=scoring)
# summarize performance
print(f"Accuracy score: {cv_result['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy score: {cv_result['test_balanced_accuracy'].mean():.3f}")
print(f"F1 micro score: {cv_result['test_f1_micro'].mean():.3f}")
print(f"F1 macro score: {cv_result['test_f1_macro'].mean():.3f}")
index = ["Balanced Random forest tuned"]
scores["Accuracy"]=cv_result["test_accuracy"].mean()
scores["Balanced accuracy"]=cv_result["test_balanced_accuracy"].mean()
scores["F1 macro score"]=cv_result["test_f1_macro"].mean()
df2 = pd.DataFrame(scores, index=index)
df_scores = pd.concat([df_scores, df2])  # DataFrame.append was removed in pandas 2.0
df_scores
"""The tuned model offers slight improvements in accuracy, balanced accuracy and F1 macro rating."""
model_rfb_clf=rfb_clf.fit(X_train, y_train)
y_pred_rfb_train = model_rfb_clf.predict(X_train)
## Calculate the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, y_pred_rfb_train, labels=[0, 1, 2])
#Plotting the confusion matrix for Balanced Random forest
cm_rfb = confusion_matrix(y_train, y_pred_rfb_train)
cm_df_rfb = pd.DataFrame(cm_rfb, index = ['Basic','Plus','Premium'], columns = ['Basic','Plus','Premium'])
#Balanced Random forest tuned
plt.figure(figsize=(5,5))
sns.heatmap(cm_df_rfb, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
"""The confusion matrix for the Balanced Random forest model on training data has the highest number of true positives in each row. For minority class Plus are only true positives, while for class Premium about 20% of data are false predictions. Also, about 25% of Basic class data were predicted as Plus class on training dataset.
### Learning curves
Let's now plot learning curves for the final Balanced Random forest model.
"""
from sklearn.model_selection import learning_curve
def learning_curves(estimator, features, target, train_sizes, cv, scoring):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, features, target, train_sizes=train_sizes, cv=cv, scoring=scoring)
    train_scores_mean = train_scores.mean(axis=1)
    validation_scores_mean = validation_scores.mean(axis=1)
    plt.plot(train_sizes, train_scores_mean, label='Training accuracy')
    plt.plot(train_sizes, validation_scores_mean, label='Validation accuracy')
    plt.ylabel('Accuracy', fontsize=16)
    plt.xlabel('Training set size', fontsize=16)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.rcParams.update({'font.size': 16})
    plt.ylim(0.4, 1)
plt.figure(figsize = (16,5))
train_sizes = [1, 100, 500, 2000, 5000, 10000, 20000, 30000, 40000, 43845]
learning_curves(model_rfb_clf, X_train, y_train, train_sizes, 5, 'accuracy')
"""The learning curves above show high overall accuracy for both the training and validation datasets, which increases as the training set size increases, and after at a dataset size of ~30000 and more, the accuracies are almost the same and become constant."""
plt.figure(figsize = (16,5))
train_sizes = [1, 100, 500, 2000, 5000, 10000, 20000, 30000, 40000, 43845]
learning_curves(model_rfb_clf, X_train, y_train, train_sizes, 5, 'balanced_accuracy')
"""The learning curves for the balanced accuracy show a high training accuracy (>0.8) as soon as the data set size increases to 5000 and more. The validation curve increases from ~0.5 at beginning (dataset size of 1) to ~0.65 at a dataset size of 100000. After that, the line continues to rise steadily. The difference between the training curve and the validation curve is about 0.15 at the end.
### Impurity-based feature importance
Now I will plot impurity-based feature importance to check impact of features on the model output for Balanced Random Forest tuned model.
"""
features = cols
importances = model_rfb_clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in model_rfb_clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=features)
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
plt.rcParams['figure.figsize'] = [10, 6]
"""Plot shows the highest importance for input features such as search_views, detail_views, stock_days and ctr.
## **Final Model Evaluation (Test data)**
For final testing, I choose Balanced Random Forest tuned model.
"""
## Predict your test set on the trained model Balanced Random forest
y_pred_rfb = model_rfb_clf.predict(X_test)
"""### Accuracy / F1 score"""
#Balanced Random forest
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
print(f"Accuracy score: {accuracy_score(y_test, y_pred_rfb):.2f}")
print(f"Accuracy score balanced: {balanced_accuracy_score(y_test, y_pred_rfb):.2f}")
print(f"F1 micro score: {f1_score (y_test, y_pred_rfb, average='micro'):.2f}")
print(f"F1 macro score: {f1_score (y_test, y_pred_rfb, average='macro'):.2f}")
"""Test results show balanced accuracy score of 0.66, while the F1 macro score is 0.46.
### Precision/Recall
"""
#Balanced Random forest
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
print(f"Precision score macro: {precision_score(y_test, y_pred_rfb, average='macro'):.2f}")
print(f"Precision score micro: {precision_score(y_test, y_pred_rfb, average='micro'):.2f}")
print(f"Recall score macro: {recall_score(y_test, y_pred_rfb, average='macro'):.2f}")
print(f"Recall score micro: {recall_score(y_test, y_pred_rfb, average='micro'):.2f}")
"""Higher recall suggests having less false negative, i.e. a large fraction of truly positive elements are captured. Higher precision suggests that a model has less false positives, and thus providing more relevant results.
### Confusion matrix
"""
#Balanced Random forest
## Calculate the confusion matrix
confusion_matrix(y_test, y_pred_rfb, labels=[0, 1, 2])
"""Let's visualize the confusion matrices below."""
# Creating a confusion matrix, which compares y_test and y_pred
cm_rfb = confusion_matrix(y_test, y_pred_rfb)
# Creating a DataFrame from the array-formatted confusion matrix, so it is easy to plot
cm_df_rfb = pd.DataFrame(cm_rfb, index = ['Basic','Plus','Premium'], columns = ['Basic','Plus','Premium'])
#Plotting the confusion matrix for Balanced Random forest
#Balanced Random forest tuned
plt.figure(figsize=(5,5))
sns.heatmap(cm_df_rfb, annot=True, cmap= 'OrRd')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
"""The confusion matrix for the balanced random forest model for the test data set has the highest number of correct-positive predictions in each row. However, for the Basic and Premium classes, there are a large number of false predictions, e.g., Basic is predicted to be Plus or Premium is predicted to be Plus. The number of false predictions for the Plus category is significantly lower.
### Permutation feature importance
Permutation feature importance overcomes the limitations of impurity-based feature importance: it has no bias towards high-cardinality features and can be computed on a hold-out test set.
"""
# Balanced Random forest tuned
# Permutation feature importance on train dataset
from sklearn.inspection import permutation_importance
result = permutation_importance(model_rfb_clf, X_train, y_train, n_repeats=5, random_state=42, n_jobs=2)
forest_importances = pd.Series(result.importances_mean, index=features)
#Permutation feature importance on train dataset
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Permutation feature importance (train data)")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.rcParams.update({'font.size': 14})
plt.show()
# Balanced Random forest tuned
# Permutation feature importance on test dataset
result = permutation_importance(model_rfb_clf, X_test, y_test, n_repeats=5, random_state=42, n_jobs=2)
forest_importances = pd.Series(result.importances_mean, index=features)
#Permutation feature importance on test dataset
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Permutation feature importance (test data)")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.rcParams.update({'font.size': 14})
plt.show()
"""The permutation feature importance plot on the training and test datasets shows the highest importance of the feature 'stock_days'. Other important features are 'detail_views' and 'search_views'. It is interesting to see that the feature 'age' has almost no contribution in the test data, while it has a low contribution in the training data.
## **Conclusion**
* Product tier classes are highly imbalanced, making it difficult to predict the minority classes. Using techniques to balance the dataset/classes led to significant improvements in learning and generalization, resulting in higher balanced accuracy and a higher F1 macro score.
* Using balanced bootstrap sampling with Random forest led to better predictions for the minority classes. There are still false predictions that can be further addressed to potentially improve accuracy.
* The recall macro score is higher than precision, which means there are fewer false negatives, i.e., a large fraction of the true positives is captured, but the model produces more false positives.
* Final balanced accuracy is 0.70.
* The permutation analysis of feature importance shows that "stock_days", "search_views" and "detail_views" are the most important predictors for the final Balanced Random Forest model.
"""