@@ -10,21 +10,21 @@ is qualitative, and **regression**, where the output is quantitative.
1010When only two labels, or classes, are available, one speaks of
1111**binary classification**. A classical example thereof is labelling an
1212email as *spam* or *not spam*. When more classes are to be learnt, one
13- speaks of a ** multi-class problem** , such as annotation a new * Iris*
13+ speaks of a **multi-class problem**, such as annotating a new *Iris*
1414example as being from the *setosa*, *versicolor* or *virginica*
1515species. In these cases, the output is a single label (one of the
1616anticipated classes). If multiple labels may be assigned to each
17- examples , one speaks of ** multi-label classification** .
17+ example, one speaks of **multi-label classification**.
1818
1919## Preview
2020
2121To start this chapter, let's use a simple, but useful classification
22- algorithm, K nearest neighbours (kNN) to classify the * iris*
22+ algorithm, k-nearest neighbours (kNN), to classify the *iris*
2323flowers. We will use the `knn` function from the `r CRANpkg("class")`
2424package.
2525
26- K nearest neighbours works by directly measuring the (euclidean )
27- distance between observations and infer the class of unlabelled data
26+ K-nearest neighbours works by directly measuring the (Euclidean)
27+ distance between observations and inferring the class of unlabelled data
2828from the class of its nearest neighbours. In the figure below, the
2929unlabelled instances *1* and *2* will be assigned classes *c1* (blue)
3030and *c2* (red) as their closest neighbours are blue and red,
@@ -101,8 +101,8 @@ table(knnres, iris$Species[nw])
101101mean(knnres == iris$Species[nw])
102102```
103103
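For reference, the whole preview fits together as in the sketch below; the index vectors and their names (`tr`, `nw`) are assumptions and may differ from the actual code used above.

```{r, eval=FALSE}
library("class")
## A self-contained sketch of the preview: classify 50 random flowers
## using the labels of their nearest neighbours among 100 labelled ones
set.seed(1)
tr <- sample(150, 100)            ## labelled (training) observations
nw <- setdiff(seq_len(150), tr)   ## unlabelled observations to classify
knnres <- knn(iris[tr, -5], iris[nw, -5], iris$Species[tr])
mean(knnres == iris$Species[nw])  ## proportion of correct predictions
```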
104- We have omitted and important argument from ` knn ` , which is the
105- parameter * k* of the classifier. This value * k * defines how many
104+ We have omitted an important argument from `knn`, which is the
105+ parameter *k* of the classifier. This parameter defines how many
106106nearest neighbours will be considered to assign a class to a new
107107unlabelled observation. From the arguments of the function,
108108
@@ -157,16 +157,16 @@ that we need to consider:
157157
158158In supervised machine learning, we have a desired output and thus know
159159precisely what is to be computed. It thus becomes possible to directly
160- evaluate a model using a quantifiable and object metric. For
160+ evaluate a model using a quantifiable and objective metric. For
161161regression, we will use the **root mean squared error** (RMSE), which
162162is what linear regression (`lm` in R) seeks to minimise. For
163163classification, we will use **model prediction accuracy**.
164164
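For reference, the RMSE is the square root of the mean squared difference between the observed values $y_i$ and their predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$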
165165Typically, we won't want to calculate any of these metrics using
166166observations that were also used to build the model. This
167- approach, called ** in-sample error** lead to optimistic assessment of
167+ approach, called **in-sample error**, leads to an optimistic assessment of
168168our model. Indeed, the model has already *seen* these data upon
169- construction, and is does considered optimised the these observations
169+ construction, and is considered optimised for these observations
170170in particular; it is said to **over-fit** the data. We prefer to
171171calculate an **out-of-sample error**, on new data, to gain a better
172172idea of how the model performs on unseen data, and estimate how well
@@ -196,7 +196,7 @@ p <- predict(model, diamonds)
196196
197197> Challenge
198198>
199- > Calculate the root mean squares error for the prediction above
199+ > Calculate the root mean squared error for the prediction above
200200
201201<details >
202202``` {r}
@@ -208,9 +208,9 @@ rmse_in
208208</details >
209209
210210Let's now repeat the exercise above, but by calculating the
211- out-of-sample RMSE. We are prepare a 80/20 split of the data and use
212- 80% to fit our model predict the target variable (this is called the
213- ** training data** ), the price, on the 20% unseen data (the ** testing
211+ out-of-sample RMSE. We prepare an 80/20 split of the data: we use
212+ 80% (the **training data**) to fit our model, and predict the target
213+ variable, the price, on the remaining 20% of unseen data (the **testing
214214data**).
215215
216216> Challenge
@@ -234,16 +234,16 @@ rmse_out
234234```
235235</details >
236236
237- The values for the out-of-sample RMSE will vary depending on the what
238- exact split was used. The diamonds is a rather extensive data, and
239- thus even when building out model using a subset of the available data
237+ The values for the out-of-sample RMSE will vary depending on what
238+ exact split was used. The diamonds data is rather extensive, and
239+ thus even when building our model using a subset of the available data
240240(80% above), we manage to generate a model with a low RMSE, and
241241possibly lower than the in-sample error.
242242
243243When dealing with datasets of smaller sizes, however, the presence of
244244a single outlier in the train and test data split can substantially
245- influence the model and the RMSE. We can't rely on such an approach an
246- need a more robust one, where, we can generate and use multiple,
245+ influence the model and the RMSE. We can't rely on such an approach and
246+ need a more robust one where we can generate and use multiple,
247247different train/test sets to sample a set of RMSEs, leading to a
248248better estimate of the out-of-sample RMSE.
249249
@@ -258,13 +258,13 @@ The figure below illustrates the cross validation procedure, creating
258258size of the data permits it). We split the data into 3 *random* and
259259complementary folds, so that each data point is used exactly once as
260260test data. This leads to a total test set size that is identical to
261- the size as the full dataset but is composed of out-of-sample
261+ the size of the full dataset but is composed of out-of-sample
262262predictions.
263263
264264![Schematic of 3-fold cross validation producing three training (blue) and testing (white) splits.](./figure/xval.png)
265265
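As a sketch of how such a scheme is requested with the `r CRANpkg("caret")` package used in the next section (this mirrors the figure, and is not necessarily the chapter's exact code), a 3-fold cross-validation of the diamond price model could be run as:

```{r, eval=FALSE}
library("caret")
library("ggplot2")   ## provides the diamonds data
## 3-fold cross-validation of the linear model for diamond prices: each
## observation is used exactly once for testing
model_cv <- train(price ~ ., diamonds,
                  method = "lm",
                  trControl = trainControl(method = "cv", number = 3))
model_cv$resample    ## per-fold out-of-sample RMSE
```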
266266After cross-validation, all models used within each fold are
267- discarded, and a new model is build using the whole dataset, with the
267+ discarded, and a new model is built using the whole dataset, with the
268268best model parameter(s), i.e. those that generalised best over all folds.
269269
270270This makes cross-validation quite time consuming, as it takes * x+1*
@@ -282,7 +282,7 @@ the `train` function in `r CRANpkg("caret")`. Below, we apply it to
282282the diamond price example that we used when introducing the model
283283performance.
284284
285- - We start by setting a random to be able to reproduce the example.
285+ - We start by setting a random seed to be able to reproduce the example.
286286- We specify the method (the learning algorithm) we want to use. Here,
287287 we use ` "lm" ` , but, as we will see later, there are many others to
288288 choose from[ ^ 1 ] .
@@ -348,13 +348,13 @@ assess its accuracy to do so.
348348### Confusion matrix
349349
350350Instead of calculating an error between predicted value and known
351- value, in classification we will directly compare of the predicted
352- class matches the known label. To do so, rather than calculating the
351+ value, in classification we will directly check whether the predicted
352+ class matches the known label. To do so, rather than calculating the
353353mean accuracy as we did above, in the introductory kNN example, we can
354354calculate a ** confusion matrix** .
355355
356- A confusion matrix to contrast predictions to actual results. Correct
357- results are * true positives* (TP) and * true negatives* that are found
356+ A confusion matrix contrasts predictions to actual results. Correct
357+ results, *true positives* (TP) and *true negatives* (TN), are found
358358along the diagonal. All other cells indicate false results, i.e. *false
359359negatives* (FN) and *false positives* (FP).
360360
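From these four counts, a number of standard performance metrics can be derived, for example:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}$$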
@@ -366,7 +366,7 @@ colnames(cmat) <- c("Reference Yes", "Reference No")
366366knitr::kable(cmat)
367367```
368368
369- The values that populate this table will depend on a the cutoff that
369+ The values that populate this table will depend on the cutoff that
370370we set to define whether the classifier should predict *Yes* or
371371*No*. Intuitively, we might want to use 0.5 as a threshold, and assign
372372every result with a probability > 0.5 to *Yes*, and to *No* otherwise.
@@ -395,7 +395,7 @@ cl <- ifelse(p > 0.5, "M", "R")
395395table(cl, test$Class)
396396```
397397
398- The caret package offers it's own, more informative function to
398+ The caret package offers its own, more informative function to
399399calculate a confusion matrix:
400400
401401``` {r soncmat}
@@ -457,8 +457,8 @@ the model performance along all possible thresholds:
457457
458458- an AUC of 0.5 corresponds to a random model
459459- values > 0.5 do better than a random guess
460- - a value 1 represents a perfect model
461- - a value 1 represents a model that is always wrong
460+ - a value of 1 represents a perfect model
461+ - a value of 0 represents a model that is always wrong
462462
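As a sketch (using the `pROC` package, which is an assumption, and assuming `p` still holds the predicted probabilities for the Sonar test set used above), the ROC curve and its AUC could be computed with:

```{r, eval=FALSE}
library("pROC")
## ROC curve and AUC for the Sonar predictions: compare the known test
## classes to the predicted class probabilities (a sketch)
roc_sonar <- roc(test$Class, p)
auc(roc_sonar)
plot(roc_sonar)
```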
463463### AUC in `caret`
464464
@@ -497,19 +497,19 @@ over-fitting and hence quite popular. They however require
497497hyperparameters to be tuned manually, like the value * k* in the
498498example above.
499499
500- Building random forest starts by generating a high number of
500+ Building a random forest starts by generating a high number of
501501individual decision trees. A single decision tree isn't very accurate,
502502but many different trees built using different inputs (with
503- bootstrapped inputs, features and observations) enable to explore a
503+ bootstrapped inputs, features and observations) enable us to explore a
504504broad search space and, once combined, produce accurate models, a
505505technique called *bootstrap aggregation* or *bagging*.
506506
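To make the bootstrap idea concrete, here is a minimal sketch (not taken from the chapter's code) of the kind of resampled input a single tree could be grown on:

```{r, eval=FALSE}
## One bootstrap draw (a sketch): observations sampled with replacement
## and a random subset of the features, as done for each tree of a forest
n <- nrow(iris)
obs <- sample(n, n, replace = TRUE)
feat <- sample(setdiff(names(iris), "Species"), 2)
head(iris[obs, c(feat, "Species")])
```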
507507### Decision trees
508508
509509A great advantage of decision trees is that they make a complex
510510decision simpler by breaking it down into smaller, simpler decisions
511- using divide-and-conquer strategy. They basically identify a set of
512- if-else conditions that split data according to the value if the
511+ using a divide-and-conquer strategy. They basically identify a set of
512+ if-else conditions that split the data according to the value of the
513513features.
514514
515515
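As an illustration (a sketch; the `rpart` package is an assumption and not necessarily the implementation used in the chapter), fitting and printing a single tree on the `iris` data makes these if-else splits explicit:

```{r, eval=FALSE}
library("rpart")
## A single decision tree predicting the iris species; printing the tree
## lists the if-else splits on feature values at each node
tree <- rpart(Species ~ ., data = iris)
print(tree)
```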
527527Decision trees choose splits based on the most homogeneous partitions, and
527527lead to smaller and more homogeneous partitions over their iterations.
528528
529529An issue with single decision trees is that they can grow and become
530- large and complex with many branches, with corresponds to
530+ large and complex with many branches, which corresponds to
531531over-fitting. Over-fitting models noise, rather than general patterns in
532532the data, focusing on subtle patterns (outliers) that won't
533533generalise.
@@ -537,7 +537,7 @@ can happen as a pre-condition when growing the tree, or afterwards, by
537537pruning a large tree.
538538
539539- *Pre-pruning*: stop the growing process, i.e. stop divide-and-conquer
540- after a certain number of iterations (grows tree at certain
540+ after a certain number of iterations (grows tree to a certain
541541 predefined level), or requires a minimum number of observations in
542542 each node to allow splitting.
543543
@@ -548,7 +548,7 @@ pruning a large tree.
548548### Training a random forest
549549
550550Let's return to random forests and train a model using the ` train `
551- infrastructure from ` r CRANpkg("caret") ` :
551+ function from ` r CRANpkg("caret") ` :
552552
553553``` {r loadrange, echo=FALSE, message=FALSE}
554554suppressPackageStartupMessages(library("ranger"))
@@ -566,10 +566,10 @@ print(model)
566566plot(model)
567567```
568568
569- The main hyperparameters is * mtry* , i.e. the number of randomly selected
570- variables used at each split. 2 variables produce random models, while
571- 100s of variables tend to be less random, but risk
572- over-fitting. ` caret ` automate the tuning of the hyperparameter using a
569+ The main hyperparameter is * mtry* , i.e. the number of randomly selected
570+ variables used at each split. Two variables produce random models, while
571+ hundreds of variables tend to be less random, but risk
572+ over-fitting. The ` caret ` package can automate the tuning of the hyperparameter using a
573573** grid search** , which can be parametrised by setting ` tuneLength `
574574(that sets the number of hyperparameter values to test) or directly
575575defining the ` tuneGrid ` (the hyperparameter values), which requires
@@ -628,9 +628,9 @@ such cases.
628628 contains a very high proportion of NAs, drop the feature
629629 altogether. These approaches are only applicable when the proportion
630630 of missing values is relatively small. Otherwise, it could lead to
631- loosing too much data.
631+ losing too much data.
632632
633- - Impute missing values.
633+ - Impute (replace) missing values.
634634
635635Data imputation can however have critical consequences depending on the
636636proportion of missing values and their nature. From a statistical
@@ -679,33 +679,33 @@ methods, that can directly be passed when training the model.
679679
680680### Median imputation
681681
682- Imputation using median of features. This methods works well if the
682+ Imputation using the median of the features. This method works well if the
683683data are missing at random.
684684
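To see what this amounts to, here is a by-hand sketch for a single feature; the `caret` call below then requests the same treatment for all features:

```{r, eval=FALSE}
## Median imputation by hand (a sketch): missing values are replaced by
## the median of the observed values of that feature
x <- c(1.2, NA, 3.1, 2.7, NA)
x[is.na(x)] <- median(x, na.rm = TRUE)
x
```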
685685``` {r, eval=TRUE}
686686train(X, Y, preProcess = "medianImpute")
687687```
688688
689- Imputing using caret also allows to optimise the imputation based on
689+ Imputing using caret also allows us to optimise the imputation based on
690690the cross validation splits, as ` train ` will do median imputation
691691inside each fold.
692692
693- ### KNN imputation
693+ ### kNN imputation
694694
695695If there is a systematic bias in the missing values, then median
696696imputation is known to produce incorrect results. kNN imputation will
697- impute missing values using on other, similar non-missing rows. The
697+ impute missing values based on other, similar non-missing rows. The
698698default number of neighbours used is 5.
699699
700700``` {r, eval=TRUE}
701701train(X, Y, preProcess = "knnImpute")
702702```
703703
704- ## Scaling and scaling
704+ ## Scaling and centering
705705
706706We have seen in the * Unsupervised learning* chapter how data at
707707different scales can substantially disrupt a learning
708- algorithm. Scaling (division by the standard deviation) and centring
708+ algorithm. Scaling (division by the standard deviation) and centering
709709(subtraction of the mean) can also be applied directly during model
710710training by setting the `preProcess` argument. Note that they are set
711711to be applied by default prior to training.
@@ -715,8 +715,8 @@ train(X, Y, preProcess = "scale")
715715train(X, Y, preProcess = "center")
716716```
717717
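For reference, these two operations amount to the following transformation, shown here by hand (a sketch) for a single variable:

```{r, eval=FALSE}
## Centering and scaling by hand (a sketch): subtract the mean, then
## divide by the standard deviation (equivalent to scale())
x <- rnorm(10, mean = 5, sd = 3)
(x - mean(x)) / sd(x)
```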
718- As we have discussed in the section about Principal component
719- analysis , PCA can be used as pre-processing method, generating a set
718+ As we have discussed in the section on Principal Component
719+ Analysis, PCA can be used as a pre-processing method, generating a set
720720of high-variance and perpendicular predictors, preventing
721721collinearity.
722722
@@ -733,9 +733,9 @@ center, scale, pca.
733733train(X, Y, preProcess = c("knnImpute", "center", "scale", "pca"))
734734```
735735
736- The pre-processing methods above represent a classical order or
736+ The pre-processing methods above represent a classical order of
737737operations, starting with data imputation to remove missing values,
738- then centring and scaling, prior to PCA.
738+ then centering and scaling, prior to PCA.
739739
740740<!-- ## Low information predictors -->
741741
@@ -752,7 +752,7 @@ For further details, see `?preProcess`.
752752## Model selection
753753
754754In this final section, we are going to compare different predictive
755- models and chose the best one using the tools presented in the
755+ models and choose the best one using the tools presented in the
756756previous sections.
757757
758758To do so, we are first going to create a set of common training
@@ -762,7 +762,7 @@ comparison between the different models.
762762
763763For this section, we are going to use the ` churn ` data. Below, we see
764764that about 15% of the customers churn. It is important to maintain
765- this proportion in all the folds.
765+ this proportion in all of the folds.
766766
767767``` {r churndata}
768768library("C50")
@@ -809,7 +809,7 @@ myControl <- trainControl(
809809
810810### ` glmnet ` model
811811
812- The ` glmnet ` is a liner model with build -in variable selection and
812+ `glmnet` fits a linear model with built-in variable selection and
813813coefficient regularisation.
814814
815815``` {r glmnetmodel, fig.cap=""}
@@ -950,7 +950,7 @@ little pre-processing.
950950>
951951> If you haven't done so, consider pre-processing the data prior to
952952> training for a model that didn't perform well and assess whether
953- > pre-processing affected to modelling.
953+ > pre-processing affected the modelling.
954954
955955<details >
956956``` {r svmmodel2, cache=TRUE, fig.cap=""}