This repository was archived by the owner on Oct 5, 2025. It is now read-only.

Commit 8bcf0e4

Merge pull request #3 from jl5000/patch-1
Minor grammar corrections
2 parents 2e4d013 + 0cec0d3 commit 8bcf0e4

File tree

1 file changed: +57 -57 lines


30-sml.Rmd

Lines changed: 57 additions & 57 deletions
@@ -10,21 +10,21 @@ is qualitative, and **regression**, where the output is quantitative.
 When two sets of labels, or classes, are available, one speaks of
 **binary classification**. A classical example thereof is labelling an
 email as *spam* or *not spam*. When more classes are to be learnt, one
-speaks of a **multi-class problem**, such as annotation a new *Iris*
+speaks of a **multi-class problem**, such as annotation of a new *Iris*
 example as being from the *setosa*, *versicolor* or *virginica*
 species. In these cases, the output is a single label (of one of the
 anticipated classes). If multiple labels may be assigned to each
-examples, one speaks of **multi-label classification**.
+example, one speaks of **multi-label classification**.

 ## Preview

 To start this chapter, let's use a simple, but useful classification
-algorithm, K nearest neighbours (kNN) to classify the *iris*
+algorithm, k-nearest neighbours (kNN) to classify the *iris*
 flowers. We will use the `knn` function from the `r CRANpkg("class")`
 package.

-K nearest neighbours works by directly measuring the (euclidean)
-distance between observations and infer the class of unlabelled data
+K-nearest neighbours works by directly measuring the (Euclidean)
+distance between observations and inferring the class of unlabelled data
 from the class of its nearest neighbours. In the figure below, the
 unlabelled instances *1* and *2* will be assigned classes *c1* (blue)
 and *c2* (red) as their closest neighbours are red and blue,
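
As a hedged illustration of the kNN approach described in this hunk (assuming the built-in `iris` data, the `class` package named in the text, and an arbitrary 80/20 split; the chapter's own code differs in its details):

```r
library("class")

set.seed(12)
## Hold out 20% of the iris flowers as 'unlabelled' observations
train_idx <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]

## Each test flower is assigned the species of its nearest labelled neighbour
knnres <- knn(train, test, cl = iris$Species[train_idx], k = 1)

## Proportion of correct predictions
mean(knnres == iris$Species[-train_idx])
```
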
@@ -101,8 +101,8 @@ table(knnres, iris$Species[nw])
 mean(knnres == iris$Species[nw])
 ```

-We have omitted and important argument from `knn`, which is the
-parameter *k* of the classifier. This value *k* defines how many
+We have omitted an important argument from `knn`, which is the
+parameter *k* of the classifier. This parameter defines how many
 nearest neighbours will be considered to assign a class to a new
 unlabelled observation. From the arguments of the function,

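
To illustrate the role of *k* discussed in this hunk, a small sketch (assuming the same `iris` data; the split and the candidate values of *k* are arbitrary):

```r
library("class")

set.seed(12)
i <- sample(nrow(iris), 120)          ## 120 training and 30 test observations
train <- iris[i, 1:4]
test  <- iris[-i, 1:4]

## Prediction accuracy for a few candidate values of k
sapply(c(1, 5, 15), function(k) {
    res <- knn(train, test, cl = iris$Species[i], k = k)
    mean(res == iris$Species[-i])
})
```
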
@@ -157,16 +157,16 @@ that we need to consider:

 In supervised machine learning, we have a desired output and thus know
 precisely what is to be computed. It thus becomes possible to directly
-evaluate a model using a quantifiable and object metric. For
+evaluate a model using a quantifiable and objective metric. For
 regression, we will use the **root mean squared error** (RMSE), which
 is what linear regression (`lm` in R) seeks to minimise. For
 classification, we will use **model prediction accuracy**.

 Typically, we won't want to calculate any of these metrics using
 observations that were also used to calculate the model. This
-approach, called **in-sample error** lead to optimistic assessment of
+approach, called **in-sample error** leads to optimistic assessment of
 our model. Indeed, the model has already *seen* these data upon
-construction, and is does considered optimised the these observations
+construction, and is considered optimised for these observations
 in particular; it is said to **over-fit** the data. We prefer to
 calculate an **out-of-sample error**, on new data, to gain a better
 idea of how to model performs on unseen data, and estimate how well
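
A sketch of the in-sample RMSE defined in this hunk, assuming an `lm` model of `price` against all other variables in the `diamonds` data (from `ggplot2`); the chapter's exact formula may differ:

```r
library("ggplot2")                       ## provides the diamonds data

model <- lm(price ~ ., data = diamonds)  ## assumed model formula
p <- predict(model, diamonds)

## In-sample RMSE: the error is measured on the data used to fit the model
rmse_in <- sqrt(mean((p - diamonds$price)^2))
rmse_in
```
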
@@ -196,7 +196,7 @@ p <- predict(model, diamonds)

 > Challenge
 >
-> Calculate the root mean squares error for the prediction above
+> Calculate the root mean squared error for the prediction above

 <details>
 ```{r}
@@ -208,9 +208,9 @@ rmse_in
 </details>

 Let's now repeat the exercise above, but by calculating the
-out-of-sample RMSE. We are prepare a 80/20 split of the data and use
-80% to fit our model predict the target variable (this is called the
-**training data**), the price, on the 20% unseen data (the **testing
+out-of-sample RMSE. We prepare a 80/20 split of the data and use
+80% to fit our model, and predict the target variable (this is called the
+**training data**), the price, on the 20% of unseen data (the **testing
 data**).

 > Challenge
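
One possible, hedged approach to the challenge referred to above, assuming the same `diamonds`/`lm` setup (the chapter's own solution sits in the hidden details block of the file):

```r
library("ggplot2")

set.seed(42)
n <- nrow(diamonds)
train_idx <- sample(n, 0.8 * n)                    ## 80% training data

model <- lm(price ~ ., data = diamonds[train_idx, ])
p <- predict(model, diamonds[-train_idx, ])        ## predict on the 20% testing data

## Out-of-sample RMSE, computed on observations the model has never seen
rmse_out <- sqrt(mean((p - diamonds$price[-train_idx])^2))
rmse_out
```
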
@@ -234,16 +234,16 @@ rmse_out
 ```
 </details>

-The values for the out-of-sample RMSE will vary depending on the what
-exact split was used. The diamonds is a rather extensive data, and
-thus even when building out model using a subset of the available data
+The values for the out-of-sample RMSE will vary depending on what
+exact split was used. The diamonds is a rather extensive data set, and
+thus even when building our model using a subset of the available data
 (80% above), we manage to generate a model with a low RMSE, and
 possibly lower than the in-sample error.

 When dealing with datasets of smaller sizes, however, the presence of
 a single outlier in the train and test data split can substantially
-influence the model and the RMSE. We can't rely on such an approach an
-need a more robust one, where, we can generate and use multiple,
+influence the model and the RMSE. We can't rely on such an approach and
+need a more robust one where we can generate and use multiple,
 different train/test sets to sample a set of RMSEs, leading to a
 better estimate of the out-of-sample RMSE.

@@ -258,13 +258,13 @@ The figure below illustrates the cross validation procedure, creating
 size of the data permits it). We split the data into 3 *random* and
 complementary folds, so that each data point appears exactly once in
 each fold. This leads to a total test set size that is identical to
-the size as the full dataset but is composed of out-of-sample
+the size of the full dataset but is composed of out-of-sample
 predictions.

 ![Schematic of 3-fold cross validation producing three training (blue) and testing (white) splits.](./figure/xval.png)

 After cross-validation, all models used within each fold are
-discarded, and a new model is build using the whole dataset, with the
+discarded, and a new model is built using the whole dataset, with the
 best model parameter(s), i.e those that generalised over all folds.

 This makes cross-validation quite time consuming, as it takes *x+1*
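
For reference, a minimal sketch of how such folds are typically requested with `caret` (an assumption; the chapter's own code appears in later hunks):

```r
library("caret")

## Ask caret to handle 3-fold cross-validation, so that every observation
## is used exactly once as test data across the folds
ctrl_cv <- trainControl(method = "cv", number = 3)
```

The resulting object is then passed to `train()` via its `trControl` argument, as in the sketches further below.
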
@@ -282,7 +282,7 @@ the `train` function in `r CRANpkg("caret")`. Below, we apply it to
 the diamond price example that we used when introducing the model
 performance.

-- We start by setting a random to be able to reproduce the example.
+- We start by setting a random seed to be able to reproduce the example.
 - We specify the method (the learning algorithm) we want to use. Here,
   we use `"lm"`, but, as we will see later, there are many others to
   choose from[^1].
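
Putting the bullet points above together, a hedged sketch (the number of folds and the model formula are assumptions, not the chapter's verbatim code):

```r
library("caret")
library("ggplot2")

set.seed(42)                          ## random seed, for reproducibility
model <- train(price ~ .,             ## assumed formula
               data = diamonds,
               method = "lm",         ## the learning algorithm
               trControl = trainControl(method = "cv", number = 10))
model
```
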
@@ -348,13 +348,13 @@ assess its accuracy to do so.
 ### Confusion matrix

 Instead of calculating an error between predicted value and known
-value, in classification we will directly compare of the predicted
-class matches the known label. To do so, rather than calculating the
+value, in classification we will directly compare the predicted
+class matches with the known label. To do so, rather than calculating the
 mean accuracy as we did above, in the introductory kNN example, we can
 calculate a **confusion matrix**.

-A confusion matrix to contrast predictions to actual results. Correct
-results are *true positives* (TP) and *true negatives* that are found
+A confusion matrix contrasts predictions to actual results. Correct
+results are *true positives* (TP) and *true negatives* (TN) are found
 along the diagonal. All other cells indicate false results, i.e *false
 negatives* (FN) and *false positives* (FP).

@@ -366,7 +366,7 @@ colnames(cmat) <- c("Reference Yes", "Reference No")
 knitr::kable(cmat)
 ```

-The values that populate this table will depend on a the cutoff that
+The values that populate this table will depend on the cutoff that
 we set to define whether the classifier should predict *Yes* or
 *No*. Intuitively, we might want to use 0.5 as a threshold, and assign
 every result with a probability > 0.5 to *Yes* and *No* otherwise.
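
A self-contained sketch of applying the 0.5 cutoff and tabulating predictions against reference labels (simulated data standing in for the chapter's example):

```r
set.seed(1)
## Simulated reference labels and predicted probabilities of 'Yes'
truth <- factor(sample(c("Yes", "No"), 100, replace = TRUE))
p     <- ifelse(truth == "Yes", 0.7, 0.3) + rnorm(100, sd = 0.25)

## Apply the 0.5 cutoff and build the confusion matrix
cl <- ifelse(p > 0.5, "Yes", "No")
table(predicted = cl, reference = truth)

## A more detailed summary (accuracy, sensitivity, specificity, ...) is
## given by caret::confusionMatrix(factor(cl, levels = levels(truth)), truth)
```
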
@@ -395,7 +395,7 @@ cl <- ifelse(p > 0.5, "M", "R")
 table(cl, test$Class)
 ```

-The caret package offers it's own, more informative function to
+The caret package offers its own, more informative function to
 calculate a confusion matrix:

 ```{r soncmat}
@@ -457,8 +457,8 @@ the model performance along all possible thresholds:

 - an AUC of 0.5 corresponds to a random model
 - values > 0.5 do better than a random guess
-- a value 1 represents a perfect model
-- a value 1 represents a model that is always wrong
+- a value of 1 represents a perfect model
+- a value 0 represents a model that is always wrong

 ### AUC in `caret`

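
As a sketch of how the AUC is typically requested from `caret` (an assumed setup, not the chapter's verbatim code):

```r
library("caret")

## Report the area under the ROC curve (together with sensitivity and
## specificity) instead of accuracy when training two-class models
rocControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

## A model would then be trained along the lines of
## train(Class ~ ., data = Sonar, method = "glm",
##       metric = "ROC", trControl = rocControl)
```
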
@@ -497,19 +497,19 @@ over-fitting and hence quite popular. They however require
 hyperparameters to be tuned manually, like the value *k* in the
 example above.

-Building random forest starts by generating a high number of
+Building a random forest starts by generating a high number of
 individual decision trees. A single decision tree isn't very accurate,
 but many different trees built using different inputs (with
-bootstrapped inputs, features and observations) enable to explore a
+bootstrapped inputs, features and observations) enable us to explore a
 broad search space and, once combined, produce accurate models, a
 technique called *bootstrap aggregation* or *bagging*.

 ### Decision trees

 A great advantage of decision trees is that they make a complex
 decision simpler by breaking it down into smaller, simpler decisions
-using divide-and-conquer strategy. They basically identify a set of
-if-else conditions that split data according to the value if the
+using a divide-and-conquer strategy. They basically identify a set of
+if-else conditions that split the data according to the value of the
 features.

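
To make the if-else splits concrete, a small sketch using `rpart` (one possible tree-growing package, assumed here; the chapter's random forests themselves are built with `ranger` below):

```r
library("rpart")

## A single decision tree on iris: a cascade of if-else tests on the
## features, each chosen to make the resulting partitions more homogeneous
tree <- rpart(Species ~ ., data = iris)
tree

## plot(tree); text(tree)   ## optional: draw the tree
```
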
@@ -527,7 +527,7 @@ Decision trees choose splits based on most homogeneous partitions, and
 lead to smaller and more homogeneous partitions over their iterations.

 An issue with single decision trees is that they can grow, and become
-large and complex with many branches, with corresponds to
+large and complex with many branches, which corresponds to
 over-fitting. Over-fitting models noise, rather than general patterns in
 the data, focusing on subtle patterns (outliers) that won't
 generalise.
@@ -537,7 +537,7 @@ can happen as a pre-condition when growing the tree, or afterwards, by
 pruning a large tree.

 - *Pre-pruning*: stop growing process, i.e stops divide-and-conquer
-  after a certain number of iterations (grows tree at certain
+  after a certain number of iterations (grows tree to a certain
   predefined level), or requires a minimum number of observations in
   each mode to allow splitting.

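
For illustration, pre-pruning expressed with `rpart` controls (hypothetical values; the chapter does not prescribe these):

```r
library("rpart")

## Stop growing early: limit the tree depth and require a minimum number
## of observations in a node before it may be split
ctrl <- rpart.control(maxdepth = 3, minsplit = 20)
small_tree <- rpart(Species ~ ., data = iris, control = ctrl)
small_tree
```
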
@@ -548,7 +548,7 @@ pruning a large tree.
 ### Training a random forest

 Let's return to random forests and train a model using the `train`
-infrastructure from `r CRANpkg("caret")`:
+function from `r CRANpkg("caret")`:

 ```{r loadrange, echo=FALSE, message=FALSE}
 suppressPackageStartupMessages(library("ranger"))
@@ -566,10 +566,10 @@ print(model)
 plot(model)
 ```

-The main hyperparameters is *mtry*, i.e. the number of randomly selected
-variables used at each split. 2 variables produce random models, while
-100s of variables tend to be less random, but risk
-over-fitting. `caret` automate the tuning of the hyperparameter using a
+The main hyperparameter is *mtry*, i.e. the number of randomly selected
+variables used at each split. Two variables produce random models, while
+hundreds of variables tend to be less random, but risk
+over-fitting. The `caret` package can automate the tuning of the hyperparameter using a
 **grid search**, which can be parametrised by setting `tuneLength`
 (that sets the number of hyperparameter values to test) or directly
 defining the `tuneGrid` (the hyperparameter values), which requires
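
A hedged sketch of such a grid for the *mtry* hyperparameter (with the `"ranger"` method, current `caret` versions also expect `splitrule` and `min.node.size` in the grid; the values below are arbitrary):

```r
library("caret")

myGrid <- expand.grid(mtry = c(2, 5, 10),     ## number of variables per split
                      splitrule = "gini",
                      min.node.size = 1)

## model <- train(Species ~ ., data = iris, method = "ranger",
##                tuneGrid = myGrid,
##                trControl = trainControl(method = "cv", number = 5))
```
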
@@ -628,9 +628,9 @@ such cases.
   contains a very high proportion of NAs, drop the feature
   altogether. These approaches are only applicable when the proportion
   of missing values is relatively small. Otherwise, it could lead to
-  loosing too much data.
+  losing too much data.

-- Impute missing values.
+- Impute (replace) missing values.

 Data imputation can however have critical consequences depending on the
 proportion of missing values and their nature. From a statistical
@@ -679,33 +679,33 @@ methods, that can directly be passed when training the model.

 ### Median imputation

-Imputation using median of features. This methods works well if the
+Imputation using median of features. This method works well if the
 data are missing at random.

 ```{r, eval=TRUE}
 train(X, Y, preProcess = "medianImpute")
 ```

-Imputing using caret also allows to optimise the imputation based on
+Imputing using caret also allows us to optimise the imputation based on
 the cross validation splits, as `train` will do median imputation
 inside each fold.

-### KNN imputation
+### kNN imputation

 If there is a systematic bias in the missing values, then median
 imputation is known to produce incorrect results. kNN imputation will
-impute missing values using on other, similar non-missing rows. The
+impute missing values using other, similar non-missing rows. The
 default value is 5.

 ```{r, eval=TRUE}
 train(X, Y, preProcess = "knnImpute")
 ```

-## Scaling and scaling
+## Scaling and centering

 We have seen in the *Unsupervised learning* chapter how data at
 different scales can substantially disrupt a learning
-algorithm. Scaling (division by the standard deviation) and centring
+algorithm. Scaling (division by the standard deviation) and centering
 (subtraction of the mean) can also be applied directly during model
 training by setting. Note that they are set to be applied by default
 prior to training.
@@ -715,8 +715,8 @@ train(X, Y, preProcess = "scale")
 train(X, Y, preProcess = "center")
 ```

-As we have discussed in the section about Principal component
-analysis, PCA can be used as pre-processing method, generating a set
+As we have discussed in the section on Principal Component
+Analysis, PCA can be used as pre-processing method, generating a set
 of high-variance and perpendicular predictors, preventing
 collinearity.

@@ -733,9 +733,9 @@ center, scale, pca.
 train(X, Y, preProcess = c("knnImpute", "center", "scale", "pca"))
 ```

-The pre-processing methods above represent a classical order or
+The pre-processing methods above represent a classical order of
 operations, starting with data imputation to remove missing values,
-then centring and scaling, prior to PCA.
+then centering and scaling, prior to PCA.

 <!-- ## Low information predictors -->

@@ -752,7 +752,7 @@ For further details, see `?preProcess`.
 ## Model selection

 In this final section, we are going to compare different predictive
-models and chose the best one using the tools presented in the
+models and choose the best one using the tools presented in the
 previous sections.

 To to so, we are going to first create a set of common training
@@ -762,7 +762,7 @@ comparison between the different models.

 For this section, we are going to use the `churn` data. Below, we see
 that about 15% of the customers churn. It is important to maintain
-this proportion in all the folds.
+this proportion in all of the folds.

 ```{r churndata}
 library("C50")
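## --- Editor's sketch, not part of the commit: one hedged way to create
## --- folds that preserve the ~15% churn rate. It assumes a C50 version
## --- that still ships the churn data, with a `churn` column of levels
## --- "yes"/"no", as used in the chapter.
library("caret")
data(churn, package = "C50")                      ## provides churnTrain
myFolds <- createFolds(churnTrain$churn, k = 5)   ## stratified sampling
sapply(myFolds, function(i) mean(churnTrain$churn[i] == "yes"))
## To reuse identical folds across models, pass training indices to
## trainControl(index = ...), e.g. createFolds(..., returnTrain = TRUE)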
@@ -809,7 +809,7 @@ myControl <- trainControl(

 ### `glmnet` model

-The `glmnet` is a liner model with build-in variable selection and
+The `glmnet` is a linear model with built-in variable selection and
 coefficient regularisation.

 ```{r glmnetmodel, fig.cap=""}
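## --- Editor's sketch, not part of the commit: a hypothetical tuning grid
## --- for glmnet, where alpha mixes ridge (0) and lasso (1) penalties and
## --- lambda sets the amount of regularisation.
library("caret")
glmnetGrid <- expand.grid(alpha = c(0, 0.5, 1),
                          lambda = 10^seq(-4, -1, length.out = 10))
## It would be passed to train() along the lines of
## train(churn ~ ., data = churnTrain, method = "glmnet",
##       tuneGrid = glmnetGrid, trControl = myControl)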
@@ -950,7 +950,7 @@ little pre-processing.
 >
 > If you haven't done so, consider pre-processing the data prior to
 > training for a model that didn't perform well and assess whether
-> pre-processing affected to modelling.
+> pre-processing affected the modelling.

 <details>
 ```{r svmmodel2, cache=TRUE, fig.cap=""}
