Physalia_ML/scripts/4.classification.Rmd at main · JustinePa/Physalia_ML · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "classification problems"
author: "Filippo Biscarini"
date: "7/7/2021"
output: html_document
---

```{r setup, include=FALSE}
library("knitr")
library("ggplot2")
library("reshape2")
library("tidyverse")
library("data.table")
knitr::opts_chunk$set(echo = TRUE)
```

## Our first classifier!

$$reminder: classifier = \text{predictive machine for classification problems} $$

### Reading the data

For this practical session on logistic regression we are using a dataset on the relationship between cleft lip in dogs (Nova Scotia Duck Tolling Retriever, NSDTR) and SNP genotypes ([data](https://datadryad.org/stash/dataset/doi:10.5061/dryad.j8r8q); [paper](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005059)).

The public dataset downloaded from Dryad is a *Plink* `.tped/.tfam` file. The data have been preprocessed:

-   filtered (SNP quality, call-rate, MAF)
-   imputed (imputation of missing genotypes using LHCI: localised haplotype-clustering imputation)
-   selected (only SNPs on chromosomes 25, 26, 27, 28, 29)

We begin by using a reduced version of the dataset, where $0.5\%$ of the SNP loci have been randomly picked:

```{r label="reading_data"}
dogs <- fread("../data/dogs_imputed_reduced.raw")
dogs <- dogs %>%
  select(-c(IID,FID,PAT,MAT,SEX))

## values for logistic regression in glm() must be 0 - 1
dogs$PHENOTYPE = dogs$PHENOTYPE-1
head(dogs)
print(nrow(dogs))
print(ncol(dogs)-1)
```

The dataset contains data on `r nrow(dogs)` and `r ncol(dogs)-1` SNP genotypes. Controls (dogs without cleft lip) are coded as `1`'s, cases (dogs with cleft lip) are coded as `2`'s:

```{r}
dogs %>%
  group_by(PHENOTYPE) %>%
  summarise(N=n())
```

```{r}
maf_controls <- colSums(dogs[dogs$PHENOTYPE==0,-c(1,2)])/(2*nrow(dogs[dogs$PHENOTYPE==0,]))
maf_cases <- colSums(dogs[dogs$PHENOTYPE==1,-c(1,2)])/(2*nrow(dogs[dogs$PHENOTYPE==1,]))
```

```{r}
maf <- data.frame("cases"=maf_cases, "controls"=maf_controls)
mD <- reshape2::melt(maf)
ggplot(mD, aes(y = value, x = variable)) + geom_boxplot(aes(fill=variable)) + xlab(element_blank()) + ylab("MAF")
```

## Fitting the logistic regression model

We now split the data into training and test set, and then fit the logistic regression model to the training set. We have an **unbalanced binomial dataset** (13 cases, 112 controls, tot = 125), therefore we may want to keep the same proportion of cases and controls when sampling the training and the test sets. To do so, we can use what is called "**stratified sampling**":

```{r logistic, echo=FALSE}
seed = 121
set.seed(seed)

dogs$id <- paste("id",seq(1,nrow(dogs)), sep="_")

training_set <- dogs %>%
  group_by(PHENOTYPE) %>%
  sample_frac(size = 0.7)

test_recs <- !(dogs$id %in% training_set$id)
test_set <- dogs[test_recs,]

training_set$id <- NULL
test_set$id <- NULL

table(training_set$PHENOTYPE)
table(test_set$PHENOTYPE)
```

In *R*, you can use the function `glm()` (generalised linear model) that allows you to specify the distribution (binomial in this case) and the link function (logit, in this case).

```{r, warning=FALSE}
glm.fit <- glm(PHENOTYPE ~ ., data = training_set, family=binomial(link="logit"))
```

## Making predictions

To make predictions, we apply the fitted model to the test data. We get probabilities of being a case ($$p(y=1|x)$$), and we need to set a threshold to discriminate between cases (1's) and controls (0's): a common choice for the threshold is **0.5**.

```{r}
glm.probs <- predict(glm.fit, newdata = test_set[,-1],type="response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred,test_set$PHENOTYPE)
```

```{r}
error = mean(glm.pred!=test_set$PHENOTYPE)
print(error)
```

```{r}
accuracy = 1 - error
print(accuracy)
```

The error rate is `r round(error,4)`; equivalently we can express this as accuracy (1-error rate) `r round(accuracy,3)`

```{r}
TER = sum(glm.pred != test_set$PHENOTYPE)/length(glm.pred)
TP = sum(glm.pred == 1 & test_set$PHENOTYPE == 1)
FN = sum(glm.pred == 0 & test_set$PHENOTYPE == 1)
TN = sum(glm.pred == 0 & test_set$PHENOTYPE == 0)
FP = sum(glm.pred == 1 & test_set$PHENOTYPE == 0)
FPR = FP/(FP+TN)
FNR = FN/(FN+TP)
print(paste("FPR is",FPR))
print(paste("FNR is",FNR))
```

False positive rate is `r FPR`, false negative rate is `r FNR`

[**Question for you: why this difference between FPR and FNR?**]{style="color:red"}

## Exercise 4.1

-   implement k-fold cross-validation to assess the performance of the logistic regression model:

```{r}
n = nrow(dogs)
k = 0
```

```{r}
## continue with your code here
```

## ROC curves

We use the same training/test split as above:

```{r, warning=FALSE}
glm.fit <- glm(PHENOTYPE ~ ., data = training_set, family=binomial(link="logit"))

glm.probs <- predict(glm.fit, newdata = test_set[,-1],type="response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred,test_set$PHENOTYPE)
```

In this example, both **FPR** and **FNR** are high, therefore we expect the AUC to be low (remember, ROC AUC is based on TPR=1-FNR and FPR).

[**Question for you: what do we consider to be a low AUC value?**]{style="color:red"}

```{r label="roc"}
library("ROCit")

ROCit_logit <- rocit(score=glm.probs,class=test_set$PHENOTYPE)
plot(ROCit_logit)
save(ROCit_logit, file = "ROClogit.RData") ## saving results in a local file
```

We can also have a look at the AUC (Area Under the ROC Curve):

```{r}
paste("AUC is:",print(ROCit_logit$AUC))
```

## MCC: Matthews Correlation Coefficient

$$
\phi = \frac{(TP \cdot TN - FP \cdot FN)}{\sqrt{(TP+FP) \cdot (TP+FN) \cdot (TN+FP) \cdot (TN+FN)}}
$$

```{r}
## function to calculate MCC
mcc = function(tp,fp,tn,fn) {
  phi =(tp*tn - fp*fn) / sqrt((tp+fp)*(fn+tn)*(tp+fn)*(fp+tn))
  return(phi)
}
```

```{r}
conf_matrix = table(glm.pred,test_set$PHENOTYPE)
conf_matrix
```

[**Task for you: Let's identify true and false positives and negatives**]{style="color:yellow"}

```{r}
## uncommetn and complete below
# TP = conf_matrix[,]
# FP = conf_matrix[,]
# TN = conf_matrix[,]
# FN = conf_matrix[,]

# mcc_score = mcc(tp = TP, fp = FP, tn = TN, fn = FN)
# print(mcc_score)
```

Let's use an R package to do the calculations:

```{r}
library("mltools")

df <- data.frame(conf_matrix)
dm <- df |> spread(key = Var2, value = Freq) |> select(-glm.pred) |> data.matrix()
mltools::mcc(confusionM = as.matrix(dm))
```