---
title: "Tutorial: Staes - total normalisation - MaxLFQ summarisation"
code-tools: true
author:
- name: Lieven Clement
orcid: 0000-0002-9050-4370
affiliation:
- name: Ghent University
---
```{r, echo = FALSE}
source("https://raw.githubusercontent.com/statOmics/PDA-DIA/master/R/utils.R")
```
# Background
We will use a publicly available spike-in study published by Staes et al. [@Staes2024].
They spiked digested UPS proteins into a digested yeast background at the following yeast:ups ratios: 10:1, 10:2, 10:4, 10:8 and 10:10.
Here we will use a subset of the data, i.e. the 10:2 and 10:4 dilutions.
We will use the output of the search engine DIA-NN 2.2.0.
For this DIA-NN version the main search output is stored in the report.parquet file of the DIA-NN output directory, which can be found under data/spikein24-staesetal2024.parquet.
DIA-NN provides multiple quantifications, e.g. derived from the MS1 or MS2 spectra, and at the precursor or protein (protein group) level. The term 'precursor' refers to a charged peptide species and is the basic unit of identification and quantification in DIA. Hence, in the context of DIA we refer to a precursor table instead of a PSM table as in DDA.
Examples of different quantities are:
- raw MS1 area: Ms1.Area, normalised MS1 Area: Ms1.Normalised, MS2 Precursor quantities: Precursor.Quantity, Normalised MS2 Precursor quantities: Precursor.Normalised, etc., which are all at the precursor level
- MS2 based summary at the protein (protein group)-level: PG.MaxLFQ
Here, we will use the `Precursor.Quantity` column.
# Load packages
We load the `msqrob2` package, along with additional packages for
data manipulation and visualisation.
```{r load_libraries}
library("QFeatures")
library("dplyr")
library("tidyr")
library("ggplot2")
library("msqrob2")
library("stringr")
library("ExploreModelMatrix")
library("MsCoreUtils")
library("matrixStats")
library("patchwork")
library("kableExtra")
library("ComplexHeatmap")
library("purrr")
library("tibble")
library("scater")
```
# Data
## Precursor table
We load the output from the DIA-NN parquet file. This can be a path to a local file or a URL to a file on the web.
```{r import_data}
precursorFile <- "https://github.com/statOmics/PDA-DIA/raw/refs/heads/main/data/spikein24-staesetal2024.parquet"
```
We can import the report.parquet file using the `read_parquet` function from the `arrow` package.
Note that older versions of DIA-NN store the output as report.tsv.
```{r}
precursors <- arrow::read_parquet(precursorFile) # function from the arrow package
#precursors <- data.table::fread(precursorFile) # For older versions of DIA-NN, where the results are stored as tsv files. Note that the precursorFile then would point to "report.tsv"
```
The precursor table is in "long format": each row contains information about one precursor in a specific run (the table below shows the first 6 rows).
The columns contain various descriptors of the precursor, such as its sequence, its charge, the run, etc. Some of these columns contain the quantification values, e.g. Ms1.Area, Precursor.Quantity, etc.
```{r, echo=FALSE}
knitr::kable(head(precursors))
```
A quick check on the distribution of the precursor MS2 intensities:
```{r}
precursors |>
ggplot(aes(x = log2(Precursor.Quantity))) +
geom_density() +
theme_minimal()
```
We select a subset of the columns to reduce the memory footprint.
```{r}
precursors <- precursors |>
select(
Run,
Precursor.Id,
Modified.Sequence,
Stripped.Sequence,
Precursor.Charge,
Protein.Group,
Protein.Names,
Protein.Ids,
Genes,
Precursor.Quantity,
Precursor.Normalised,
Normalisation.Factor,
Ms1.Area,
Ms1.Normalised,
PG.MaxLFQ,
Q.Value,
Lib.Q.Value,
PG.Q.Value,
Lib.PG.Q.Value,
Proteotypic,
Decoy, # Not available in older versions of DIA-NN
RT)
```
We know the ground truth: UPS proteins are differentially abundant (DA, spiked in), yeast proteins are not.
```{r}
precursors <- precursors |>
mutate(species = grepl(pattern = "UPS",Protein.Group) |>
as.factor() |>
recode("TRUE"="ups","FALSE" = "yeast"))
precursors |>
pull(species) |>
table()
```
## Sample annotation table
The [sample annotation table](#sec-annotation_table) is not available,
but it can be generated from the run labels, as the researchers encoded the design in the filenames.
We will make a new data frame with the annotation.
1. We first generate the variable `runCol`, which lists the different runs in the experiment, i.e. the unique names in the Run column of the report file. This is a mandatory column in the annotation file required by the `readQFeatures` function that we will use to generate a QFeatures object with the quantification data.
2. Next we generate the variable `sampleId` to pinpoint the different samples in the dataset. It can be extracted from the run names, as the researchers stored metadata in the file name of each run. E.g. for run B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2 this is the ratio, i.e. ratio04, and the replicate, i.e. _2 after DIA. Note that the first replicate has no number.
a. We first replace the redundant pattern "_DIA" with an empty character string.
b. Next we split the strings according to the pattern "UPS2_" using the `str_split` function and
c. keep the right part of the string. We do this by looping over the list returned by `str_split` with `sapply`, which takes the output of b. as its first argument, uses the function `[` to subset the vector of strings in each list element, and uses the optional argument `2` to select the second string of the vector (the right part).
3. We add the variable `condition` by parsing the variable `sampleId` and
a. replacing the pattern "ratio" by an empty string using the `gsub` function,
b. splitting the output of `gsub` according to the pattern "_",
c. keeping the left part of the string (the first element of the vector of strings in each list item of the `str_split` output),
d. converting the output of c. to a number and
e. converting it into a factor.
4. We add the variable `rep` by parsing the variable `sampleId`, and
a. splitting it according to the pattern "_",
b. keeping the right part of the string (the second element of the vector of strings in each list item of the `str_split` output),
c. replacing NA by 1, as no number was added to the filename of the first replicate, and
d. converting it into a factor.
Finally, we add the numeric variable `ratio` by converting the levels of `condition` back to numbers.
```{r create_metadata}
annot <- data.frame(runCol = precursors |>
pull(Run) |>
unique() # 1.
) |>
mutate(sampleId = gsub(x = runCol, pattern = "_DIA", replacement = "") |> #2.a
str_split("UPS2_") |> #2.b
sapply(`[`, 2) #2.c
) |>
mutate(
condition = gsub("ratio", "", sampleId) |> #3.a
str_split("_") |> #3.b
sapply(`[`, 1) |> #3.c
as.numeric() |> #3.d
as.factor(), #3.e
rep = sampleId |>
str_split("_") |> #4.a
sapply(`[`, 2) |> #4.b
replace_na(replace = "1") |> #4.c
as.factor(), #4.d
ratio = condition |>
as.character() |>
as.double()
)
```
## Convert to QFeatures
First, recall that the precursor table is in long format.
Every quantitative column in the precursor table contains
information for multiple runs. Therefore, the function splits the table
based on the run identifier, given by the `runCol` argument (for
DIA-NN, that identifier is contained in the `Run` column).
So, the
`QFeatures` object after import will contain as many sets as there are
runs.
Next, the function links the annotation table with the precursor data.
To achieve this, the annotation table must contain a `runCol` column
that provides the run identifier in which each sample was
acquired; this information is used to match the identifiers
in the `Run` column of the precursor table.
Here, we will use the `Precursor.Quantity` column as quantification input.
Note that we earlier selected a subset of the columns to reduce the footprint of the QFeatures object.
```{r}
(qf <- readQFeatures(assayData = precursors,
colData = annot,
quantCols = "Precursor.Quantity",
runCol = "Run",
fnames = "Precursor.Id"))
```
# Data preprocessing{#sec-basic_preprocess}
The data preprocessing workflow for DIA data is similar to the workflow for DDA-LFQ data, but there are subtle differences, as we start from precursor-level data and have additional columns.
## Encoding missing values
We first replace any zero in the quantitative data
with an `NA`.
```{r}
qf <- zeroIsNA(qf, names(qf))
```
Note that `msqrob2` can handle missing data without having to rely on
hard-to-verify imputation assumptions, which is our general recommendation. However, `msqrob2` does not
prevent users from using imputation, which can be performed with
`impute()` from the `QFeatures` package.
## Precursor Filtering
Filtering removes low-quality and unreliable precursors that would otherwise introduce noise and artefacts in the data.
### Remove questionable identifications
We apply standard filtering advised by DIA-NN:
1. q-value threshold of 0.01 for the identification of precursors (`Q.Value`) and protein groups (`PG.Q.Value`). If the file is processed via matching between runs, it is also useful to filter on Lib.Q.Value and Lib.PG.Q.Value.
2. Remove precursors that could not be mapped, i.e. when `Precursor.Id` column is an empty string.
3. Filter decoys, i.e. only keep precursors for which the `Decoy` column equals 0. (Note, that the `Decoy` column is not present in the output of older versions of DIA-NN)
4. Keep only proteotypic peptides, which map uniquely to a specific protein.
```{r}
qf <- qf |>
filterFeatures(~ Q.Value <= 0.01 & #1.
PG.Q.Value <= 0.01 & #1.
Lib.Q.Value <= 0.01 & #1.
Lib.PG.Q.Value <= 0.01 & #1.
Precursor.Id != "" & #2.
Decoy == 0 & #3. (Note, that this filter is not available with previous versions of DIA-NN, as the report.tsv file did not include a Decoy column. So other strategies are needed if Decoys are in the output file.)
Proteotypic == 1) #4.
```
Note that it is important that the filtering criteria do not distort the distribution of the downstream test statistics for features that are non-DA.
It can be shown that filtering does not induce biased results when the filtering criterion is independent of the test statistic. The criteria that we proposed above are all based on the results of the identification step, and are therefore independent of the downstream test statistics that will be used to prioritise DA proteins.
### Assay joining
Up to now, the data from the different runs were kept in separate sets. We can now join them into a single precursor set using joinAssays(). Sets are joined by stacking the columns (samples) into one matrix, and rows (features) are matched according to a row identifier, here the `Precursor.Id`.
We store the result in a set named `precursors`.
```{r}
(qf <- joinAssays(
x = qf,
i = names(qf),
fcol = "Precursor.Id",
name = "precursors"
))
```
### Filtering: Remove highly missing precursors
We keep precursors that were observed at least 4 times out of the $n
= 6$ samples, so that we can estimate the peptide characteristics.
We tolerate the following proportion of NAs:
$\text{pNA} = \frac{(n - 4)}{n} = 0.33$, so we keep peptides that
are observed in at least 66.6% of the samples, which roughly corresponds to 2 samples per condition. This is an arbitrary
value that may need to be adjusted depending on the experiment and the
data set.
```{r}
nObs <- 4
n <- ncol(qf[["precursors"]])
(qf <- filterNA(qf, i = "precursors", pNA = (n - nObs) / n))
```
### Filter one-hit wonders
Here, we remove proteins that are identified by only one peptide, as such proteins may not be trustworthy.
1. We first calculate how many distinct peptides map to each protein (`Protein.Group`). For this purpose we use the stripped precursor sequence, i.e. the sequence of the base peptide without modifications.
2. We store this information in the row data of the precursors assay
3. We filter precursors of one-hit wonder proteins.
```{r}
# Filter for peptides per protein
pepsPerProtDf <- qf[["precursors"]] |>
rowData() |>
data.frame() |>
dplyr::select("Stripped.Sequence", "Protein.Group") |>
group_by(Protein.Group) |>
mutate(pepsPerProt = Stripped.Sequence |>
unique() |> length()
) #1.
rowData(qf[["precursors"]])$pepsPerProt <- pepsPerProtDf$pepsPerProt #2.
qf <- filterFeatures(qf,
~ pepsPerProt > 1,
keep = TRUE) #3.
```
## Log-transformation
We perform log2-transformation with `logTransform()` from the
`QFeatures` package. We use `base = 2` and store the result in a new summarized experiment named `precursors_log`
```{r}
qf <- logTransform(qf,
base = 2,
i = "precursors",
name = "precursors_log")
```
## Normalisation {#sec-norm}
The most common objective of MS-based proteomics experiments is to
understand the biological changes in protein abundance between
experimental conditions. However, changes in measurements between
groups can also be caused by technical factors. For instance, there are
systematic run-to-run fluctuations that shift the measured
intensity distribution. We can explore this as follows:
1. We extract the sets containing the log transformed data. This is
performed using `QFeatures`' 3-way subsetting.
2. We use `longForm()` to convert the `QFeatures` object into a long
table, including `condition` and `concentration` for filtering and
colouring.
3. We visualise the density of the quantitative values within each
sample. We colour each sample based on its spike-in condition.
```{r}
qf[, , "precursors_log"] |> #1.
longForm(colvars = colnames(colData(qf))) |> #2.
data.frame() |>
filter(!is.na(value)) |>
ggplot() + #3.
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal()
```
There are many ways to perform the normalisation. Here we normalise based on the total ion intensity.
```{r}
qf <- sweep( # Subtract the log2 normalisation factor column-by-column (MARGIN = 2)
qf,
MARGIN = 2,
STATS = nf_log_tot(qf,"precursors_log"),
i = "precursors_log",
name = "precursors_norm"
)
```
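The `nf_log_tot()` helper is defined in the utils.R script that was sourced at the top of this document and is not shown here. As an illustration only, a total-intensity normalisation factor on the log2 scale could be sketched as follows (this is an assumption; the actual helper may be implemented differently):

```{r}
# Hypothetical sketch of a log2 total-intensity normalisation factor;
# the real nf_log_tot() lives in the sourced utils.R and may differ.
nfLogTotSketch <- function(y) {
  # y: matrix of log2 intensities (features x samples)
  nf <- log2(colSums(2^y, na.rm = TRUE)) # log2 total intensity per sample
  nf - mean(nf)                          # centre so that the factors sum to zero
}
toy <- matrix(log2(1:12), nrow = 3)
nfLogTotSketch(toy)
```

Centring the factors on their mean keeps the overall intensity scale of the data intact while removing the run-to-run shifts.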
We explore the effect of the global normalisation in the subsequent plot.
Formally, the function applies the following operation on each sample
$i$ across all precursors $p$:
$$
y_{ip}^{\text{norm}} = y_{ip} - \log_2(nf_i)
$$
with $y_{ip}$ the log2-transformed intensities and $nf_i$ the normalisation factor for sample $i$. Upon normalisation, we can see
that the distributions of the $y_{ip}^{\text{norm}}$ nicely overlap (using the same code as above)
```{r}
qf[, , "precursors_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
labs(subtitle = "Normalised log2 precursor intensities") +
theme_minimal()
```
## Summarisation
Here, we summarise the precursor-level data into protein intensities using maxLFQ.
maxLFQ first calculates all pairwise log ratios between samples, using only their shared precursors.
Particularly, it uses the median of the log ratios over the shared precursors $s$, i.e.
$$r_{ij} = \text{median}_s(y_{sj}-y_{si}).$$
Hence, it first eliminates peptide effects.
It then estimates the protein summaries $y^{prot}_i$ by minimising
$$
\sum_i\sum_j(y^{prot}_j - y^{prot}_i - r_{ij})^2.
$$
It is implemented in the `maxLFQ` function of the `iq` package.
`aggregateFeatures()` streamlines summarisation. It requires the name
of a `rowData` column to group the precursors into proteins (or
protein groups), here `Protein.Group`. We provide the summarisation
approach through the `fun` argument. Other summarisation methods
are available from the `MsCoreUtils` package, see `?aggregateFeatures`
for a comprehensive list. The function will return a `QFeatures`
object with a new set that we called `proteins`.
```{r, warning=FALSE}
(qf <- aggregateFeatures(
qf, i = "precursors_norm",
name = "proteins",
fcol = "Protein.Group",
fun = function(X) iq::maxLFQ(X)$estimate
))
```
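To build intuition for the pairwise-ratio step described above, we can sketch it on a toy example (a minimal illustration, not the `iq` implementation):

```{r}
# Toy illustration of the maxLFQ pairwise-ratio step (a sketch, not the
# iq implementation): two samples, three shared precursors, log2 scale.
y <- matrix(c(10, 12, 11,
              11, 13, 12),
            ncol = 2,
            dimnames = list(paste0("prec", 1:3), c("s1", "s2")))
# Median log2 ratio between sample 2 and sample 1 over shared precursors:
r12 <- median(y[, "s2"] - y[, "s1"])
r12 # the precursor effects cancel; only the sample shift of 1 remains
```

Because each precursor is compared with itself across samples, precursor-specific effects drop out of the differences, which is why maxLFQ is robust to peptides with very different ionisation efficiencies.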
# Data exploration and QC
Data exploration aims to highlight the main sources of variation in
the data prior to data modelling and can pinpoint outlying or
aberrant samples.
## Marginal distribution at precursor and protein level
```{r}
qf[, , "precursors_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal() +
labs(subtitle = "Normalised log2 precursor intensities")
```
```{r}
qf[, , "precursors_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = sampleId,
y = value,
colour = condition,
group = colname) +
xlab("sample") +
geom_boxplot() +
theme_minimal() +
labs(subtitle = "Normalised log2 precursor intensities")
```
```{r}
qf[, , "proteins"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = sampleId,
y = value,
colour = condition,
group = colname) +
xlab("sample") +
geom_boxplot() +
theme_minimal() +
labs(subtitle = "Normalised log2 protein intensities")
```
```{r}
qf[, , "proteins"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal() +
labs(subtitle = "Normalised log2 protein intensities")
```
## Charge state
```{r}
qf[, , "precursors_norm"] |>
longForm(colvars = colnames(colData(qf)), rowvars = "Precursor.Charge") |>
as.data.frame() |>
filter(!is.na(value)) |>
filter(Precursor.Charge<=4) |>
ggplot(aes(x = sampleId)) +
geom_bar(aes(fill = factor(Precursor.Charge, levels = 4:1)),
colour = "black") +
labs(title = "Peptide types per sample",
x = "Sample",
fill = "Charge state") +
theme_bw()
```
## Identifications per sample
```{r}
qf[,,"precursors_norm"] |>
longForm(colvars = colnames(colData(qf)),
rowvars= c("Precursor.Id",
"Protein.Group",
"Stripped.Sequence",
"Precursor.Charge")) |>
data.frame() |>
filter(!is.na(value)) |>
mutate(cond_rep = paste0("ratio", condition, "_", rep)) |>
group_by(condition, cond_rep) |>
summarise(Precursors = length(unique(Precursor.Id)),
`Protein Groups` = length(unique(Protein.Group))) |>
pivot_longer(-(1:2),
names_to = "Feature",
values_to = "IDs") |>
ggplot(aes(x = cond_rep, y = IDs, fill = condition)) +
geom_col() +
#scale_fill_observable() +
facet_wrap(~Feature,
scales = "free_y") +
labs(title = "Precursor and protein group identifications per sample",
x = "Sample",
y = "Identifications") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
```
## Dimensionality reduction plot
A common approach for data exploration is to
perform dimension reduction, such as Multi Dimensional Scaling (MDS).
We will first extract the set to explore along the sample annotations
(used for plot colouring).
```{r}
se <- getWithColData(qf, "proteins")
```
We then use the `scater` package to compute and plot the MDS. For
technical reasons, it requires a `SingleCellExperiment` class object,
but these can easily be generated from a `SummarizedExperiment`
object.
```{r}
se <- runMDS(as(se, "SingleCellExperiment"), exprs_values = 1)
```
We can now explore the data structure while coloring for the factor of
interest, here `condition`.
```{r fig.width=8, fig.height=3, fig.cap="MDS visualisation of the spike-in data set. Each point represents a sample and is coloured by the spike-in condition."}
plotMDS(se, colour_by = "condition", shape_by = "rep")
```
This plot reveals interesting information. First, we see that the
samples are nicely separated according to their spike-in condition.
Interestingly, the conditions are sorted by the concentration.
## Correlation matrix
```{r}
corMat <- qf[["proteins"]] |>
assay() |>
cor(method = "spearman", use = "pairwise.complete.obs")
colnames(corMat) <- qf$sampleId
rownames(corMat) <- qf$sampleId
corMat |>
ggcorrplot::ggcorrplot() +
scale_fill_viridis_c() +
labs(title = "Correlation matrix",
fill = "Correlation") +
theme(axis.text.x = element_text(angle = 90))
```
# Data Modeling (Robust Regression){#sec-modelling}
## Model estimation
1. We first define the model. We only have one source of variability in the experiment that we can model, i.e. the effect of the spike-in condition.
2. We fit the model to each protein in the `proteins` summarised experiment of the QFeatures object `qf` using the msqrob function.
```{r warning=FALSE}
model <- ~ condition
qf <- msqrob(
qf,
i = "proteins",
formula = model,
robust = TRUE)
```
We enabled M-estimation (`robust = TRUE`) for improved robustness
against outliers.
The fitting results are available in the `msqrobModels` column of the
`rowData`. More specifically, the modelling output is stored in the
`rowData` as a `statModel` object, one model per row (protein). We
will see in a later section how to perform statistical inference on
the estimated parameters.
```{r}
models <- rowData(qf[["proteins"]])[["msqrobModels"]]
models[1:3]
```
## Inference
We can now convert the research question "do the spike-in
conditions affect the protein intensities?" into a statistical
hypothesis.
In other words, we need to translate this question into a null and an alternative hypothesis on a single model parameter or a linear combination of model parameters, which is also referred to as a [contrast](#sec-inference).
To aid defining contrasts, we will visualise the experimental design using the `ExploreModelMatrix` package.
```{r}
vd <- ExploreModelMatrix::VisualizeDesign(
sampleData = colData(qf),
designFormula = model,
textSizeFitted = 4
)
vd$plotlist
```
We have a two-group comparison, and the average log2 fold change between condition 4 and condition 2 can be estimated using the parameter:
$$
\log_2 \widehat{\text{FC}}^{4-2} = \hat\mu^4 - \hat\mu^2 = \left(\text{(Intercept)} + \text{condition4}\right) - \text{(Intercept)} = \text{condition4}
$$
So we can assess the null hypothesis that a protein is not differentially abundant between spike-in condition 4 and spike-in condition 2 as follows:
$$
H_0: \log_2 \text{FC}^{4-2} = 0 \text{ or } \text{condition4} = 0
$$
Below we define the contrast and construct the corresponding contrast matrix.
```{r}
contrast <- "condition4 = 0"
(L <- makeContrast(
contrast,
parameterNames = colnames(vd$designmatrix)
))
```
We assess the contrast for each protein.
```{r warning=FALSE}
qf <- hypothesisTest(qf, i = "proteins", contrast = L)
```
We extract the results table from the `proteins` summarised experiment in the `qf` object.
```{r}
inference <-
msqrobCollect(qf[["proteins"]], L)
```
## Report results
We report the results using a results table, volcano plots and heatmaps.
### Results table
We report all results that are significant at the nominal FDR level of 5%.
1. We define the nominal FDR level
2. We filter the results
3. We arrange them according to statistical significance
```{r}
alpha <- 0.05
inference |>
filter(adjPval < alpha) |>
arrange(pval) |>
knitr::kable()
```
We observe that the results only contain spiked UPS proteins.
### Volcanoplots
```{r}
volcanoplot4 <- plot_volcano(inference)
volcanoplot4
```
Note that the log2 fold change estimates are nicely centred around the real log2 fold change: $\log_2(4/2) = 1$.
### Heatmaps
We make the heatmap as follows:
1. We select the names of the proteins that were declared significant between spike-in conditions 4 and 2.
2. We extract their quantitative data with `assay()` and scale by rows.
3. We create a column annotation using the ComplexHeatmap package, which enables heatmap annotations. We annotate the heatmap with our model variable condition.
4. We make the heatmap.
```{r}
sig <- inference |>
filter(adjPval < alpha) |>
arrange(pval) |>
pull(feature) #1.
se <- getWithColData(qf, "proteins")
quants <- t(scale(t(assay(se[sig,])))) #2.
colnames(quants) <- paste0("con", se$condition,"_rep",se$rep) #specific to this dataset to get short colnames
annotations <- columnAnnotation(
condition = se$condition
) #3.
set.seed(1234) ## annotation colours are randomly generated by default
heatmap <- Heatmap(
quants,
name = "log2 intensity",
top_annotation = annotations,
column_title = paste0(contrast, " = 0")
) #4.
heatmap
```
### Detail plots
We can explore the data for a protein to validate the statistical inference results. For example, let’s explore the normalised precursor and the summarised protein intensities for the protein with the most significant log2 fold change.
```{r}
(targetProtein <- inference |>
slice(which.min(pval)) |>
pull(feature)
)
```
To obtain the required data, we perform a small data manipulation pipeline:
We use the QFeatures subsetting functionality to retrieve all data related to the target protein, focusing on the precursors_norm and proteins sets, which contain the precursor data used for model fitting and the protein summaries. We then convert the data with longForm() for plotting. Finally, we plot the log2 normalised intensities for each sample at the protein and at the precursor level. Since multiple precursors are recorded for the protein, we link precursors across samples using a grey line. Samples are coloured according to the spike-in condition.
```{r}
qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
longForm(colvars = colnames(colData(qf))) |> #2
data.frame() |>
ggplot() +
aes(x = colname,
y = value) +
geom_line(aes(group = rowname), linewidth = 0.1) +
geom_point(aes(colour = condition)) +
facet_wrap(~ assay, scales = "free") +
ggtitle(targetProtein) +
theme_minimal() +
theme(axis.text.x = element_blank())
```
```{r}
qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
longForm(colvars = colnames(colData(qf))) |> #2
data.frame() %>%
{
ggplot(.) +
aes(x = condition,
y = value) +
geom_boxplot(aes(colour = condition)) +
facet_wrap(~ assay, scales = "free") +
geom_jitter(aes(shape = rowname)) +
scale_shape_manual(values = seq_len(dplyr::n_distinct(.$rowname))) +
ggtitle(targetProtein) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
guides(shape = "none")
}
```
# Assess performance
Note that the evaluations in this section are typically not possible for real experimental data.
So this section should be deleted if you reuse this script as a basis for analysing data from real experiments.
We first reconstruct the inference table and add the species information, which gives us the ground truth: yeast features are non-DA and ups features are DA.
```{r}
inference <-
msqrobCollect(qf[["proteins"]], L) |>
mutate(species = rep(rowData(qf[["proteins"]])$species, ncol(L)))
```
Indeed, we are using a spike-in dataset so we know the ground truth: all human UPS proteins are differentially abundant (DA) between the conditions and the yeast proteins are non-DA.
We use a nominal FDR level of 0.05
```{r}
alpha <- 0.05
```
## Real Fold changes
As this is a spike-in study with known ground truth, we can also plot the log2 fold change distributions against the expected values, in this case 0 for the yeast proteins and the difference of the log concentration for the spiked-in UPS standards. We first create a small table with the real values.
1. We extract the levels from the factor condition
2. We convert them into numbers, as they refer to the real ratios
3. We log2-transform them
4. We calculate all pairwise differences between the log2-transformed ratios to obtain the log2 fold changes
We use a similar operation to construct the name of the corresponding contrast.
```{r}
(realLogFC <- data.frame(
logFC = colData(qf)$condition |>
levels() |> # 1.
as.double() |> # 2.
log2() |> # 3.
combn(m=2, FUN = diff), #4.
contrast = colData(qf)$condition |>
levels() |>
combn(m=2, FUN= function(x) paste0(x[2]," - ",x[1])) |>
recode("4 - 2" = "condition4",
"8 - 2" = "condition8",
"8 - 4" = "condition8 - condition4")
)
)
```
We now evaluate whether the estimated fold changes correspond to the real fold changes.
1. We plot the log2 fold changes as a function of species and colour according to species.
2. We add a boxplot layer.
3. We use custom colours.
4. We add a horizontal line at 0, which corresponds to the known log2 fold change for yeast proteins.
5. We add horizontal lines for the known log2 fold changes of the spiked UPS proteins.
```{r}
logFC <- inference |>
filter(!is.na(logFC)) |>
ggplot(aes(x = species, y = logFC, color= species)) + #1.
geom_boxplot() + #2.
theme_bw() +
scale_color_manual(
values = c("grey20", "firebrick"), #3.
name = "",
labels = c("yeast", "ups")
) +
geom_hline(yintercept = 0, color="grey20") + # 4.
facet_wrap(~contrast) +
geom_hline(aes(yintercept = logFC), color="firebrick", data=realLogFC) #5.
logFC
```
## True and false positives
All human UPS proteins are differentially abundant (DA) between the conditions and the yeast proteins are non-DA.
For each contrast, we count the true positives (TP) and false positives (FP) among the proteins returned at the nominal FDR level, and compute the corresponding false discovery proportion (FDP).
```{r}
tpFpTable <- group_by(inference, contrast) |>
filter(adjPval < alpha) |>
summarise("TP" = sum(species == "ups"),
"FP" = sum(species != "ups"),
"FDP" = mean(species != "ups"))
tpFpTable
```
## TPR - FDP curves
We generate the TPR-FDP curves to assess the performance of the different workflows to prioritise differentially abundant proteins. Again, these curves are built using the ground truth information about which proteins are differentially abundant (spiked in) and which proteins are constant across samples. We create two functions to compute the TPR and the FDP.
```{r}
computeFDP <- function(pval, tp) {
ord <- order(pval)
fdp <- cumsum(!tp[ord]) / 1:length(tp)
fdp[order(ord)]
}
computeTPR <- function(pval, tp, nTP = NULL) {
if (is.null(nTP)) nTP <- sum(tp)
ord <- order(pval)
tpr <- cumsum(tp[ord]) / nTP
tpr[order(ord)]
}
```
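As a quick sanity check of the helpers above, we can apply them to a toy example of four p-values where the two smallest belong to true positives:

```{r}
# Toy check of computeFDP()/computeTPR(): 4 features, the two smallest
# p-values are the true positives.
pval <- c(0.001, 0.20, 0.01, 0.80)
tp   <- c(TRUE, FALSE, TRUE, FALSE)
computeFDP(pval, tp) # 0, 1/3, 0, 1/2
computeTPR(pval, tp) # 0.5, 1, 1, 1
```

At the cut-off that includes only the two true positives, the FDP is 0 and the TPR is 1, as expected.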
We apply these functions and compute the corresponding metric using the statistical inference results and the ground truth information.
```{r}
performance <- inference |>
na.exclude() |>
mutate(tpr = computeTPR(pval, species == "ups"),
fdp = computeFDP(pval, species == "ups")) |>
arrange(fdp)
```
We also highlight the observed FDP at the $\alpha = 5\%$ FDR threshold.
```{r}
workPoints <- performance |>
group_by(contrast) |>
filter(adjPval < 0.05) |>
slice_max(pval)
```
```{r}
ggplot(performance) +
aes(
y = fdp,
x = tpr,
) +
geom_line() +
geom_point(data = workPoints, size = 3) +
geom_hline(yintercept = 0.05, linetype = 2) +
coord_flip(ylim = c(0, 0.2)) +
theme_minimal()
```
## Volcano plots
We make a volcano plot to which we add the species information.
```{r}
inference |>
plot_volcano() +
aes(color = species, shape = adjPval < alpha) +
scale_color_manual(
values = c("grey20", "firebrick"),
name = "species",
labels = c("yeast", "ups")
) +
labs(shape="FDR < 0.05") +
geom_vline(aes(xintercept = logFC), data = realLogFC , col ="firebrick") +
geom_hline(aes(yintercept = -log10(adjAlpha)), data = inference |>
summarize(adjAlpha = alpha *mean(adjPval < alpha, na.rm=TRUE), .groups="drop"))
```