---
title:  Tutorial on data processing and msqrob2 analysis of experiments with simple designs
---


A quantitative proteomics analysis yields precursor, peptide and/or protein abundances in the different samples. In this tutorial we introduce a generic workflow for the differential analysis of quantitative datasets with simple experimental designs.

To extract relevant information from these high-throughput datasets, we will use our [msqrob2](https://www.bioconductor.org/packages/release/bioc/html/msqrob2.html) software tool.

# Staes spike-in study (DIA-NN output)

We will use a publicly available spike-in study published by Staes et al. [@Staes2024].
They spiked digested UPS proteins into a digested yeast background at the following yeast:UPS ratios: 10:1, 10:2, 10:4, 10:8 and 10:10.
Here we will use a subset of the data, i.e. dilutions 10:2 and 10:4.

We will use the output of the search engine DIA-NN 2.2.0.
For this DIA-NN version, the main search output is stored in the `report.parquet` file in the DIA-NN output directory. In this repository it can be found under `data/spikein24-staesetal2024.parquet`.
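
A minimal sketch of how this file could be read and inspected is given below. It assumes that the `arrow` package is installed and that the working directory is the root of the repository. The `quantCols` vector is a hypothetical helper that simply lists the quantification columns mentioned in this tutorial, and the `file.exists()` guard only makes the chunk safe to run outside the repository.

```{r eval = FALSE}
# Path given in the text above, relative to the root of the repository
precursorFile <- "data/spikein24-staesetal2024.parquet"

# Quantification columns mentioned in this tutorial (hypothetical helper vector)
quantCols <- c("Ms1.Area", "Ms1.Normalised", "Precursor.Quantity",
               "Precursor.Normalised", "PG.MaxLFQ")

if (file.exists(precursorFile)) {
  # read_parquet() returns the precursor table as a data frame (tibble)
  precursors <- arrow::read_parquet(precursorFile)
  # Show only the quantification columns that are present in this report
  str(precursors[, intersect(quantCols, colnames(precursors))])
}
```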

DIA-NN provides multiple quantifications, e.g. derived from the MS1 or MS2 spectra, and at the precursor or protein (protein group) level. The term 'precursor' refers to a charged peptide species and is the basic unit of identification and quantification in DIA. Hence, for DIA we speak of a precursor table, whereas for DDA we would speak of a PSM table.

Examples of different quantities are:

- Precursor level:
  - `Ms1.Area`: raw MS1 area
  - `Ms1.Normalised`: normalised MS1 area
  - `Precursor.Quantity`: MS2 precursor quantities
  - `Precursor.Normalised`: normalised MS2 precursor quantities
  - etc.
- Protein (protein group) level:
  - `PG.MaxLFQ`: MS2-based summary


[1.a] Participants can perform an analysis using the [Rmarkdown script](./staes-median-maxLFQ.html). Here, we will use the `Precursor.Quantity` column. Follow the steps in the script and try to understand each of the analysis steps. We know the true fold change (FC) for the spike-in proteins and the yeast proteins (see the description of the data). What do you observe?

[1.b] Repeat the analysis and change the normalisation by using normalisation factors based on (i) the total intensity (`nf_log_tot` function, called as `nf_log_tot(qf, "precursors_log")`) and (ii) the median-of-ratios method (`nf_log_medrat` function, called as `nf_log_medrat(qf, "precursors_log")`). What do you observe? Try to explain this.
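
To make the three types of normalisation factors concrete, the sketch below computes them on a toy log2 intensity matrix with base R only. The definitions and the `centre` helper are assumptions made for illustration; the `nf_log_med`, `nf_log_tot` and `nf_log_medrat` functions in the script may be implemented differently.

```{r eval = FALSE}
# Toy log2 intensities: 4 precursors (rows) x 3 samples (columns).
# Relative to s1, sample s2 is shifted by +1 and s3 by -0.5;
# the most intense precursor is missing in s3.
x <- cbind(
  s1 = c(10, 12, 14, 16),
  s2 = c(11, 13, 15, 17),
  s3 = c(9.5, 11.5, 13.5, NA)
)
rownames(x) <- paste0("prec", 1:4)

# Hypothetical helper: centre the factors so that they average to zero
centre <- function(nf) nf - mean(nf)

# (1) Median: per-sample median of the log2 intensities
nf_med <- centre(apply(x, 2, median, na.rm = TRUE))

# (2) Total intensity: log2 of the per-sample sum on the original scale
nf_tot <- centre(log2(colSums(2^x, na.rm = TRUE)))

# (3) Median of ratios: per-sample median log-ratio to a pseudo-reference,
#     here the row mean of the log2 intensities of fully observed precursors
complete <- stats::complete.cases(x)
ref <- rowMeans(x[complete, , drop = FALSE])
nf_medrat <- centre(apply(x[complete, , drop = FALSE] - ref, 2, median))

# Normalise by subtracting the sample-specific factor from every column
x_norm <- sweep(x, 2, nf_medrat, "-")
```

In this toy example the median and total-intensity factors of `s3` are pulled down by the missing precursor, whereas the median-of-ratios factor, which is computed on the fully observed precursors only, is not.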

[1.c] Repeat the analysis starting from the `Precursor.Normalised` column. What do you observe? Are all data processing steps needed?

[1.d] Repeat the analysis, again starting from the basic script with median normalisation factors (`nf_log_med` function). First change the summarisation in the `aggregateFeatures` function by replacing maxLFQ with median polish (`fun = MsCoreUtils::medianPolish`), then use simple median summarisation (`fun = matrixStats::colMedians`). Note that median polish and median summarisation need the additional argument `na.rm = TRUE` to handle missing values, e.g. replace `fun = function(X) iq::maxLFQ(X)$estimate` with `fun = MsCoreUtils::medianPolish, na.rm = TRUE`. What do you observe? Try to explain this.
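
The difference between simple median summarisation and median polish can be sketched on a toy protein with base R only. This is an illustration under assumptions: it uses `stats::medpolish` rather than the `MsCoreUtils` and `matrixStats` functions named above, it does not cover maxLFQ, and the data are made up.

```{r eval = FALSE}
# Toy log2 intensities: 4 precursors (rows) of one protein in 3 samples (columns).
# Relative to s1, sample s2 is shifted by +1 and s3 by -1; two values are missing.
x <- cbind(
  s1 = c(10, 12, 14, 16),
  s2 = c(11, 13, 15, NA),
  s3 = c(9, 11, NA, 15)
)

# Simple median summarisation: column-wise median, ignoring missing values
med_summary <- apply(x, 2, median, na.rm = TRUE)

# Median polish: fit log2 intensity = overall + precursor effect + sample effect
# and use overall + sample effect as the protein-level summary
mp <- stats::medpolish(x, na.rm = TRUE, trace.iter = FALSE)
mp_summary <- mp$overall + mp$col

# Estimated differences between samples for both summaries
med_summary - med_summary["s1"]
mp_summary - mp_summary["s1"]
```

Because median polish accounts for precursor-specific effects, its sample differences in this toy example match the shifts that were built in, whereas the plain median depends on which precursors happen to be observed in each sample.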


# Staes spike-in study - Spectronaut output

Redo the data analysis for the Spectronaut output file.

- [Spectronaut - Raw MS2](01-dataprocessing-spectronaut.qmd)
- [Spectronaut - Processed MS2](01-dataprocessing-spectronaut-pn.qmd)


# Mouse Diet study (Spectronaut)

The PXD059421 data deposited on ProteomeXchange stem from a study of the molecular effects of dietary DINCH exposure on the proteome, phosphoproteome and acetylome profiles of visceral (VIS) and subcutaneous (SC) adipose tissue in a model of diet-induced obesity in male and female C57BL/6N mice. The study includes data on visceral and subcutaneous adipose tissue of female and male mice that were fed either a standard plant-based diet (chow), a standard high-fat diet (HFD) or one of two HFDs supplemented with DINCH (4,500 ppm and 15,000 ppm). Three female and three male mice were used for each diet [@AldehoffEtAl2025].

The data were downloaded from PRIDE and reprocessed using Spectronaut.
Here, we will focus on the data from the proteome MS runs.
The Spectronaut output can be imported from the following parquet file:

```{r eval = FALSE}
precursorFile = "https://github.com/statOmics/PDA-DIA/raw/refs/heads/main/data/mouseDiet-spectronaut.parquet"
```
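
One possible way to load a remote parquet file is to download it to a temporary file first and then read it. The sketch below assumes that the `arrow` package is installed; `read_remote_parquet` is a hypothetical helper that is not part of the tutorial scripts.

```{r eval = FALSE}
# Hypothetical helper: download a remote parquet file to a temporary file and read it
read_remote_parquet <- function(url) {
  localFile <- tempfile(fileext = ".parquet")
  # mode = "wb" keeps the binary file intact on Windows
  download.file(url, destfile = localFile, mode = "wb")
  arrow::read_parquet(localFile)
}
```

With the URL defined above, the precursor table could then be obtained with `precursors <- read_remote_parquet(precursorFile)`.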

Note that no `EG_PrecursorId` variable was exported in the output. The precursor ID, however, can easily be constructed from `EG_ModifiedSequence` and `FG_Charge`.

```{r eval = FALSE}
library(dplyr)
# Build the precursor ID by pasting the modified sequence and the charge state
precursors <- precursors |>
  mutate(EG_PrecursorId = paste0(EG_ModifiedSequence, FG_Charge))
```

You only have to do the data processing (up to section 5.5 in the [basic script](./staes-median-maxLFQ.html)). We will continue with the modeling in the [tutorial Data modeling](03-tutorial-Design.html).

Intermediate: [mouse-diet-upto-filtering.html](./mouse-diet-upto-filtering.html)

# Van Leene et al. study

These data are a subset of the data from [@VanLeeneEtal2026], a limited proteolysis (LiP) experiment with rapamycin treatment.
The treatment consists of either 10 μM rapamycin (Sigma-Aldrich) in 0.1% dimethyl sulfoxide (DMSO; Sigma-Aldrich) or 0.1% DMSO as vehicle control.
In a LiP analysis, limited proteolysis samples are typically taken together with trypsin control samples. The former are used to prioritise precursors that indicate conformational dynamics. The latter are used as control samples to correct for overall changes in the abundance of the entire proteome.

Here, we will only use the trypsin control (TC) samples and perform a differential abundance analysis at the protein level.
Analyse the data to prioritise proteins that are differentially abundant (DA) between rapamycin and control.
Use either the DIA-NN or the Spectronaut output.
Note that the TC samples were searched together with the LiP samples.
Retention time normalisation therefore does not seem advisable, as the proteome complexity of LiP and TC samples is expected to differ across retention time.

The DIA-NN parquet file can be found online:

```{r eval=FALSE}
precursorFile = "https://github.com/statOmics/PDA-DIA/raw/refs/heads/main/data/rapamicin-diann.parquet"
```

The Spectronaut parquet file can be found online (**the parquet file only includes a limited number of variables**):

```{r eval=FALSE}
precursorFile = "https://github.com/statOmics/PDA-DIA/raw/refs/heads/main/data/rapamicin-spectronaut.parquet"
```


# References