2
2
title : " Peak Matrix Processing for metabolomics datasets"
3
3
author :
4
4
name : " Andris Jankevics"
5
- affiliation : Phenome Centre Birmingham, University of Birmingham
5
+ affiliation : Phenome Centre Birmingham, University of Birmingham, UK
6
6
7
7
8
8
package : pmp
@@ -30,8 +30,34 @@ knitr::opts_chunk$set(
30
30
)
31
31
```
32
32
33
+ # Introduction
34
+
35
+ Metabolomics data (pre-)processing workflows consist of multiple steps
36
+ including peak picking, quality assessment, missing value imputation,
37
+ normalisation and scaling. Several software solutions (commercial and
38
+ open-source) are available for raw data processing, including r-package XCMS,
39
+ to generate processed outputs in the form of a two dimensional data matrix.
40
+
41
+ These outputs contain hundreds or thousands of so called "uninformative" or
42
+ "irreproducible" features. Such features could strongly hinder outputs of
43
+ subsequent statistical analysis, biomarker discovery or metabolic pathway
44
+ inference. Common practice is to apply peak matrix validation and filtering
45
+ procedures as described in @guida2016 , @broadhurst2018 and @schiffman2019 .
46
+
47
+ Functions within the ` pmp ` (Peak Matrix Processing) package are designed to
48
+ help users to prepare data for further statistical data analysis in a fast,
49
+ easy to use and reproducible manner.
50
+
51
+ This vignette showcases a range of commonly applied Peak Matrix Processing
52
+ steps for metabolomics datasets.
53
+
33
54
# Installation
34
55
56
+ You should have R version 4.0.0 or above and Rstudio installed to be able to
57
+ run this notebook.
58
+
59
+ Execute following commands from the R terminal.
60
+
35
61
``` {r eval=FALSE, include=TRUE}
36
62
if (!requireNamespace("BiocManager", quietly = TRUE))
37
63
install.packages("BiocManager")
@@ -44,48 +70,27 @@ library(SummarizedExperiment)
44
70
library(S4Vectors)
45
71
```
46
72
47
- # Introduction
48
-
49
- Metabolomics data processing workflows consist of multiple steps including peak
50
- picking or raw data processing, quality assurance, missing value
51
- imputation, normalisation and scaling. Several tools (commercial,
52
- R and non-R based) are commonly used for raw data processing
53
- which generate outputs in the form of a two dimensional data matrix and meta
54
- data.
55
-
56
- These outputs contain hundreds or thousands of so called uninformative or
57
- unreproducible features. Such features could strongly hinder outputs of
58
- subsequent statistical analysis, biomarker discovery or metabolic
59
- pathway inference. Common practice is to apply peak matrix validation and
60
- filtering procedures as described in @guida2016 , @broadhurst2018 and
61
- @schiffman2019 .
62
-
63
- Functions within ` pmp ` package are designed to help users to prepare data for
64
- further statistical data analysis in fast, easy to use and reproducible manner.
65
-
66
- This document showcases the commonly used peak matrix processing steps of
67
- metabolomics datasets.
68
-
69
73
# Data formats
70
74
71
- Recent review for R packages in metabolomics
72
- [ @stanstrup2019 ] covers a broad range of heterogenous tools is availiable as
73
- part of ` Bioconductor ` sofware collection or on ` CRAN ` , ` Github ` and similar
74
- public repositories. ` pmp ` package utilises ` r Biocpkg("SummarizedExperiment") `
75
- class from Bioconductor for data input and output.
75
+ Recently a review by [ @stanstrup2019 ] reported and discussed a broad range of
76
+ heterogeneous R tools and packages that are available via ` Bioconductor ` ,
77
+ ` CRAN ` , ` Github ` and similar public repositories.
78
+
79
+ ` pmp ` package utilises ` r Biocpkg("SummarizedExperiment") ` class from
80
+ Bioconductor for data input and output.
76
81
77
82
For example, outputs from widely used ` r Biocpkg("xcms") ` package can be
78
- relatively easy converted to ` SummarizedExperiment ` object using functions
79
- ` featureDefinitions ` , ` featureValues ` and ` pData ` on ` xcms ` output object.
83
+ converted to a ` SummarizedExperiment ` object using functions
84
+ ` featureDefinitions ` , ` featureValues ` and ` pData ` on the ` xcms ` output object.
80
85
81
- Additioanlly ` pmp ` supports to input data to be any matrix-like ` R ` data
82
- structure (e.g. and ordinary matrix, a data frame). If input if a matrix-like
83
- structure tools from ` pmp ` package will perform several checks for data
84
- integrity as well. Please see section \@ ref(endomorphisms) for more details.
86
+ Additionally ` pmp ` also supports any matrix-like ` R ` data
87
+ structure (e.g. an ordinary matrix, a data frame) as an input. If the input is
88
+ a matrix-like structure ` pmp ` will perform several checks for data integrity.
89
+ Please see section \@ ref(endomorphisms) for more details.
85
90
86
91
# Example dataset, MTBLS79
87
92
88
- In this tutorial we will be using an direct infusion mass spectrometry (DIMS)
93
+ In this tutorial we will be using a direct infusion mass spectrometry (DIMS)
89
94
dataset consisting of 172 samples measured across 8 batches and is included in
90
95
` pmp ` package as ` SummarizedExperiemnt ` class object ` MTBLS79 ` .
91
96
More detailed description of the dataset is available from @kirwan2014 ,
@@ -121,11 +126,11 @@ MTBLS79_filtered
121
126
sum(is.na(assay(MTBLS79_filtered)))
122
127
```
123
128
124
- Missing values sample filter has removed two samples from the initial dataset.
125
- Outputs from any ` pmp ` function can be used as inputs for another function. For
126
- example we can apply missing value filter across features on the output of the
127
- previous command. Command below will filter only within quality control (QC)
128
- sample group.
129
+ Missing values sample filter has removed two samples from the dataset.
130
+ Outputs from any ` pmp ` function can be used as inputs for another ` pmp `
131
+ function. For example we can apply missing value filter across features on the
132
+ output of the previous call. The function call below will filter features based
133
+ on the quality control (QC) sample group only .
129
134
130
135
``` {r}
131
136
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9,
@@ -136,9 +141,9 @@ MTBLS79_filtered
136
141
sum(is.na(assay(MTBLS79_filtered)))
137
142
```
138
143
139
- Similarly as we did before, we can add another filter on previous result. At
140
- this we will use the same filter , but now missing values wil be calculated
141
- across all samples and not only within "QC" group.
144
+ We can add another filter on top of the previous result. For this additional
145
+ filter we will use the same function call , but this time missing values will
146
+ be calculated across all samples and not only within the “QC” group.
142
147
143
148
``` {r}
144
149
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9,
@@ -149,11 +154,11 @@ MTBLS79_filtered
149
154
sum(is.na(assay(MTBLS79_filtered)))
150
155
```
151
156
152
- Applying these 3 filters has reduced number of missing values from 18222 to
157
+ Applying these 3 filters has reduced the number of missing values from 18222 to
153
158
4779 .
154
159
155
- Commonly used approach in metabolomics studies is to filter features by the by
156
- coefficient of variation (CV) or RSD% of QC samples.Example below will use 30%
160
+ Another common filter approach is to filter features by the coefficient of
161
+ variation (CV) or RSD% of QC samples. The example shown below will use a 30%
157
162
threshold.
158
163
159
164
``` {r}
@@ -167,32 +172,37 @@ sum(is.na(assay(MTBLS79_filtered)))
167
172
168
173
# Processing history
169
174
170
- Every funcition in ` pmp ` provides history of applied parameter values. If user
171
- has saved outputs from R sessesion , it's easy to check what commands were
172
- executed.
175
+ Every function in ` pmp ` provides a history of parameter values that have been
176
+ applied. If a user has saved outputs from an R session , it’s also easy to check
177
+ what function calls were executed.
173
178
174
179
``` {r}
175
180
processing_history(MTBLS79_filtered)
176
181
```
177
182
178
183
# Data normalisation
179
184
180
- Probabilistic quotient normalisation (PQN) and normalisation the the total
181
- signal intensity methods are implemented for normalisation of biological
182
- variability across measured samples. Example below demonstrates how to apply
185
+ Next, we will apply probabilistic quotient normalisation (PQN).
183
186
PQN method.
184
187
185
188
``` {r}
186
189
MTBLS79_pqn_normalised <- pqn_normalisation(df=MTBLS79_filtered,
187
190
classes=MTBLS79_filtered$Class, qc_label="QC")
188
191
```
189
192
193
+ normalisation the the total
194
+ signal intensity methods are implemented for normalisation of biological
195
+ variability across measured samples. Example below demonstrates how to apply
196
+
190
197
# Missing value imputation
191
198
192
- Several commonly used missing value imputation algorithms. Supported methods
193
- are k-nearest neighbours (knn), random forests (rf), Bayesian PCA missing value
194
- estimator (bpca), mean or median value of the given feature and constant
195
- small value. Within ` mv_imputaion ` interface user can easily apply different
199
+ A unified function call for several commonly used missing value imputation
200
+ algorithms is also included in pmp. Supported methods are: k-nearest neighbours
201
+ (knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean
202
+ or median value of the given feature and a constant small value. In the example
203
+ below we will apply knn imputation.
204
+
205
+ Within ` mv_imputaion ` interface user can easily apply different
196
206
mehtod without worrying about input data type or tranposing dataset.
197
207
198
208
``` {r}
@@ -201,19 +211,18 @@ MTBLS79_mv_imputed <- mv_imputation(df=MTBLS79_pqn_normalised,
201
211
```
202
212
203
213
# Data scaling
204
-
205
- Variance stabilising generalised logarithm transformation (glog) algorithm is
206
- implimented to help to minimise contributions from unwanted technical
207
- variaton of sample collection.
214
+ The generalised logarithm (glog) transformation algorithm is available to
215
+ stabilise the variance across low and high intensity mass spectral features.
208
216
209
217
``` {r}
210
218
MTBLS79_glog <- glog_transformation(df=MTBLS79_mv_imputed,
211
219
classes=MTBLS79_filtered$Class, qc_label="QC")
212
220
```
213
221
214
222
` glog_transformation ` function uses QC samples to optimse scaling factor
215
- ` lambda ` . Using function ` glog_plot_plot_optimised_lambda ` it's possibe to
216
- visualise if optimsation of the given parameter has converged at the minima.
223
+ ` lambda ` . Using the function ` glog_plot_plot_optimised_lambda ` it's possible to
224
+ visualise if the optimsation of the given parameter has converged at the
225
+ minima.
217
226
218
227
``` {r plot_glog}
219
228
opt_lambda <-
@@ -225,10 +234,12 @@ glog_plot_optimised_lambda(df=MTBLS79_mv_imputed,
225
234
226
235
# Data integrity check and endomorphisms {#endomorphisms}
227
236
228
- Function in ` pmp ` package are designed to validate input data if user chose
229
- not to use ` r Biocpkg("SummarizedExperiment") ` class objet. For example, if
230
- input is ` matrix ` with features stored in columns and sample in rows, any
231
- function of ` pmp ` package will be able to handle this object.
237
+ Functions in the ` pmp ` package are designed to validate input data if the user
238
+ chooses not to use the ` r Biocpkg("SummarizedExperiment") ` class object.
239
+
240
+ For example, if the input ` matrix ` consists of features stored in columns and
241
+ samples in rows or * vice versa* , any function within the ` pmp ` package will be
242
+ able to handle this in the correct manner.
232
243
233
244
``` {r}
234
245
peak_matrix <- t(assay(MTBLS79))
@@ -254,9 +265,9 @@ class (rsd_filtered)
254
265
dim (rsd_filtered)
255
266
```
256
267
257
- Note that ` pmp ` has automatically transposed input object to use largest
258
- dimension as features, while original R data type ` matrix ` has been retained
259
- also for function output.
268
+ Note that ` pmp ` has automatically transposed the input object to use the
269
+ largest dimension as features, while the original R data type ` matrix ` has been
270
+ retained also for the function output.
260
271
261
272
# Session information
262
273
0 commit comments