Skip to content

Commit 8afb658

Browse files
committed
Vignette: Typos, grammar and readability improvements.
1 parent ab3a168 commit 8afb658

File tree

1 file changed

+80
-69
lines changed

1 file changed

+80
-69
lines changed

vignettes/pmp_vignette_peak_matrix_processing_for_metabolomics_datasets.Rmd

Lines changed: 80 additions & 69 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
title: "Peak Matrix Processing for metabolomics datasets"
33
author:
44
name: "Andris Jankevics"
5-
affiliation: Phenome Centre Birmingham, University of Birmingham
5+
affiliation: Phenome Centre Birmingham, University of Birmingham, UK
66
77

88
package: pmp
@@ -30,8 +30,34 @@ knitr::opts_chunk$set(
3030
)
3131
```
3232

33+
# Introduction
34+
35+
Metabolomics data (pre-)processing workflows consist of multiple steps
36+
including peak picking, quality assessment, missing value imputation,
37+
normalisation and scaling. Several software solutions (commercial and
38+
open-source) are available for raw data processing, including r-package XCMS,
39+
to generate processed outputs in the form of a two dimensional data matrix.
40+
41+
These outputs contain hundreds or thousands of so called "uninformative" or
42+
"irreproducible" features. Such features could strongly hinder outputs of
43+
subsequent statistical analysis, biomarker discovery or metabolic pathway
44+
inference. Common practice is to apply peak matrix validation and filtering
45+
procedures as described in @guida2016, @broadhurst2018 and @schiffman2019.
46+
47+
Functions within the `pmp` (Peak Matrix Processing) package are designed to
48+
help users to prepare data for further statistical data analysis in a fast,
49+
easy to use and reproducible manner.
50+
51+
This vignette showcases a range of commonly applied Peak Matrix Processing
52+
steps for metabolomics datasets.
53+
3354
# Installation
3455

56+
You should have R version 4.0.0 or above and Rstudio installed to be able to
57+
run this notebook.
58+
59+
Execute following commands from the R terminal.
60+
3561
```{r eval=FALSE, include=TRUE}
3662
if (!requireNamespace("BiocManager", quietly = TRUE))
3763
install.packages("BiocManager")
@@ -44,48 +70,27 @@ library(SummarizedExperiment)
4470
library(S4Vectors)
4571
```
4672

47-
# Introduction
48-
49-
Metabolomics data processing workflows consist of multiple steps including peak
50-
picking or raw data processing, quality assurance, missing value
51-
imputation, normalisation and scaling. Several tools (commercial,
52-
R and non-R based) are commonly used for raw data processing
53-
which generate outputs in the form of a two dimensional data matrix and meta
54-
data.
55-
56-
These outputs contain hundreds or thousands of so called uninformative or
57-
unreproducible features. Such features could strongly hinder outputs of
58-
subsequent statistical analysis, biomarker discovery or metabolic
59-
pathway inference. Common practice is to apply peak matrix validation and
60-
filtering procedures as described in @guida2016, @broadhurst2018 and
61-
@schiffman2019.
62-
63-
Functions within `pmp` package are designed to help users to prepare data for
64-
further statistical data analysis in fast, easy to use and reproducible manner.
65-
66-
This document showcases the commonly used peak matrix processing steps of
67-
metabolomics datasets.
68-
6973
# Data formats
7074

71-
Recent review for R packages in metabolomics
72-
[@stanstrup2019] covers a broad range of heterogenous tools is availiable as
73-
part of `Bioconductor` sofware collection or on `CRAN`, `Github` and similar
74-
public repositories. `pmp` package utilises `r Biocpkg("SummarizedExperiment")`
75-
class from Bioconductor for data input and output.
75+
Recently a review by [@stanstrup2019] reported and discussed a broad range of
76+
heterogeneous R tools and packages that are available via `Bioconductor`,
77+
`CRAN`, `Github` and similar public repositories.
78+
79+
`pmp` package utilises `r Biocpkg("SummarizedExperiment")` class from
80+
Bioconductor for data input and output.
7681

7782
For example, outputs from widely used `r Biocpkg("xcms")` package can be
78-
relatively easy converted to `SummarizedExperiment` object using functions
79-
`featureDefinitions`, `featureValues` and `pData` on `xcms` output object.
83+
converted to a `SummarizedExperiment` object using functions
84+
`featureDefinitions`, `featureValues` and `pData` on the `xcms` output object.
8085

81-
Additioanlly `pmp` supports to input data to be any matrix-like `R` data
82-
structure (e.g. and ordinary matrix, a data frame). If input if a matrix-like
83-
structure tools from `pmp` package will perform several checks for data
84-
integrity as well. Please see section \@ref(endomorphisms) for more details.
86+
Additionally `pmp` also supports any matrix-like `R` data
87+
structure (e.g. an ordinary matrix, a data frame) as an input. If the input is
88+
a matrix-like structure `pmp` will perform several checks for data integrity.
89+
Please see section \@ref(endomorphisms) for more details.
8590

8691
# Example dataset, MTBLS79
8792

88-
In this tutorial we will be using an direct infusion mass spectrometry (DIMS)
93+
In this tutorial we will be using a direct infusion mass spectrometry (DIMS)
8994
dataset consisting of 172 samples measured across 8 batches and is included in
9095
`pmp` package as `SummarizedExperiemnt` class object `MTBLS79`.
9196
More detailed description of the dataset is available from @kirwan2014,
@@ -121,11 +126,11 @@ MTBLS79_filtered
121126
sum(is.na(assay(MTBLS79_filtered)))
122127
```
123128

124-
Missing values sample filter has removed two samples from the initial dataset.
125-
Outputs from any `pmp` function can be used as inputs for another function. For
126-
example we can apply missing value filter across features on the output of the
127-
previous command. Command below will filter only within quality control (QC)
128-
sample group.
129+
Missing values sample filter has removed two samples from the dataset.
130+
Outputs from any `pmp` function can be used as inputs for another `pmp`
131+
function. For example we can apply missing value filter across features on the
132+
output of the previous call. The function call below will filter features based
133+
on the quality control (QC) sample group only.
129134

130135
```{r}
131136
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9,
@@ -136,9 +141,9 @@ MTBLS79_filtered
136141
sum(is.na(assay(MTBLS79_filtered)))
137142
```
138143

139-
Similarly as we did before, we can add another filter on previous result. At
140-
this we will use the same filter, but now missing values wil be calculated
141-
across all samples and not only within "QC" group.
144+
We can add another filter on top of the previous result. For this additional
145+
filter we will use the same function call, but this time missing values will
146+
be calculated across all samples and not only within the “QC” group.
142147

143148
```{r}
144149
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9,
@@ -149,11 +154,11 @@ MTBLS79_filtered
149154
sum(is.na(assay(MTBLS79_filtered)))
150155
```
151156

152-
Applying these 3 filters has reduced number of missing values from 18222 to
157+
Applying these 3 filters has reduced the number of missing values from 18222 to
153158
4779.
154159

155-
Commonly used approach in metabolomics studies is to filter features by the by
156-
coefficient of variation (CV) or RSD% of QC samples.Example below will use 30%
160+
Another common filter approach is to filter features by the coefficient of
161+
variation (CV) or RSD% of QC samples. The example shown below will use a 30%
157162
threshold.
158163

159164
```{r}
@@ -167,32 +172,37 @@ sum(is.na(assay(MTBLS79_filtered)))
167172

168173
# Processing history
169174

170-
Every funcition in `pmp` provides history of applied parameter values. If user
171-
has saved outputs from R sessesion, it's easy to check what commands were
172-
executed.
175+
Every function in `pmp` provides a history of parameter values that have been
176+
applied. If a user has saved outputs from an R session, it’s also easy to check
177+
what function calls were executed.
173178

174179
```{r}
175180
processing_history(MTBLS79_filtered)
176181
```
177182

178183
# Data normalisation
179184

180-
Probabilistic quotient normalisation (PQN) and normalisation the the total
181-
signal intensity methods are implemented for normalisation of biological
182-
variability across measured samples. Example below demonstrates how to apply
185+
Next, we will apply probabilistic quotient normalisation (PQN).
183186
PQN method.
184187

185188
```{r}
186189
MTBLS79_pqn_normalised <- pqn_normalisation(df=MTBLS79_filtered,
187190
classes=MTBLS79_filtered$Class, qc_label="QC")
188191
```
189192

193+
normalisation the the total
194+
signal intensity methods are implemented for normalisation of biological
195+
variability across measured samples. Example below demonstrates how to apply
196+
190197
# Missing value imputation
191198

192-
Several commonly used missing value imputation algorithms. Supported methods
193-
are k-nearest neighbours (knn), random forests (rf), Bayesian PCA missing value
194-
estimator (bpca), mean or median value of the given feature and constant
195-
small value. Within `mv_imputaion` interface user can easily apply different
199+
A unified function call for several commonly used missing value imputation
200+
algorithms is also included in pmp. Supported methods are: k-nearest neighbours
201+
(knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean
202+
or median value of the given feature and a constant small value. In the example
203+
below we will apply knn imputation.
204+
205+
Within `mv_imputaion` interface user can easily apply different
196206
mehtod without worrying about input data type or tranposing dataset.
197207

198208
```{r}
@@ -201,19 +211,18 @@ MTBLS79_mv_imputed <- mv_imputation(df=MTBLS79_pqn_normalised,
201211
```
202212

203213
# Data scaling
204-
205-
Variance stabilising generalised logarithm transformation (glog) algorithm is
206-
implimented to help to minimise contributions from unwanted technical
207-
variaton of sample collection.
214+
The generalised logarithm (glog) transformation algorithm is available to
215+
stabilise the variance across low and high intensity mass spectral features.
208216

209217
```{r}
210218
MTBLS79_glog <- glog_transformation(df=MTBLS79_mv_imputed,
211219
classes=MTBLS79_filtered$Class, qc_label="QC")
212220
```
213221

214222
`glog_transformation` function uses QC samples to optimse scaling factor
215-
`lambda`. Using function `glog_plot_plot_optimised_lambda` it's possibe to
216-
visualise if optimsation of the given parameter has converged at the minima.
223+
`lambda`. Using the function `glog_plot_plot_optimised_lambda` it's possible to
224+
visualise if the optimsation of the given parameter has converged at the
225+
minima.
217226

218227
```{r plot_glog}
219228
opt_lambda <-
@@ -225,10 +234,12 @@ glog_plot_optimised_lambda(df=MTBLS79_mv_imputed,
225234

226235
# Data integrity check and endomorphisms {#endomorphisms}
227236

228-
Function in `pmp` package are designed to validate input data if user chose
229-
not to use `r Biocpkg("SummarizedExperiment")` class objet. For example, if
230-
input is `matrix` with features stored in columns and sample in rows, any
231-
function of `pmp` package will be able to handle this object.
237+
Functions in the `pmp` package are designed to validate input data if the user
238+
chooses not to use the `r Biocpkg("SummarizedExperiment")` class object.
239+
240+
For example, if the input `matrix` consists of features stored in columns and
241+
samples in rows or *vice versa*, any function within the `pmp` package will be
242+
able to handle this in the correct manner.
232243

233244
```{r}
234245
peak_matrix <- t(assay(MTBLS79))
@@ -254,9 +265,9 @@ class (rsd_filtered)
254265
dim (rsd_filtered)
255266
```
256267

257-
Note that `pmp` has automatically transposed input object to use largest
258-
dimension as features, while original R data type `matrix` has been retained
259-
also for function output.
268+
Note that `pmp` has automatically transposed the input object to use the
269+
largest dimension as features, while the original R data type `matrix` has been
270+
retained also for the function output.
260271

261272
# Session information
262273

0 commit comments

Comments
 (0)