Skip to content

Commit 2a8009d

Browse files
committed
test push
1 parent 7e4c00c commit 2a8009d

File tree

2 files changed

+589
-0
lines changed

2 files changed

+589
-0
lines changed
Lines changed: 276 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,276 @@
1+
---
2+
title: "Introduction to Bulk RNAseq data analysis"
3+
subtitle: Differential Expression of RNA-seq data - Part 1
4+
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
5+
output:
6+
html_document:
7+
toc: yes
8+
pdf_document:
9+
toc: yes
10+
bibliography: ref.bib
11+
---
12+
13+
```{r setup, echo=FALSE}
14+
options(tibble.print_max = 4, tibble.print_min = 4, max.print=40,
15+
tibble.max_extra_cols=2)
16+
```
17+
18+
## Load the data
19+
20+
In the previous session we read the results from Salmon into R and created a
21+
`txi` object, which we then saved into an "rds" file. We can now load the txi
22+
from that file to start the differential expression analysis. We will also need
23+
the sample meta data sheet
24+
25+
First load the packages we need.
26+
27+
```{r message = FALSE}
28+
library(DESeq2)
29+
library(tidyverse)
30+
```
31+
32+
Now load the data from the earlier session.
33+
34+
```{r loadData}
35+
txi <- readRDS("RObjects/txi.rds")
36+
sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types="cccc")
37+
```
38+
39+
It is important to be sure that the order of the samples in rows in the sample
40+
meta data table matches the order of the columns in the data matrix - `DESeq2`
41+
will **not** check this. If the order does not match you will not be running the
42+
analyses that you think you are.
43+
44+
```{r checkSampleNames}
45+
all(colnames(txi$counts)==sampleinfo$SampleName)
46+
```
47+
48+
# The model formula and design matrices
49+
50+
Now that we are happy that the quality of the data looks good, we can proceed to
51+
testing for differentially expressed genes. There are a number of packages to
52+
analyse RNA-Seq data. Most people use
53+
[DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
54+
[@Love2014] or
55+
[edgeR](http://bioconductor.org/packages/release/bioc/html/edgeR.html)
56+
[@Robinson2010; @McCarthy2012]. There is also the option to use the
57+
[limma](https://bioconductor.org/packages/release/bioc/html/limma.html) package
58+
and transform the counts using its `voom` function .They are all equally valid
59+
approaches [@Ritchie2015]. There is an informative and honest blog post
60+
[here](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/) by Mike Love,
61+
one of the authors of DESeq2, about deciding which to use.
62+
63+
We will use **DESeq2** for the rest of this practical.
64+
65+
## Create a DESeqDataSet object with the raw data
66+
67+
### Creating the design model formula
68+
69+
First we need to create a design model formula for our analysis. `DESeq2` will
70+
use this to generate the model matrix, as we have seen in the linear models
71+
lecture.
72+
73+
We have two variables in our experiment: "Status" and "Time Point".
74+
75+
We will fit two models under two assumptions: no interaction and interaction of
76+
these two factors, however, to demonstrate the how `DESeq2` is used we will start
77+
with a simple model which considers Status but ignores Time Point.
78+
79+
First, create a variable containing the model using standard R 'formula' syntax.
80+
81+
```{r modelForumla}
82+
simple.model <- as.formula(~ Status)
83+
```
84+
85+
What does this look like as a model matrix?
86+
87+
```{r modelMatrix}
88+
model.matrix(simple.model, data = sampleinfo)
89+
```
90+
91+
The intercept has been set automatically to the group in the factor that is
92+
alphabetically first: `Infected`.
93+
94+
It would be nice if `Uninfected` were the base line/intercept. To get R to
95+
use `Uninfected` as the intercept we need to use a `factor`. Let's set factor
96+
levels on Status to use `Uninfected` as the intercept.
97+
98+
```{r setFactors}
99+
sampleinfo <- mutate(sampleinfo, Status = fct_relevel(Status, "Uninfected"))
100+
model.matrix(simple.model, data = sampleinfo)
101+
```
102+
103+
# Build a DESeq2DataSet
104+
105+
We don't actually need to pass `DESeq2` the model matrix, instead we pass it the
106+
design formula and the `sampleinfo` it will build the matrix itself.
107+
108+
```{r makeDDSObj}
109+
# create the DESeqDataSet object
110+
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
111+
colData = sampleinfo,
112+
design = simple.model)
113+
```
114+
115+
When we summarised the counts to gene level, `tximport` also calculated an
116+
average transcript length for each gene for each sample. For a given gene the
117+
average transcript length may vary between samples if different samples are
118+
using alternative transcripts. `DESeq2` will incorporate this into its
119+
"normalisation".
120+
121+
# Filter out the unexpressed genes
122+
123+
Just as we did in session 7, we should filter out genes that uninformative.
124+
125+
```{r}
126+
keep <- rowSums(counts(ddsObj.raw)) > 5
127+
ddsObj.filt <- ddsObj.raw[keep,]
128+
```
129+
130+
# Differential expression analysis with DESeq2
131+
132+
## The `DESeq2` work flow
133+
134+
The main `DESeq2` work flow is carried out in 3 steps:
135+
136+
### `estimateSizeFactors`
137+
138+
First, Calculate the "median ratio" normalisation size factors for each sample
139+
and adjust for average transcript length on a per gene per sample basis.
140+
141+
```{r commonSizeFactors}
142+
ddsObj <- estimateSizeFactors(ddsObj.filt)
143+
```
144+
145+
#### Let's have a look at what that did
146+
147+
`DESeq2` has calculated a normalizsation factor for each gene for each sample.
148+
149+
```{r}
150+
normalizationFactors(ddsObj.filt)
151+
normalizationFactors(ddsObj)
152+
```
153+
154+
We can use `plotMA` from `limma` to look at the of these normalisation factors
155+
on data in an MA plot. Let's look at **SRR7657882**, the fifth column, which has
156+
the largest normalisation factors.
157+
158+
```{r}
159+
logcounts <- log2(counts(ddsObj, normalized=FALSE) + 1)
160+
161+
limma::plotMA(logcounts, array = 5, ylim =c(-5, 5))
162+
abline(h=0, col="red")
163+
```
164+
165+
```{r}
166+
logNormalizedCounts <- log2(counts(ddsObj, normalized=TRUE) + 1)
167+
168+
limma::plotMA(logNormalizedCounts, array=5, ylim =c(-5, 5))
169+
abline(h=0, col="red")
170+
```
171+
172+
DESeq2 doesn't actually normalise the counts, it uses raw counts and includes
173+
the normalisation factors in the modeling as an "offset". Please see the DESeq2
174+
documentation if you'd like more details on exactly how they are incorporated
175+
into the algorithm. For practical purposes we can think of it as a normalisation.
176+
177+
### `estimateDispersions`
178+
179+
Next we need to estimate the dispersion parameters for each gene.
180+
181+
```{r genewiseDispersion}
182+
ddsObj <- estimateDispersions(ddsObj)
183+
```
184+
185+
We can plot all three sets of dispersion estimates. It is particularly important
186+
to do this if you change any of the default parameters for this step.
187+
188+
```{r plotDisp}
189+
plotDispEsts(ddsObj)
190+
```
191+
192+
193+
### `nbinomWaldTest`
194+
195+
Finally, apply Negative Binomial GLM fitting and calculate Wald statistics.
196+
197+
```{r applyGLM}
198+
ddsObj <- nbinomWaldTest(ddsObj)
199+
```
200+
201+
## The `DESeq` command
202+
203+
In practice the 3 steps above can be performed in a single step using the
204+
`DESeq` wrapper function. Performing the three steps separately is useful if you
205+
wish to alter the default parameters of one or more steps, otherwise the `DESeq`
206+
function is fine.
207+
208+
```{r theShortVersion}
209+
ddsObj <- DESeq(ddsObj.filt)
210+
```
211+
212+
## Generate a results table
213+
214+
We can generate a table of differential expression results from the DDS object
215+
using the `results` function of DESeq2.
216+
217+
```{r resultsTable}
218+
results.simple <- results(ddsObj, alpha=0.05)
219+
results.simple
220+
```
221+
222+
How many genes have an adjusted p-value of less than 0.05?
223+
224+
```{r countSigGenes}
225+
sum(results.simple$padj < 0.05)
226+
```
227+
228+
229+
### Independent filtering
230+
231+
You will notice that some of the adjusted p-values (`padj`) are NA.
232+
233+
```{r}
234+
sum(is.na(results.simple$padj))
235+
```
236+
237+
Remember
238+
in Session 7 we said that there is no need to pre-filter the genes as DESeq2
239+
will do this through a process it calls 'independent filtering'. The genes
240+
with `NA` are the ones `DESeq2` has filtered out.
241+
242+
From `DESeq2` manual:
243+
"The results function of the `DESeq2` package performs independent filtering by
244+
default using the mean of normalized counts as a filter statistic. A threshold
245+
on the filter statistic is found which optimizes the number of adjusted p values
246+
lower than a [specified] significance level".
247+
248+
The default significance level for independent filtering is `0.1`, however, you
249+
should set this to the FDR cut off you are planning to use. We will use `0.05` -
250+
this was the purpose of the `alpha` argument in the previous command.
251+
252+
How many genes have an adjusted p-value of less than 0.05?
253+
254+
```{r countSigGenes2}
255+
sum(results.simple$padj < 0.05, na.rm = TRUE)
256+
```
257+
258+
How many genes are up-regulated?
259+
260+
```{r countSigUpGenes2}
261+
sum(results.simple$padj < 0.05 & results.simple$log2FoldChange > 0, na.rm = TRUE)
262+
```
263+
264+
How many genes are down-regulated?
265+
266+
```{r countSigDownGenes2}
267+
sum(results.simple$padj < 0.05 & results.simple$log2FoldChange < 0, na.rm = TRUE)
268+
```
269+
270+
### Exercise 1
271+
272+
Now we have made our results table using our simple model, how many genes are:
273+
274+
a) significantly up-regulated?
275+
276+
b) significantly down-regulated?

0 commit comments

Comments
 (0)