|
| 1 | +--- |
| 2 | +title: "Introduction to Bulk RNAseq data analysis" |
| 3 | +subtitle: Differential Expression of RNA-seq data - Part 1 |
| 4 | +date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' |
| 5 | +output: |
| 6 | + html_document: |
| 7 | + toc: yes |
| 8 | + pdf_document: |
| 9 | + toc: yes |
| 10 | +bibliography: ref.bib |
| 11 | +--- |
| 12 | + |
| 13 | +```{r setup, echo=FALSE} |
| 14 | +options(tibble.print_max = 4, tibble.print_min = 4, max.print=40, |
| 15 | + tibble.max_extra_cols=2) |
| 16 | +``` |
| 17 | + |
| 18 | +## Load the data |
| 19 | + |
| 20 | +In the previous session we read the results from Salmon into R and created a |
| 21 | +`txi` object, which we then saved into an "rds" file. We can now load the txi |
| 22 | +from that file to start the differential expression analysis. We will also need |
| 23 | +the sample meta data sheet |
| 24 | + |
| 25 | +First load the packages we need. |
| 26 | + |
| 27 | +```{r message = FALSE} |
| 28 | +library(DESeq2) |
| 29 | +library(tidyverse) |
| 30 | +``` |
| 31 | + |
| 32 | +Now load the data from the earlier session. |
| 33 | + |
| 34 | +```{r loadData} |
| 35 | +txi <- readRDS("RObjects/txi.rds") |
| 36 | +sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types="cccc") |
| 37 | +``` |
| 38 | + |
| 39 | +It is important to be sure that the order of the samples in rows in the sample |
| 40 | +meta data table matches the order of the columns in the data matrix - `DESeq2` |
| 41 | +will **not** check this. If the order does not match you will not be running the |
| 42 | +analyses that you think you are. |
| 43 | + |
| 44 | +```{r checkSampleNames} |
| 45 | +all(colnames(txi$counts)==sampleinfo$SampleName) |
| 46 | +``` |
| 47 | + |
| 48 | +# The model formula and design matrices |
| 49 | + |
| 50 | +Now that we are happy that the quality of the data looks good, we can proceed to |
| 51 | +testing for differentially expressed genes. There are a number of packages to |
| 52 | +analyse RNA-Seq data. Most people use |
| 53 | +[DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) |
| 54 | +[@Love2014] or |
| 55 | +[edgeR](http://bioconductor.org/packages/release/bioc/html/edgeR.html) |
| 56 | +[@Robinson2010; @McCarthy2012]. There is also the option to use the |
| 57 | +[limma](https://bioconductor.org/packages/release/bioc/html/limma.html) package |
| 58 | +and transform the counts using its `voom` function .They are all equally valid |
| 59 | +approaches [@Ritchie2015]. There is an informative and honest blog post |
| 60 | +[here](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/) by Mike Love, |
| 61 | +one of the authors of DESeq2, about deciding which to use. |
| 62 | + |
| 63 | +We will use **DESeq2** for the rest of this practical. |
| 64 | + |
| 65 | +## Create a DESeqDataSet object with the raw data |
| 66 | + |
| 67 | +### Creating the design model formula |
| 68 | + |
| 69 | +First we need to create a design model formula for our analysis. `DESeq2` will |
| 70 | +use this to generate the model matrix, as we have seen in the linear models |
| 71 | +lecture. |
| 72 | + |
| 73 | +We have two variables in our experiment: "Status" and "Time Point". |
| 74 | + |
| 75 | +We will fit two models under two assumptions: no interaction and interaction of |
| 76 | +these two factors, however, to demonstrate the how `DESeq2` is used we will start |
| 77 | +with a simple model which considers Status but ignores Time Point. |
| 78 | + |
| 79 | +First, create a variable containing the model using standard R 'formula' syntax. |
| 80 | + |
| 81 | +```{r modelForumla} |
| 82 | +simple.model <- as.formula(~ Status) |
| 83 | +``` |
| 84 | + |
| 85 | +What does this look like as a model matrix? |
| 86 | + |
| 87 | +```{r modelMatrix} |
| 88 | +model.matrix(simple.model, data = sampleinfo) |
| 89 | +``` |
| 90 | + |
| 91 | +The intercept has been set automatically to the group in the factor that is |
| 92 | +alphabetically first: `Infected`. |
| 93 | + |
| 94 | +It would be nice if `Uninfected` were the base line/intercept. To get R to |
| 95 | +use `Uninfected` as the intercept we need to use a `factor`. Let's set factor |
| 96 | +levels on Status to use `Uninfected` as the intercept. |
| 97 | + |
| 98 | +```{r setFactors} |
| 99 | +sampleinfo <- mutate(sampleinfo, Status = fct_relevel(Status, "Uninfected")) |
| 100 | +model.matrix(simple.model, data = sampleinfo) |
| 101 | +``` |
| 102 | + |
| 103 | +# Build a DESeq2DataSet |
| 104 | + |
| 105 | +We don't actually need to pass `DESeq2` the model matrix, instead we pass it the |
| 106 | +design formula and the `sampleinfo` it will build the matrix itself. |
| 107 | + |
| 108 | +```{r makeDDSObj} |
| 109 | +# create the DESeqDataSet object |
| 110 | +ddsObj.raw <- DESeqDataSetFromTximport(txi = txi, |
| 111 | + colData = sampleinfo, |
| 112 | + design = simple.model) |
| 113 | +``` |
| 114 | + |
| 115 | +When we summarised the counts to gene level, `tximport` also calculated an |
| 116 | +average transcript length for each gene for each sample. For a given gene the |
| 117 | +average transcript length may vary between samples if different samples are |
| 118 | +using alternative transcripts. `DESeq2` will incorporate this into its |
| 119 | +"normalisation". |
| 120 | + |
| 121 | +# Filter out the unexpressed genes |
| 122 | + |
| 123 | +Just as we did in session 7, we should filter out genes that uninformative. |
| 124 | + |
| 125 | +```{r} |
| 126 | +keep <- rowSums(counts(ddsObj.raw)) > 5 |
| 127 | +ddsObj.filt <- ddsObj.raw[keep,] |
| 128 | +``` |
| 129 | + |
| 130 | +# Differential expression analysis with DESeq2 |
| 131 | + |
| 132 | +## The `DESeq2` work flow |
| 133 | + |
| 134 | +The main `DESeq2` work flow is carried out in 3 steps: |
| 135 | + |
| 136 | +### `estimateSizeFactors` |
| 137 | + |
| 138 | +First, Calculate the "median ratio" normalisation size factors for each sample |
| 139 | +and adjust for average transcript length on a per gene per sample basis. |
| 140 | + |
| 141 | +```{r commonSizeFactors} |
| 142 | +ddsObj <- estimateSizeFactors(ddsObj.filt) |
| 143 | +``` |
| 144 | + |
| 145 | +#### Let's have a look at what that did |
| 146 | + |
| 147 | +`DESeq2` has calculated a normalizsation factor for each gene for each sample. |
| 148 | + |
| 149 | +```{r} |
| 150 | +normalizationFactors(ddsObj.filt) |
| 151 | +normalizationFactors(ddsObj) |
| 152 | +``` |
| 153 | + |
| 154 | +We can use `plotMA` from `limma` to look at the of these normalisation factors |
| 155 | +on data in an MA plot. Let's look at **SRR7657882**, the fifth column, which has |
| 156 | +the largest normalisation factors. |
| 157 | + |
| 158 | +```{r} |
| 159 | +logcounts <- log2(counts(ddsObj, normalized=FALSE) + 1) |
| 160 | +
|
| 161 | +limma::plotMA(logcounts, array = 5, ylim =c(-5, 5)) |
| 162 | +abline(h=0, col="red") |
| 163 | +``` |
| 164 | + |
| 165 | +```{r} |
| 166 | +logNormalizedCounts <- log2(counts(ddsObj, normalized=TRUE) + 1) |
| 167 | +
|
| 168 | +limma::plotMA(logNormalizedCounts, array=5, ylim =c(-5, 5)) |
| 169 | +abline(h=0, col="red") |
| 170 | +``` |
| 171 | + |
| 172 | +DESeq2 doesn't actually normalise the counts, it uses raw counts and includes |
| 173 | +the normalisation factors in the modeling as an "offset". Please see the DESeq2 |
| 174 | +documentation if you'd like more details on exactly how they are incorporated |
| 175 | +into the algorithm. For practical purposes we can think of it as a normalisation. |
| 176 | + |
| 177 | +### `estimateDispersions` |
| 178 | + |
| 179 | +Next we need to estimate the dispersion parameters for each gene. |
| 180 | + |
| 181 | +```{r genewiseDispersion} |
| 182 | +ddsObj <- estimateDispersions(ddsObj) |
| 183 | +``` |
| 184 | + |
| 185 | +We can plot all three sets of dispersion estimates. It is particularly important |
| 186 | +to do this if you change any of the default parameters for this step. |
| 187 | + |
| 188 | +```{r plotDisp} |
| 189 | +plotDispEsts(ddsObj) |
| 190 | +``` |
| 191 | + |
| 192 | + |
| 193 | +### `nbinomWaldTest` |
| 194 | + |
| 195 | +Finally, apply Negative Binomial GLM fitting and calculate Wald statistics. |
| 196 | + |
| 197 | +```{r applyGLM} |
| 198 | +ddsObj <- nbinomWaldTest(ddsObj) |
| 199 | +``` |
| 200 | + |
| 201 | +## The `DESeq` command |
| 202 | + |
| 203 | +In practice the 3 steps above can be performed in a single step using the |
| 204 | +`DESeq` wrapper function. Performing the three steps separately is useful if you |
| 205 | +wish to alter the default parameters of one or more steps, otherwise the `DESeq` |
| 206 | +function is fine. |
| 207 | + |
| 208 | +```{r theShortVersion} |
| 209 | +ddsObj <- DESeq(ddsObj.filt) |
| 210 | +``` |
| 211 | + |
| 212 | +## Generate a results table |
| 213 | + |
| 214 | +We can generate a table of differential expression results from the DDS object |
| 215 | +using the `results` function of DESeq2. |
| 216 | + |
| 217 | +```{r resultsTable} |
| 218 | +results.simple <- results(ddsObj, alpha=0.05) |
| 219 | +results.simple |
| 220 | +``` |
| 221 | + |
| 222 | +How many genes have an adjusted p-value of less than 0.05? |
| 223 | + |
| 224 | +```{r countSigGenes} |
| 225 | +sum(results.simple$padj < 0.05) |
| 226 | +``` |
| 227 | + |
| 228 | + |
| 229 | +### Independent filtering |
| 230 | + |
| 231 | +You will notice that some of the adjusted p-values (`padj`) are NA. |
| 232 | + |
| 233 | +```{r} |
| 234 | +sum(is.na(results.simple$padj)) |
| 235 | +``` |
| 236 | + |
| 237 | +Remember |
| 238 | +in Session 7 we said that there is no need to pre-filter the genes as DESeq2 |
| 239 | +will do this through a process it calls 'independent filtering'. The genes |
| 240 | +with `NA` are the ones `DESeq2` has filtered out. |
| 241 | + |
| 242 | +From `DESeq2` manual: |
| 243 | +"The results function of the `DESeq2` package performs independent filtering by |
| 244 | +default using the mean of normalized counts as a filter statistic. A threshold |
| 245 | +on the filter statistic is found which optimizes the number of adjusted p values |
| 246 | +lower than a [specified] significance level". |
| 247 | + |
| 248 | +The default significance level for independent filtering is `0.1`, however, you |
| 249 | +should set this to the FDR cut off you are planning to use. We will use `0.05` - |
| 250 | +this was the purpose of the `alpha` argument in the previous command. |
| 251 | + |
| 252 | +How many genes have an adjusted p-value of less than 0.05? |
| 253 | + |
| 254 | +```{r countSigGenes2} |
| 255 | +sum(results.simple$padj < 0.05, na.rm = TRUE) |
| 256 | +``` |
| 257 | + |
| 258 | +How many genes are up-regulated? |
| 259 | + |
| 260 | +```{r countSigUpGenes2} |
| 261 | +sum(results.simple$padj < 0.05 & results.simple$log2FoldChange > 0, na.rm = TRUE) |
| 262 | +``` |
| 263 | + |
| 264 | +How many genes are down-regulated? |
| 265 | + |
| 266 | +```{r countSigDownGenes2} |
| 267 | +sum(results.simple$padj < 0.05 & results.simple$log2FoldChange < 0, na.rm = TRUE) |
| 268 | +``` |
| 269 | + |
| 270 | +### Exercise 1 |
| 271 | + |
| 272 | +Now we have made our results table using our simple model, how many genes are: |
| 273 | + |
| 274 | +a) significantly up-regulated? |
| 275 | + |
| 276 | +b) significantly down-regulated? |
0 commit comments