bioinformatics-core-shared-training
diff --git a/‎Markdowns/10_DE_analysis_with_DESeq2_part1.Rmd‎
Lines changed: 276 additions & 0 deletions b/‎Markdowns/10_DE_analysis_with_DESeq2_part1.Rmd‎
Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
+---
+title: "Introduction to Bulk RNAseq data analysis"
+subtitle: Differential Expression of RNA-seq data - Part 1
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+  html_document:
+    toc: yes
+  pdf_document:
+    toc: yes
+bibliography: ref.bib
+---
+
+```{r setup, echo=FALSE}
+options(tibble.print_max = 4, tibble.print_min = 4, max.print=40, 
+        tibble.max_extra_cols=2)
+```
+
+## Load the data
+
+In the previous session we read the results from Salmon into R and created a 
+`txi` object, which we then saved into an "rds" file. We can now load the txi
+from that file to start the differential expression analysis. We will also need
+the sample meta data sheet
+
+First load the packages we need.
+
+```{r message = FALSE}
+library(DESeq2)
+library(tidyverse)
+```
+
+Now load the data from the earlier session.
+
+```{r loadData}
+txi <- readRDS("RObjects/txi.rds")
+sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types="cccc")
+```
+
+It is important to be sure that the order of the samples in rows in the sample 
+meta data table matches the order of the columns in the data matrix - `DESeq2`
+will **not** check this. If the order does not match you will not be running the
+analyses that you think you are.
+
+```{r checkSampleNames}
+all(colnames(txi$counts)==sampleinfo$SampleName)
+```
+
+# The model formula and design matrices
+
+Now that we are happy that the quality of the data looks good, we can proceed to
+testing for differentially expressed genes. There are a number of packages to
+analyse RNA-Seq data. Most people use
+[DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
+[@Love2014] or
+[edgeR](http://bioconductor.org/packages/release/bioc/html/edgeR.html)
+[@Robinson2010; @McCarthy2012]. There is also the option to use the
+[limma](https://bioconductor.org/packages/release/bioc/html/limma.html) package
+and transform the counts using its `voom` function .They are all equally valid
+approaches [@Ritchie2015]. There is an informative and honest blog post
+[here](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/) by Mike Love,
+one of the authors of DESeq2, about deciding which to use.
+
+We will use **DESeq2** for the rest of this practical.
+
+## Create a DESeqDataSet object with the raw data
+
+### Creating the design model formula
+
+First we need to create a design model formula for our analysis. `DESeq2` will 
+use this to generate the model matrix, as we have seen in the linear models 
+lecture. 
+
+We have two variables in our experiment: "Status" and "Time Point". 
+
+We will fit two models under two assumptions: no interaction and interaction of
+these two factors, however, to demonstrate the how `DESeq2` is used we will start
+with a simple model which considers Status but ignores Time Point.  
+
+First, create a variable containing the model using standard R 'formula' syntax.
+
+```{r modelForumla}
+simple.model <- as.formula(~ Status)
+```
+
+What does this look like as a model matrix?
+
+```{r modelMatrix}
+model.matrix(simple.model, data = sampleinfo)
+```
+
+The intercept has been set automatically to the group in the factor that is
+alphabetically first: `Infected`.
+
+It would be nice if `Uninfected` were the base line/intercept. To get R to 
+use `Uninfected` as the intercept we need to use a `factor`. Let's set factor 
+levels on Status to use `Uninfected` as the intercept.
+
+```{r setFactors}
+sampleinfo <- mutate(sampleinfo, Status = fct_relevel(Status, "Uninfected"))
+model.matrix(simple.model, data = sampleinfo)
+```
+
+# Build a DESeq2DataSet
+
+We don't actually need to pass `DESeq2` the model matrix, instead we pass it the 
+design formula and the `sampleinfo` it will build the matrix itself.
+
+```{r makeDDSObj}
+# create the DESeqDataSet object
+ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
+                                       colData = sampleinfo,
+                                       design = simple.model)
+```
+
+When we summarised the counts to gene level, `tximport` also calculated an 
+average transcript length for each gene for each sample. For a given gene the
+average transcript length may vary between samples if different samples are 
+using alternative transcripts. `DESeq2` will incorporate this into its 
+"normalisation".
+
+# Filter out the unexpressed genes
+
+Just as we did in session 7, we should filter out genes that uninformative.
+
+```{r}
+keep <- rowSums(counts(ddsObj.raw)) > 5
+ddsObj.filt <- ddsObj.raw[keep,]
+```
+
+# Differential expression analysis with DESeq2
+
+## The `DESeq2` work flow
+
+The main `DESeq2` work flow is carried out in 3 steps:
+
+### `estimateSizeFactors`
+
+First, Calculate the "median ratio" normalisation size factors for each sample 
+and adjust for average transcript length on a per gene per sample basis.
+
+```{r commonSizeFactors}
+ddsObj <- estimateSizeFactors(ddsObj.filt)
+```
+
+#### Let's have a look at what that did
+
+`DESeq2` has calculated a normalizsation factor for each gene for each sample.
+
+```{r}
+normalizationFactors(ddsObj.filt)
+normalizationFactors(ddsObj)
+```
+
+We can use `plotMA` from `limma` to look at the of these normalisation factors
+on data in an MA plot. Let's look at **SRR7657882**, the fifth column, which has
+the largest normalisation factors.
+
+```{r}
+logcounts <- log2(counts(ddsObj, normalized=FALSE)  + 1)
+
+limma::plotMA(logcounts, array = 5, ylim =c(-5, 5))
+abline(h=0, col="red")
+```
+
+```{r}
+logNormalizedCounts <- log2(counts(ddsObj, normalized=TRUE)  + 1)
+
+limma::plotMA(logNormalizedCounts, array=5, ylim =c(-5, 5))
+abline(h=0, col="red")
+```
+
+DESeq2 doesn't actually normalise the counts, it uses raw counts and includes
+the normalisation factors in the modeling as an "offset". Please see the DESeq2
+documentation if you'd like more details on exactly how they are incorporated 
+into the algorithm. For practical purposes we can think of it as a normalisation.
+
+### `estimateDispersions`
+
+Next we need to estimate the dispersion parameters for each gene.
+
+```{r genewiseDispersion}
+ddsObj <- estimateDispersions(ddsObj)
+```
+
+We can plot all three sets of dispersion estimates. It is particularly important
+to do this if you change any of the default parameters for this step.
+
+```{r plotDisp}
+plotDispEsts(ddsObj)
+```
+
+
+### `nbinomWaldTest`
+
+Finally, apply Negative Binomial GLM fitting and calculate Wald statistics.
+
+```{r applyGLM}
+ddsObj <- nbinomWaldTest(ddsObj)
+```
+
+## The `DESeq` command
+
+In practice the 3 steps above can be performed in a single step using the 
+`DESeq` wrapper function. Performing the three steps separately is useful if you
+wish to alter the default parameters of one or more steps, otherwise the `DESeq`
+function is fine.
+
+```{r theShortVersion}
+ddsObj <- DESeq(ddsObj.filt)
+```
+
+## Generate a results table
+
+We can generate a table of differential expression results from the DDS object
+using the `results` function of DESeq2.
+
+```{r resultsTable}
+results.simple <- results(ddsObj, alpha=0.05)
+results.simple
+```
+
+How many genes have an adjusted p-value of less than 0.05?
+
+```{r countSigGenes}
+sum(results.simple$padj < 0.05)
+```
+
+
+### Independent filtering
+
+You will notice that some of the adjusted p-values (`padj`) are NA. 
+
+```{r}
+sum(is.na(results.simple$padj))
+```
+
+Remember 
+in Session 7 we said that there is no need to pre-filter the genes as DESeq2
+will do this through a process it calls 'independent filtering'. The genes 
+with `NA` are the ones `DESeq2` has filtered out.
+
+From `DESeq2` manual:
+"The results function of the `DESeq2` package performs independent filtering by
+default using the mean of normalized counts as a filter statistic. A threshold 
+on the filter statistic is found which optimizes the number of adjusted p values
+lower than a [specified] significance level".
+
+The default significance level for independent filtering is `0.1`, however, you
+should set this to the FDR cut off you are planning to use. We will use `0.05` -
+this was the purpose of the `alpha` argument in the previous command.
+
+How many genes have an adjusted p-value of less than 0.05?
+
+```{r countSigGenes2}
+sum(results.simple$padj < 0.05, na.rm = TRUE)
+```
+
+How many genes are up-regulated?
+
+```{r countSigUpGenes2}
+sum(results.simple$padj < 0.05 & results.simple$log2FoldChange > 0, na.rm = TRUE)
+```
+
+How many genes are down-regulated?
+
+```{r countSigDownGenes2}
+sum(results.simple$padj < 0.05 & results.simple$log2FoldChange < 0, na.rm = TRUE)
+```
+
+### Exercise 1
+
+Now we have made our results table using our simple model, how many genes are:
+
+a) significantly up-regulated?
+
+b) significantly down-regulated?