Merge pull request #39 from stemangiola/post_rehearsal_changes

mblue9 · web-flow · commit 227c4a14aafc · 2020-07-27T15:46:47.000+10:00
Post rehearsal changes
diff --git a/vignettes/solutions.Rmd b/vignettes/solutions.Rmd
@@ -0,0 +1,196 @@
+---
+title: "Bioc 2020 Tidytranscriptomics Solutions"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Solutions}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Questions:  
+1. What is the Fraction of Variance for PC1 and PC2? What do PC1 and PC2 represent?  
+2. How many DE genes are there for treated vs untreated? What is the top DE gene by P value?  
+3. What code can generate a heatmap of variable genes (starting from counts_scaled)?  
+4. What code can you use to visualise expression of the pasilla gene (gene id: FBgn0261552)?  
+5. What code can generate an interactive volcano plot that has gene ids showing on hover?  
+6. What code can generate a heatmap of the top 100 DE genes?
+
+Suggested answers are below. Your code may differ, e.g. if you customise the volcano plot to your liking. Feel free to comment on any of these solutions on the workshop website as described [here](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
+
+```{r out.width = "40%", message=FALSE, warning=FALSE}
+# load libraries
+
+# tidyverse core packages
+library(tibble)
+library(dplyr)
+library(tidyr)
+library(readr)
+library(stringr)
+library(ggplot2)
+
+# tidyverse-friendly packages
+library(tidyHeatmap)
+library(tidybulk)
+library(ggrepel)
+library(plotly)
+
+# load data
+data("pasilla", package = "bioc2020tidytranscriptomics")
+
+# create tidybulk tibble
+counts_tt <- pasilla %>% 
+  tidybulk()
+
+# scale counts
+counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = condition)
+
+# create density plots
+counts_scaled %>%
+  filter(!lowly_abundant) %>%
+	pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
+  ggplot(aes(x=abundance + 1, group=sample, color=condition)) +
+	geom_density() +
+	facet_wrap(~source) +
+	scale_x_log10() +
+	theme_bw()
+```
+
+1. What is the Fraction of Variance for PC1 and PC2? 
+
+```{r}
+counts_scal_PCA <-
+  counts_scaled %>%
+  reduce_dimensions(method="PCA")
+```
+
+Answer: PC1: 47%, PC2: 25%
+
+What do PC1 and PC2 represent?
+
+```{r out.width = "40%"}
+counts_scal_PCA %>%
+	pivot_sample() %>%
+	ggplot(aes(x=PC1, y=PC2, colour=condition, shape=type)) + 
+  geom_point() +
+  geom_text_repel(aes(label=sample), show.legend = FALSE) +
+	theme_bw()
+```
+
+Answer: PC1 represents variance due to the treatment effect (treated vs untreated). PC2 represents variance due to sequencing type (single vs paired).
+
+
+```{r}
+counts_de <-
+  counts_tt %>%
+  test_differential_abundance(.formula = ~ 0 + condition + type, 
+                              .contrasts = c("conditiontreated - conditionuntreated"), 
+                              omit_contrast_in_colnames = TRUE)
+```
+
+2. How many DE genes are there for treated vs untreated (FDR < 0.05)?
+
+```{r}
+counts_de %>% 
+  filter(significant == TRUE) %>% 
+  summarise(num_de = n_distinct(feature))
+```
+
+Answer: 1128
+
+What is the top DE gene by P value? 
+
+```{r}
+topgenes <- counts_de %>%
+	pivot_transcript() %>%
+  arrange(PValue) %>%
+  head(6)
+
+topgenes
+```	
+
+Answer: FBgn0025111
+
+
+3. What code can generate a heatmap of variable genes (starting from counts_scaled)?  
+
+```{r out.width = "40%"}
+counts_scaled %>% 
+  
+  # filter lowly abundant
+  filter(!lowly_abundant) %>%
+	
+	# extract 500 most variable genes
+	keep_variable( .abundance = counts_scaled, top = 500) %>%
+	
+	# create heatmap
+	heatmap(
+	      .column = sample,
+	      .row = feature,
+	      .value = counts_scaled,
+	      annotation = c(condition, type),
+	      transform = log1p 
+	  )
+```
+
+4. What code can you use to visualise expression of the pasilla gene (gene id: FBgn0261552)? 
+
+```{r out.width = "40%"}
+counts_scaled %>%
+	
+	# extract counts for pasilla gene
+	filter(feature == "FBgn0261552") %>%
+	
+	# make boxplot with jittered points
+	ggplot(aes(x = condition, y = counts_scaled + 1, fill = condition, label = sample)) +
+	geom_boxplot() +
+	geom_jitter() +
+	scale_y_log10()+
+	theme_bw()
+```
+
+5. What code can generate an interactive volcano plot that has gene ids showing on hover?  
+
+```{r out.width = "40%"}
+p <- counts_de %>%
+	pivot_transcript() %>%
+
+  # Subset data
+	filter(!lowly_abundant) %>%
+	mutate(significant = FDR<0.05 & abs(logFC) >=2) %>%
+
+  # Plot
+	# map the gene id to "text" so that ggplotly can show it on hover
+	ggplot(aes(x = logFC, y = PValue, text = feature)) +
+	geom_point(aes(color = significant, size = significant, alpha=significant)) +
+	
+	# Custom scales
+	scale_y_continuous(trans = "log10_reverse") +
+	scale_color_manual(values=c("black", "#e11f28")) +
+	scale_size_discrete(range = c(0, 2)) +
+	theme_bw()
+
+ggplotly(p, tooltip = c("text"))
+```
+Tip: Use the "text" aesthetic instead of "label" if you don't want the column name to show up in the hover text, e.g. the plot above gives "FBgn0261552" rather than "feature: FBgn0261552".
+
+
+
+6. What code can generate a heatmap of the top 100 DE genes?
+
+```{r out.width = "40%"}
+top100 <- 
+	counts_de %>%
+	pivot_transcript() %>%
+	arrange(PValue) %>%
+	head(100)
+
+counts_scaled %>% 
+  filter(feature %in% top100$feature) %>%
+	heatmap(
+	      .column = sample,
+	      .row = feature,
+	      .value = counts_scaled,
+	      annotation = c(condition, type),
+	      transform = log1p 
+	  )
+```
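
Review note on question 1 in `vignettes/solutions.Rmd`: the answer states the fraction of variance for PC1 and PC2 without showing how such a number arises. The following is a minimal base R sketch of the calculation on an invented toy matrix, not the pasilla data. The names `toy`, `pca` and `frac_var` are hypothetical, and the sketch uses `prcomp` for illustration only; it makes no claim about how `reduce_dimensions` computes its values internally.

```r
# Toy expression matrix: 6 samples (rows) x 4 genes (columns).
# Values are invented for illustration; they are not the pasilla data.
set.seed(1)
toy <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("sample", 1:6), paste0("gene", 1:4))
)

# PCA on centred, unscaled data
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Fraction of variance per component = component variance / total variance
frac_var <- pca$sdev^2 / sum(pca$sdev^2)
names(frac_var) <- colnames(pca$x)

# Fractions for all components, largest first
round(frac_var, 2)
```

The fractions sum to 1 and are ordered from PC1 downwards, so the first two entries correspond to the kind of values reported for PC1 and PC2 in the answer.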
diff --git a/vignettes/supplementary.Rmd b/vignettes/supplementary.Rmd
@@ -53,7 +53,7 @@ counts_tt <-
 	# shorten sample name
 	mutate(sample=str_remove(sample, "SRR1039")) %>%
 
-	# convert to tidybulk object
+	# convert to tidybulk tibble
 	tidybulk(.sample=sample, .transcript=geneID, .abundance=counts)
 ```
 
@@ -67,8 +67,7 @@ counts_tt %>%
 
 We can also check how many counts we have for each sample by making a bar plot. This helps us see whether there are any major discrepancies between the samples more easily.
 
-```{r}
-# make barplot of counts
+```{r out.width = "40%"}
 ggplot(counts_tt, aes(x=sample, weight=counts, fill=sample)) + 
 	geom_bar() +
 	theme_bw()
@@ -78,14 +77,14 @@ As we are using ggplot2, we can also easily view by any other variable that's a
 
 We can colour by dex treatment.
 
-```{r}
+```{r out.width = "40%"}
 ggplot(counts_tt, aes(x=sample, weight=counts, fill=dex)) + 
 	geom_bar() +
 	theme_bw()
 ```
 We can colour by cell line.
 
-```{r}
+```{r out.width = "40%"}
 ggplot(counts_tt, aes(x=sample, weight=counts, fill=cell)) + 
 	geom_bar() +
 	theme_bw()
@@ -94,7 +93,7 @@ ggplot(counts_tt, aes(x=sample, weight=counts, fill=cell)) +
 
 ## How to examine normalised counts with boxplots
 
-```{r}
+```{r out.width = "40%"}
 # scale counts
 counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)
 
@@ -112,7 +111,7 @@ counts_scaled %>%
 
 ## How to create MDS plot
 
-```{r}
+```{r out.width = "40%"}
 airway %>%
 	tidybulk() %>%
 	scale_abundance(factor_of_interest=dex) %>%
@@ -127,7 +126,7 @@ airway %>%
 
 MA plots enable us to visualise amount of expression (logCPM) versus logFC. Highly expressed genes are towards the right of the plot. We can also colour significant genes (e.g. genes with FDR < 0.05) 
 
-```{r}
+```{r out.width = "40%"}
 # perform differential testing
 counts_de <- 
 	counts_tt %>%
@@ -148,7 +147,7 @@ counts_de %>%
 
 A more informative MA plot, integrating some of the packages in tidyverse.
 
-```{r warning=FALSE}
+```{r out.width = "40%", warning=FALSE}
 counts_de %>%
 	pivot_transcript() %>%
 	
@@ -167,9 +166,9 @@ counts_de %>%
 ```
 
 
-## How to perform gene set analysis
+## How to perform gene enrichment analysis
 
-To run below you'll need the `clusterProfiler` and `org.Hs.eg.db` packages. This is just one suggestion, if you have other suggestions for how to do a 'tidy' pathway analysis feel free to [let us know](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
+To run below you'll need the `clusterProfiler` and `org.Hs.eg.db` packages. This is just one suggestion, adapted from [here](https://simon-anders.github.io/data_analysis_course/lecture9.html). If you have other suggestions for how to do a 'tidy' pathway analysis feel free to [let us know](https://github.com/stemangiola/bioc_2020_tidytranscriptomics/blob/master/CONTRIBUTING.md).
 
 ```{r eval=FALSE}
 library(clusterProfiler)
diff --git a/vignettes/tidytranscriptomics.Rmd b/vignettes/tidytranscriptomics.Rmd