Skip to content

Commit 26e0cae

Browse files
authored
Merge pull request #117 from NICHD-BSPC/sally-edits-may2025
Week 7 01-02: Catch-up scripts and remove dplyr pipes
2 parents 1520a20 + 69f84cc commit 26e0cae

File tree

3 files changed

+134
-61
lines changed

3 files changed

+134
-61
lines changed

lessons/wk6_lesson05_deseq2_analysis.md

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: "Gene-level differential expression analysis with DESeq2"
33
author: "Harvard HPC Staff, Adapted by Sally Chang at NICHD"
4-
date: "Last Modified March 2025"
4+
date: "Last Modified May 2025"
55
---
66

77
Approximate time: 60 minutes
@@ -13,6 +13,31 @@ Approximate time: 60 minutes
1313
- Inspect gene-level dispersion estimates
1414
- Recognize the importance of dispersion during differential expression analysis
1515

16+
## Catch-Up Script
17+
18+
If you need to be completely caught up, you can copy and paste the following into an R Script and run it. If you don't already have the files in your `/data` directory, please see [Wk 5 Lesson 01](../wk5_lesson01_introR_Rstudio.md) for instructions on where to obtain the input files.
19+
20+
``` r
21+
# Setup
22+
# Bioconductor and CRAN libraries used - already installed on Biowulf
23+
library(tidyverse)
24+
library(RColorBrewer)
25+
library(DESeq2)
26+
library(pheatmap)
27+
library(BiocManager)
28+
29+
# Load in data
30+
data <- read.table("data/mov10_AllSamples_featurecounts.Rmatrix.txt", header=T, row.names=1)
31+
32+
meta <- read.table("data/mov10_AllSamples_metadata.txt", header=T, row.names=1)
33+
34+
# Create DESeq2Dataset object
35+
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ sampletype)
36+
37+
# Run DESeq2 on DESeq2Dataset object
38+
dds <- DESeq(dds)
39+
```
40+
1641
## DESeq2 differential gene expression analysis workflow
1742

1843
Previously, we created the DESeq2 object using the appropriate design formula and running DESeq2 using the two lines of code:

lessons/wk7_lesson01_hypothesis_testing.md

Lines changed: 57 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: "Hypothesis testing and multiple testing"
33
author: "Harvard HPC Staff, Adapted by Sally Chang @ NICHD"
4-
date: "Last Modified February 2025"
4+
date: "Last Modified May 2025"
55
---
66

77
Approximate time: 60 minutes
@@ -13,13 +13,40 @@ Approximate time: 60 minutes
1313
- Recognize the importance of multiple test correction
1414
- Identify different methods for multiple test correction
1515

16+
## Catch-Up Script
17+
18+
If you need to be completely caught up, you can copy and paste the following into an R Script and run it. If you don't already have the files in your `/data` directory, please see [Wk 5 Lesson 01](../wk5_lesson01_introR_Rstudio.md) for instructions on where to obtain the input files.
19+
20+
``` r
21+
# Setup
22+
# Bioconductor and CRAN libraries used - already installed on Biowulf
23+
library(tidyverse)
24+
library(RColorBrewer)
25+
library(DESeq2)
26+
library(pheatmap)
27+
library(BiocManager)
28+
29+
# Load in data
30+
data <- read.table("data/mov10_AllSamples_featurecounts.Rmatrix.txt", header=T, row.names=1)
31+
32+
meta <- read.table("data/mov10_AllSamples_metadata.txt", header=T, row.names=1)
33+
34+
# Create DESeq2Dataset object
35+
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ sampletype)
36+
37+
# Run DESeq2 on DESeq2Dataset object
38+
dds <- DESeq(dds)
39+
```
40+
41+
Remember to get your HPC On Demand session going, if applicable, and open your `DEAnalysis` R project!
42+
1643
# DESeq2: Model fitting and Hypothesis testing
1744

1845
The final step in the DESeq2 workflow is taking the counts for each gene and fitting it to the model and testing for differential expression.
1946

2047
<p align="center">
2148

22-
<img src="../img/deseq_workflow_full_2018.png" width="385" alt="deseq full workflow"/>
49+
<img src="../img/deseq_workflow_full_2018.png" alt="deseq full workflow" width="385"/>
2350

2451
</p>
2552

@@ -151,6 +178,34 @@ DESeq2 helps reduce the number of genes tested by removing those genes unlikely
151178

152179
By setting the FDR cutoff to \< 0.05, we're saying that the proportion of false positives we expect amongst our differentially expressed genes is 5%. For example, if you call 500 genes as differentially expressed with an FDR cutoff of 0.05, you expect 25 of them to be false positives.
153180

181+
## Your DE script
182+
183+
In this lesson, we took the additional step of running an additional DESeq2 analysis using a Likelihood Ratio Test to create the `dds_lrt` object . Your `de_script.R` should now contain the following commands to re-create necessary data objects (click to show):
184+
185+
``` r
186+
# Setup
187+
# Bioconductor and CRAN libraries used - already installed on Biowulf
188+
library(tidyverse)
189+
library(RColorBrewer)
190+
library(DESeq2)
191+
library(pheatmap)
192+
library(BiocManager)
193+
194+
# Load in data
195+
data <- read.table("data/mov10_AllSamples_featurecounts.Rmatrix.txt", header=T, row.names=1)
196+
197+
meta <- read.table("data/mov10_AllSamples_metadata.txt", header=T, row.names=1)
198+
199+
# Create DESeq2Dataset object
200+
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ sampletype)
201+
202+
# Run DESeq2 on DESeq2Dataset object
203+
dds <- DESeq(dds)
204+
205+
# Likelihood ratio test
206+
dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)
207+
```
208+
154209
------------------------------------------------------------------------
155210

156211
*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

lessons/wk7_lesson02_wald_test.md

Lines changed: 51 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: "Exploring DESeq2 results: Wald test"
33
author: "Harvard HPC Staff, adapted by Sally Chang at NICHD"
4-
edited: "Last Modified March 2025"
4+
edited: "Last Modified May 2025"
55
---
66

77
Approximate time: 60 minutes
@@ -12,6 +12,36 @@ Approximate time: 60 minutes
1212
- Summarize the different levels of gene filtering
1313
- Explain log fold change shrinkage
1414

15+
## Catch-Up Script
16+
17+
If you need to be completely caught up, you can copy and paste the following into an R Script and run it. If you don't already have the files in your `/data` directory, please see [Wk 5 Lesson 01](../wk5_lesson01_introR_Rstudio.md) for instructions on where to obtain the input files.
18+
19+
``` r
20+
# Setup
21+
# Bioconductor and CRAN libraries used - already installed on Biowulf
22+
library(tidyverse)
23+
library(RColorBrewer)
24+
library(DESeq2)
25+
library(pheatmap)
26+
library(BiocManager)
27+
28+
# Load in data
29+
data <- read.table("data/mov10_AllSamples_featurecounts.Rmatrix.txt", header=T, row.names=1)
30+
31+
meta <- read.table("data/mov10_AllSamples_metadata.txt", header=T, row.names=1)
32+
33+
# Create DESeq2Dataset object
34+
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ sampletype)
35+
36+
# Run DESeq2 on DESeq2Dataset object
37+
dds <- DESeq(dds)
38+
39+
# Likelihood ratio test
40+
dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)
41+
```
42+
43+
Remember to get your HPC On Demand session going, if applicable, and open your `DEAnalysis` R project!
44+
1545
# Exploring Results (Wald test)
1646

1747
By default DESeq2 uses the Wald test to identify genes that are differentially expressed between two sample classes. Given the factor(s) used in the design formula, and how many factor levels are present, we can extract results for a number of different comparisons. Here, we will walk through how to obtain results from the `dds` object and provide some explanations on how to interpret them.
@@ -56,28 +86,14 @@ Alternatively, if you **only had two factor levels you could do nothing** and no
5686
To start, we want to evaluate **expression changes between the MOV10 overexpression samples and the control samples**. As such we will use the first method for specifying contrasts and create a character vector:
5787

5888
``` r
59-
## Define contrasts for MOV10 overexpression
89+
## Define contrasts for MOV10 overexpression - please run this!
6090
contrast_oe <- c("sampletype", "MOV10_overexpression", "control")
6191
```
6292

6393
> ### Does it matter what I choose to be my base level?
6494
>
6595
> Yes, it does matter. **Deciding what level is the base level will determine how to interpret the fold change that is reported.** So for example, if we observe a log2 fold change of -2 this would mean the gene expression is lower in factor level of interest relative to the base level. Thus, if leaving it up to DESeq2 to decide on the contrasts be sure to check that the alphabetical order coincides with the fold change direction you are anticipating.
6696
67-
## Getting Ready to work in R
68-
69-
1.Get your HPC On Demand session going:
70-
71-
- Opening up RStudio using [HPC on Demand](https://hpcondemand.nih.gov/pun/sys/dashboard/), using default values except for Starting Directory: `/data/Bspc-training/YOUR_USERNAME/rnaseq`
72-
73-
- To check whether or not you are in the correct working directory, use `getwd()`. Something like `/vf/users/Bspc-training/changes/rnaseq` should come up.
74-
75-
- Using the Project menu in the top right corner, or the Files Pane window (clicking rnaseq -\> DEanalysis), to navigate to and open `DEanalysis.Rproj`
76-
77-
2. We are assuming that you have the `dds` object in your environment and your packages are loaded - run your `de_setup.R` script if needed!
78-
79-
3. Run the actual DESeq2 analysis if needed `dds <- DESeq(dds)`.
80-
8197
## The results table
8298

8399
Now that we have our contrast created, we can use it as input to the `results()` function. Let's take a quick look at the help manual for the function:
@@ -102,17 +118,13 @@ The results table that is returned to us is **a `DESeqResults` object**, which i
102118
class(res_tableOE)
103119
```
104120

105-
Now let's take a look at **what information is stored** in the results:
121+
Now let's take a look at **what information is stored** in the results, using nested functions that convert `res_table0E` into a data frame that we can then View:
106122

107123
``` r
108124
# What is stored in results?
109-
res_tableOE %>%
110-
data.frame() %>%
111-
View()
125+
View(data.frame(res_tableOE))
112126
```
113127

114-
> **Discussion:** The `%>%` acts as a pipe symbol in R. This functionality comes as part of the [`dplyr`](https://dplyr.tidyverse.org/) package, which was loaded as part of the `tidyverse` that we loaded at the beginning of our lessons. Knowing this, what exactly is the code above doing?
115-
116128
We have six columns of information reported for each gene (row). We can use the `mcols()` function to extract information on what the values stored in each column represent:
117129

118130
``` r
@@ -151,13 +163,14 @@ The missing values represent genes that have undergone filtering as part of the
151163
If within a row, all samples have zero counts there is no expression information and therefore these genes are not tested.
152164

153165
``` r
154-
# Filter genes by zero expression
155-
res_tableOE[which(res_tableOE$baseMean == 0),] %>%
156-
data.frame() %>%
157-
View()
166+
# Filter genes by zero expression and view using the same type of nested command as above
167+
View(data.frame(res_tableOE[which(res_tableOE$baseMean == 0),]))
168+
169+
# You could also count the number of rows that are left in the filtered data frame
170+
nrow(data.frame(res_tableOE[which(res_tableOE$baseMean == 0),]))
158171
```
159172

160-
> **The baseMean column for these genes will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA. *How would you adjust the command above to count the number of rows matching this condition*?**
173+
> **The baseMean column for these genes will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA.**
161174
162175
**2. Genes with an extreme count outlier**
163176

@@ -166,11 +179,9 @@ The `DESeq()` function calculates, for every gene and for every sample, a diagno
166179
``` r
167180
# Filter genes that have an extreme outlier by looking for those rows that have a non-zero base mean but no values for p-value and adjusted p-value. Do we actually have any of these?
168181

169-
res_tableOE[which(is.na(res_tableOE$pvalue) &
182+
View(data.frame(res_tableOE[which(is.na(res_tableOE$pvalue) &
170183
is.na(res_tableOE$padj) &
171-
res_tableOE$baseMean > 0),] %>%
172-
data.frame() %>%
173-
View()
184+
res_tableOE$baseMean > 0),]))
174185
```
175186

176187
> **If a gene contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA.**
@@ -191,11 +202,9 @@ At a user-specified value (`alpha = 0.1`), DESeq2 evaluates the change in the nu
191202

192203
``` r
193204
# Filter genes below the low mean threshold
194-
res_tableOE[which(!is.na(res_tableOE$pvalue) &
205+
View(data.frame(res_tableOE[which(!is.na(res_tableOE$pvalue) &
195206
is.na(res_tableOE$padj) &
196-
res_tableOE$baseMean > 0),] %>%
197-
data.frame() %>%
198-
View()
207+
res_tableOE$baseMean > 0),]))
199208
```
200209

201210
> **If a gene is filtered by independent filtering, then only the adjusted p-value will be set to NA.**
@@ -218,23 +227,9 @@ log2 (normalized_counts_group1 / normalized_counts_group2)
218227

219228
The problem is, these fold change estimates are not entirely accurate as they do not account for the large dispersion we observe with low read counts. To address this, the **log2 fold changes need to be adjusted**.
220229

221-
**This is where we stopped on Tuesday of Week 7!**
222-
223230
------------------------------------------------------------------------
224231

225-
### More accurate LFC estimates: Picking up again from Tuesday
226-
227-
1.Get your HPC On Demand session going:
228-
229-
- Opening up RStudio using [HPC on Demand](https://hpcondemand.nih.gov/pun/sys/dashboard/), using default values except for Starting Directory and **INCREASE MEMORY TO 8G**: `/data/Bspc-training/YOUR_USERNAME/rnaseq`
230-
231-
- To check whether or not you are in the correct working directory, use `getwd()`. Something like `/vf/users/Bspc-training/changes/rnaseq` should come up.
232-
233-
- Using the Project menu in the top right corner, or the Files Pane window (clicking rnaseq -\> DEanalysis), to navigate to and open `DEanalysis.Rproj`
234-
235-
2. We are assuming that you have the `dds` object in your environment and your packages are loaded - run your `de_setup.R` script if needed!
236-
237-
3. Run the actual DESeq2 analysis if needed `dds <- DESeq(dds)`.
232+
## More accurate LFC estimates
238233

239234
To generate more accurate log2 foldchange (LFC) estimates, DESeq2 allows for the **shrinkage of the LFC estimates toward zero** when the information for a gene is low, which could include:
240235

@@ -329,11 +324,11 @@ Now that we have results for the overexpression results, do the same for the **C
329324
3. Shrink the LFC estimates using `lfcShrink()` and assign it back to `res_tableKD`.
330325
4. Store the code you used to do the above exercises in an RScript named `control_vs_kd.R` .
331326

332-
## R Script updates:
327+
## R Script updates:
333328

334329
Here is what your `de_setup.R` script should look like now to regenerate all the R objects we will need to the next lesson:
335330

336-
```r
331+
``` r
337332
# Gene-level differential expression analysis using DESeq2
338333

339334
# Setup
@@ -369,9 +364,7 @@ res_tableOE <- lfcShrink(dds, coef="sampletype_MOV10_overexpression_vs_control",
369364

370365
Now, on to [Week 7, lesson 03!](../lessons/wk7_lesson03_summarizing_dge_results.md)
371366

372-
------------------------------------------------------------------------
373-
*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
374-
375-
*Some materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website*
376-
377-
------------------------------------------------------------------------
367+
| |
368+
|------------------------------------------------------------------------|
369+
| *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* |
370+
| *Some materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website* |

0 commit comments

Comments
 (0)