Skip to content

Commit 69f84cc

Browse files
committed
Convert results filtering to non-tidyverse versions, remove redundant instructions for logging onto HPCOnDemand
1 parent 242ec02 commit 69f84cc

File tree

1 file changed

+13
-34
lines changed

1 file changed

+13
-34
lines changed

lessons/wk7_lesson02_wald_test.md

Lines changed: 13 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -118,17 +118,13 @@ The results table that is returned to us is **a `DESeqResults` object**, which i
118118
class(res_tableOE)
119119
```
120120

121-
Now let's take a look at **what information is stored** in the results:
121+
Now let's take a look at **what information is stored** in the results, using nested functions that convert `res_table0E` into a data frame that we can then View:
122122

123123
``` r
124124
# What is stored in results?
125-
res_tableOE %>%
126-
data.frame() %>%
127-
View()
125+
View(data.frame(res_tableOE))
128126
```
129127

130-
> **Discussion:** The `%>%` acts as a pipe symbol in R. This functionality comes as part of the [`dplyr`](https://dplyr.tidyverse.org/) package, which was loaded as part of the `tidyverse` that we loaded at the beginning of our lessons. Knowing this, what exactly is the code above doing?
131-
132128
We have six columns of information reported for each gene (row). We can use the `mcols()` function to extract information on what the values stored in each column represent:
133129

134130
``` r
@@ -167,13 +163,14 @@ The missing values represent genes that have undergone filtering as part of the
167163
If within a row, all samples have zero counts there is no expression information and therefore these genes are not tested.
168164

169165
``` r
170-
# Filter genes by zero expression
171-
res_tableOE[which(res_tableOE$baseMean == 0),] %>%
172-
data.frame() %>%
173-
View()
166+
# Filter genes by zero expression and view using the same type of nested command as above
167+
View(data.frame(res_tableOE[which(res_tableOE$baseMean == 0),]))
168+
169+
# You could also count the number of rows that are left in the filtered data frame
170+
nrow(data.frame(res_tableOE[which(res_tableOE$baseMean == 0),]))
174171
```
175172

176-
> **The baseMean column for these genes will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA. *How would you adjust the command above to count the number of rows matching this condition*?**
173+
> **The baseMean column for these genes will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA.**
177174
178175
**2. Genes with an extreme count outlier**
179176

@@ -182,11 +179,9 @@ The `DESeq()` function calculates, for every gene and for every sample, a diagno
182179
``` r
183180
# Filter genes that have an extreme outlier by looking for those rows that have a non-zero base mean but no values for p-value and adjusted p-value. Do we actually have any of these?
184181

185-
res_tableOE[which(is.na(res_tableOE$pvalue) &
182+
View(data.frame(res_tableOE[which(is.na(res_tableOE$pvalue) &
186183
is.na(res_tableOE$padj) &
187-
res_tableOE$baseMean > 0),] %>%
188-
data.frame() %>%
189-
View()
184+
res_tableOE$baseMean > 0),]))
190185
```
191186

192187
> **If a gene contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA.**
@@ -207,11 +202,9 @@ At a user-specified value (`alpha = 0.1`), DESeq2 evaluates the change in the nu
207202

208203
``` r
209204
# Filter genes below the low mean threshold
210-
res_tableOE[which(!is.na(res_tableOE$pvalue) &
205+
View(data.frame(res_tableOE[which(!is.na(res_tableOE$pvalue) &
211206
is.na(res_tableOE$padj) &
212-
res_tableOE$baseMean > 0),] %>%
213-
data.frame() %>%
214-
View()
207+
res_tableOE$baseMean > 0),]))
215208
```
216209

217210
> **If a gene is filtered by independent filtering, then only the adjusted p-value will be set to NA.**
@@ -234,23 +227,9 @@ log2 (normalized_counts_group1 / normalized_counts_group2)
234227

235228
The problem is, these fold change estimates are not entirely accurate as they do not account for the large dispersion we observe with low read counts. To address this, the **log2 fold changes need to be adjusted**.
236229

237-
**This is where we stopped on Tuesday of Week 7!**
238-
239230
------------------------------------------------------------------------
240231

241-
### More accurate LFC estimates: Picking up again from Tuesday
242-
243-
1.Get your HPC On Demand session going:
244-
245-
- Opening up RStudio using [HPC on Demand](https://hpcondemand.nih.gov/pun/sys/dashboard/), using default values except for Starting Directory and **INCREASE MEMORY TO 8G**: `/data/Bspc-training/YOUR_USERNAME/rnaseq`
246-
247-
- To check whether or not you are in the correct working directory, use `getwd()`. Something like `/vf/users/Bspc-training/changes/rnaseq` should come up.
248-
249-
- Using the Project menu in the top right corner, or the Files Pane window (clicking rnaseq -\> DEanalysis), to navigate to and open `DEanalysis.Rproj`
250-
251-
2. We are assuming that you have the `dds` object in your environment and your packages are loaded - run your `de_setup.R` script if needed!
252-
253-
3. Run the actual DESeq2 analysis if needed `dds <- DESeq(dds)`.
232+
## More accurate LFC estimates
254233

255234
To generate more accurate log2 foldchange (LFC) estimates, DESeq2 allows for the **shrinkage of the LFC estimates toward zero** when the information for a gene is low, which could include:
256235

0 commit comments

Comments
 (0)