Skip to content

Commit 5a118bb

Browse files
authored
Merge pull request #119 from NICHD-BSPC/sally-edits-may2025
Edits Week 7 Lesson 04: Volcano plots
2 parents 8c143c1 + ac62603 commit 5a118bb

File tree

4 files changed

+30
-22
lines changed

4 files changed

+30
-22
lines changed

img/volcano_adjusted_pvalue.png

66.4 KB
Loading

img/volcano_basemean.png

76.1 KB
Loading

img/volcano_raw_pvalue.png

63.7 KB
Loading

lessons/wk7_lesson04_visualizing_results.md

Lines changed: 30 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,7 @@ padj.cutoff <- 0.1
6767
res_OE_df <- rownames_to_column(data.frame(res_tableOE), var="gene")
6868

6969
#import relevant parts of GTF file
70-
gtf_names <- read.table("/data/Bspc-training/shared/rnaseq_jan2025/downstream_data/gtf_names.txt", header=TRUE)
70+
gtf_names <- read.table("/data/Bspc-training/shared/rnaseq_mov10/downstream_data/gtf_names.txt", header=TRUE)
7171

7272
#merge gene symbols
7373
res_OE_df <- merge(gtf_names,res_OE_df, by.x="ensgene", by.y="gene")
@@ -208,10 +208,10 @@ To generate a volcano plot, we first need to have a column in our results data i
208208

209209
``` r
210210
## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 in either direction (meaning log2FC >= 0.58)
211-
OE_signif_vector <- res_OE_df$padj < 0.1 & abs(res_OE_df$log2FoldChange) >= 0.58
211+
significant_OE <- res_OE_df$padj < 0.1 & abs(res_OE_df$log2FoldChange) >= 0.58
212212

213213
## Add this vector as a new column to create version of res_OE_df for custom plots
214-
res_OE_df_plotting <- cbind(res_OE_df, OE_signif_vector)
214+
res_OE_df_plotting <- cbind(res_OE_df, significant_OE)
215215
```
216216

217217
Now we can start plotting. The `geom_point` object is most applicable, as this is essentially a scatter plot:
@@ -221,32 +221,32 @@ Now we can start plotting. The `geom_point` object is most applicable, as this i
221221
``` r
222222
## Volcano plot with adjusted p-values
223223
ggplot(res_OE_df_plotting) +
224-
geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = OE_signif_vector)) +
224+
geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = significant_OE)) +
225225
ggtitle("Mov10 overexpression") +
226226
xlab("log2 fold change") +
227227
ylab("-log10 adjusted p-value") +
228-
#scale_y_continuous(limits = c(0,50)) +
229228
theme(legend.position = "none",
230229
plot.title = element_text(size = rel(1.5), hjust = 0.5),
231230
axis.title = element_text(size = rel(1.25)))
232231
```
233232

234-
<img src="../img/mov10_oe_unlabeled_volcano.png" width="500"/>
233+
<img src="../img/volcano_adjusted_pvalue.png" alt="volcano plot using adjusted p-value" width="600"/>
235234

236235
### Volcano Plot with raw p-values
237236

238237
``` r
239238
ggplot(res_OE_df_plotting) +
240-
geom_point(aes(x = log2FoldChange, y = -log10(pvalue), colour = OE_signif_vector)) +
239+
geom_point(aes(x = log2FoldChange, y = -log10(pvalue), colour = significant_OE)) +
241240
ggtitle("Mov10 overexpression") +
242241
xlab("log2 fold change") +
243242
ylab("-log10 p-value") +
244-
#scale_y_continuous(limits = c(0,50)) +
245243
theme(legend.position = "none",
246244
plot.title = element_text(size = rel(1.5), hjust = 0.5),
247245
axis.title = element_text(size = rel(1.25)))
248246
```
249247

248+
<img src="../img/volcano_raw_pvalue.png" alt="volcano plot using raw p-value" width="600"/>
249+
250250
### Volcano Plot colored log10(basemean)
251251

252252
``` r
@@ -255,46 +255,54 @@ ggplot(res_OE_df_plotting) +
255255
ggtitle("Mov10 overexpression") +
256256
xlab("log2 fold change") +
257257
ylab("-log10 p-value") +
258-
#scale_y_continuous(limits = c(0,50)) +
259258
theme(legend.position = "none",
260259
plot.title = element_text(size = rel(1.5), hjust = 0.5),
261260
axis.title = element_text(size = rel(1.25)))
262261
```
263262

264-
### Volcano plot with custom gene list
263+
<img src="../img/volcano_basemean.png" alt="volcano plot with points colored by log of basemean instead of significance" width="600"/>
264+
265+
### Volcano plot with ranked gene list
266+
267+
This is a great way to get an overall picture of what is going on, but we may also want to know the names of the top 10 most differentially expressed genes (by lowest padj) and where they are located on this plot. It could helpful for us in understanding if there is anything unusual about the relationship between
268+
269+
This same type of labeling technique can also be used to label the top lowest or highest genes sorted by any variable in our results dataframe (e.g. basemean, raw pvalue, log2foldchange etc).
265270

266-
This is a great way to get an overall picture of what is going on, but what if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the Volcano plot using `geom_text_repel()`.
271+
We are going to label those dots with the gene name on the Volcano plot using `geom_text_repel()`. Preparing for this will take a few steps in Base R:
267272

268-
First, we need to order the res_tableOE by `padj`, and add an additional column to it, to include on those gene names we want to use to label the plot.
273+
- Add an additional column to our current plotting dataframe, to put those gene names we want to use to label the plot.
274+
275+
- We need to order the res_tableOE by `padj`
276+
277+
- Fill the new empty column with values from the `symbol` column just for the top 10 genes in the sorted dataframe
269278

270279
``` r
271-
## Create an empty column to indicate which genes to label
272-
res_tableOE_plotting <- res_tableOE_plotting %>% dplyr::mutate(genelabels = "")
280+
# Create an empty column
281+
res_OE_df_plotting$genelabels <- ""
273282

274-
## Sort by padj values
275-
res_tableOE_plotting <- res_tableOE_plotting %>% dplyr::arrange(padj)
283+
# Sort by padj values
284+
res_OE_df_plotting <- res_OE_df_plotting[order(res_OE_df_plotting$padj), ]
276285

277286
## Populate the genelabels column with contents of the gene symbols column for the first 10 rows, i.e. the top 10 most significantly expressed genes
278-
res_tableOE_plotting$genelabels[1:10] <- as.character(res_tableOE_plotting$symbol[1:10])
287+
res_OE_df_plotting$genelabels[1:10] <- as.character(res_OE_df_plotting$symbol[1:10])
279288

280289
View(res_tableOE_plotting)
281290
```
282291

283292
Next, we plot it as before with an additional layer for `geom_text_repel()` wherein we can specify the column of gene labels we just created.
284293

285294
``` r
286-
ggplot(res_tableOE_plotting, aes(x = log2FoldChange, y = -log10(padj))) +
287-
geom_point(aes(colour = threshold_OE)) +
295+
ggplot(res_OE_df_plotting, aes(x = log2FoldChange, y = -log10(padj))) +
296+
geom_point(aes(colour = significant_OE)) +
288297
geom_text_repel(aes(label = genelabels)) +
289298
ggtitle("Mov10 overexpression") +
290299
xlab("log2 fold change") +
291300
ylab("-log10 adjusted p-value") +
292-
theme(legend.position = "none",
293-
plot.title = element_text(size = rel(1.5), hjust = 0.5),
301+
theme(plot.title = element_text(size = rel(1.5), hjust = 0.5),
294302
axis.title = element_text(size = rel(1.25)))
295303
```
296304

297-
<img src="../img/mov10_oe_labeled_volcano.png" width="500"/>
305+
<img src="../img/volcano_top10_labeled.png" alt="volcano plot top 10 genes with lowest adjusted p-values labeled" width="600"/>
298306

299307
### Selecting Your Own Gene
300308

0 commit comments

Comments
 (0)