Commit 1bafd25

Small revisions.
1 parent e264e29 commit 1bafd25

1 file changed: +12 -13 lines changed

05-01-DEG.Rmd

Lines changed: 12 additions & 13 deletions
Original file line number / Diff line number / Diff line change
@@ -10,7 +10,7 @@ We will now use a web application called [Degust](https://degust.erc.monash.edu/
1010

1111
We will be looking at data from the project [SRP062287](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062287) in the Sequence Read Archive (SRA). Read counts can be found in [GSE71960](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71960) in Gene Expression Omnibus (GEO), but in this workshop we have produced our own read counts. The data is associated with [this publication](https://doi.org/10.1210/me.2013-1164).
1212

13-
Triple-negative breast cancer cell line MDA-MB-468 has been engineered to have an inducible ERβ estrogen receptor. This gene is expressed when treated with doxycycline ("plusdox"). Some samples are also be treated with estrogen E2. We expect interesting changes to happen when both this particular estrogen receptor and estrogen are present.
13+
Triple-negative breast cancer cell line MDA-MB-468 has been engineered to have an inducible ERβ estrogen receptor (encoded by the ESR2 gene). This gene is expressed when treated with doxycycline ("plusdox"). Some samples are also treated with estrogen E2. We expect interesting changes to happen when both this particular estrogen receptor and estrogen are present.
1414

1515
There are 12 samples in 4 conditions. "DMSO_nodox" is our control condition, and we have three further conditions with doxycycline, E2, and both. Samples were produced in three replications of the experiment.
1616

@@ -35,7 +35,7 @@ These 12 read sets have been run through the nf-core/rnaseq pipeline using the [
3535

3636
→ [laxy.io output](https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a)
3737

38-
We will use `counts.star_featureCounts.tsv`. Download this file from Laxy (or save this [alternative download](https://raw.githubusercontent.com/MonashBioinformaticsPlatform/RNAseq_workshop_2024/refs/heads/main/files/counts.star_featureCounts.tsv)).
38+
We will use `counts.star_featureCounts.tsv`. Download this file from Laxy (or save this [alternative download](https://raw.githubusercontent.com/MonashBioinformaticsPlatform/RNAseq_workshop/refs/heads/main/files/counts.star_featureCounts.tsv)).
3939

4040
Now go to [Degust](https://degust.erc.monash.edu/) and press "Upload your counts file..." to upload the file. You should end up on a configuration screen.
4141

@@ -211,7 +211,7 @@ Multi-Dimensional Scaling (MDS) provides a 2D layout of your samples, attempting
211211

212212
(We call it an MDS plot rather than a PCA plot because it's based on the `plotMDS` function in the limma package. The limma version uses a slightly different distance calculation method by default.)
213213

214-
Ideally the MDS layout will separate your samples into experimental groups. If your samples have a batch effect, this may also separate the samples along a different direction. If your samples are all jumbled together you will be sad, as you are unlikely to find many differentially expressed genes. You might also notice one or two outlier samples. You could consider excluding these samples from the analysis, ideally with additional justification based on the raw-read and alignment level QC explored earlier.
214+
Ideally the MDS layout will separate your samples into experimental groups. If your samples have a batch effect, this may also separate the samples along a different direction. If your groups are all jumbled together you will be sad, as you are unlikely to find many differentially expressed genes. You might also notice one or two outlier samples. You could consider excluding these samples from the analysis, ideally with additional justification based on the raw-read and alignment level QC explored earlier.
215215

216216
The PCA is calculated on log2(CPM+moderation), where "moderation" is a constant you can adjust. The moderation reduces the amount of noise in genes with low counts. Also the PCA can be calculated on only the most variable genes ("Num genes"). Adjusting these parameters might clean up your MDS layout a little.
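
As a rough illustration of why the moderation constant helps, the sketch below applies log2(CPM + moderation) to made-up CPM values. The function names and the numbers are assumptions chosen for illustration; this is not Degust's own code.

```python
import math

def moderated_log2(cpm, moderation):
    """Return log2(CPM + moderation), the transformation described above."""
    return math.log2(cpm + moderation)

def log2_difference(cpm_a, cpm_b, moderation):
    """Difference in moderated log2 expression between two samples."""
    return moderated_log2(cpm_b, moderation) - moderated_log2(cpm_a, moderation)

# A low-count gene: CPM 1 vs 2 could easily be sampling noise.
low_small = log2_difference(1, 2, moderation=0.5)
low_large = log2_difference(1, 2, moderation=10)

# A highly expressed gene: CPM 1000 vs 2000 is a real doubling.
high_small = log2_difference(1000, 2000, moderation=0.5)
high_large = log2_difference(1000, 2000, moderation=10)

print(f"low-count gene:  {low_small:.2f} (moderation 0.5) vs {low_large:.2f} (moderation 10)")
print(f"high-count gene: {high_small:.2f} (moderation 0.5) vs {high_large:.2f} (moderation 10)")
```

With a larger moderation the apparent difference in the low-count gene shrinks a lot, while the highly expressed gene is barely affected. This is why increasing the moderation can tidy up an MDS layout dominated by noisy low-count genes.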
217217

@@ -285,20 +285,20 @@ Also for the gene table:
285285

286286
Configure your Degust page to account for the batch effect and add the interaction contrast.
287287

288-
Explore the features we have demonstrated. Do you have observations or questions? Post interesting screenshots on the Slack channel.
288+
Explore the features we have demonstrated. Do you have observations or questions? Share any interesting screenshots.
289289

290290

291291
## Going deeper
292292

293293
### Units and normalisation
294294

295-
Different numbers of reads are obtained from different samples. Our assumption is that most genes are not differentially expressed, so the total "library size" of a sample can serve as a reference level against which to compare each gene. Counts Per Million (CPM) is therefore a convenient unit to compare the expression of a gene across different samples. You may also see CPM referred to as RPM (Reads Per Million). (Technical note: If a highly expressed gene increases in expression, it will look like all of the other genes decreased a little in terms of CPM. It is common to make adjustments to library sizes to account for this. Degust uses an adjustment called "TMM".)
295+
Different numbers of reads are obtained from different samples. Our assumption is that most genes are not differentially expressed, so the total "library size" of a sample can serve as a reference level against which to compare each gene. **Counts Per Million (CPM)** is therefore a convenient unit to compare the expression of a gene across different samples. You may also see CPM referred to as RPM (Reads Per Million). (Technical note: If a highly expressed gene increases in expression, it will look like all of the other genes decreased a little in terms of CPM. It is common to make adjustments to library sizes to account for this. Degust uses an adjustment called "TMM".)
296296

297-
If you want to compare the expression levels of different genes there is a further concern. Some RNA-Seq protocols produce reads along the full length of each transcript, and more reads are obtained from longer transcripts. Accounting for transcript lengths, another unit called Transcripts Per Million (TPM) is sometimes used. You may also see mention of an earlier unit called FPKM or RPKM (Fragments/Reads Per Kilobase per Million).
297+
If you want to compare the expression levels of different genes there is a further concern. Some RNA-Seq protocols produce reads along the full length of each transcript, and more reads are obtained from longer transcripts. Accounting for transcript lengths, another unit called **Transcripts Per Million (TPM)** is sometimes used. You may also see mention of an earlier unit called FPKM or RPKM (Fragments/Reads Per Kilobase per Million).
298298

299299
Other RNA-Seq protocols only produce reads at the 3' ends of RNA transcripts. In this case CPM and TPM are the same.
300300

301-
These units are only for visualizing and reporting results. For differential expression analysis, the input is always raw counts. CPMs or TPMs are not raw counts and results will be meaningless if they are used with Degust.
301+
These units are only for visualizing and reporting results. **For differential expression analysis, the input is always raw counts. CPMs or TPMs are not raw counts and results will be meaningless if they are used with Degust.**
302302

303303
To find TPMs in the Laxy output, in the output pane you would navigate to `output/results/star_salmon` and download `salmon.merged.gene_tpm.tsv` or `salmon.merged.transcript_tpm.tsv`.
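
The units in this section can be sketched in a few lines. This is a minimal illustration on invented counts and transcript lengths (the gene names, numbers, and function names are all assumptions). It uses plain library sizes and does not apply the TMM adjustment that Degust uses.

```python
def cpm(counts):
    """Counts Per Million: scale each count by the sample's library size."""
    library_size = sum(counts.values())
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}

def tpm(counts, lengths_kb):
    """Transcripts Per Million: divide by length first, then scale to one million."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate * 1e6 / total_rate for gene, rate in rates.items()}

# Hypothetical raw counts for one sample, and hypothetical lengths in kilobases.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}

sample_cpm = cpm(counts)
sample_tpm = tpm(counts, lengths_kb)

for gene in counts:
    print(gene, round(sample_cpm[gene]), round(sample_tpm[gene]))
```

In this toy sample geneB has three times the CPM of geneA, but it is also three times as long, so the two genes end up with the same TPM. Neither output is a raw count, so neither should be uploaded to Degust.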
304304

@@ -309,9 +309,9 @@ To find TPMs in the Laxy output, in the output pane you would navigate to `outpu
309309

310310
### UMIs and counting
311311

312-
Modern RNA-Seq protocol tag fragments with a Unique Molecular Identifier (UMI) before PCR amplification. This allows each original RNA fragment to be counted once, even if it is seen in multiple reads (or read pairs).
312+
Modern RNA-Seq protocols tag fragments with a **Unique Molecular Identifier (UMI)** before PCR amplification. This allows each original RNA fragment to be counted once, even if it is seen in multiple reads (or read pairs).
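
The idea of counting each original fragment once can be sketched as follows. This is a simplified illustration, assuming reads have already been assigned to a gene and that two reads with the same gene, alignment position, and UMI came from the same original fragment. Real UMI tools also handle sequencing errors in the UMI, which this sketch ignores. The read records are invented.

```python
from collections import Counter

def umi_counts(reads):
    """Count each distinct (gene, position, UMI) combination once per gene."""
    unique_fragments = set(reads)
    return Counter(gene for gene, _position, _umi in unique_fragments)

# Hypothetical reads as (gene, alignment position, UMI sequence).
reads = [
    ("geneA", 1050, "ACGT"),
    ("geneA", 1050, "ACGT"),  # PCR duplicate of the read above
    ("geneA", 1050, "TTAG"),  # same position, different UMI: a distinct fragment
    ("geneB", 220, "ACGT"),
    ("geneB", 220, "ACGT"),   # PCR duplicate
    ("geneB", 220, "ACGT"),   # PCR duplicate
]

read_counts = Counter(gene for gene, _position, _umi in reads)
fragment_counts = umi_counts(reads)

print("reads per gene:    ", dict(read_counts))
print("fragments per gene:", dict(fragment_counts))
```

Both genes have three reads, but after UMI counting geneA has two fragments and geneB has one.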
313313

314-
**Absolute expression enthusiast says:** "I like UMI counting because my TPMs are not biassed due to PCR biasses."
314+
**Absolute expression enthusiast says:** "I like UMIs because my TPMs are no longer biased due to PCR biases."
315315

316316
**Differential expression enthusiast says:** "PCR bias doesn't worry me. Each gene has the same bias in each sample, so it cancels out when I look at fold change. What I do like is that UMI counting removes an extra source of noise."
317317

@@ -330,7 +330,7 @@ Even at the gene level, it is sometimes ambiguous where a read belongs. Salmon c
330330

331331
**Differential expression enthusiast says:** "Genes are easiest to work with. Trying to estimate differential transcript-level counts is a hard inference task, I really have to know what I'm doing and I'll need deeper sequencing too. I do worry a little that differential transcript usage might look like differential gene expression if the transcripts of a gene have different lengths, but it hasn't been a problem in practice."
332332

333-
"At the gene level," continues the differential expression enthusiast, becoming animated, "I did try using some TPM-abundance-based 'counts' the `nf-core/rnaseq` pipeline produced with this dataset, such as `salmon.merged.gene_counts_length_scaled.tsv`, and I noticed in the heatmap there were some extremely noisy genes. This artifactual large amount of noise in some genes can make the whole differential expression analysis worse, because it affects the 'Empirical Bayes' part of the analysis. Proper analysis might involve, for example, the `catchSalmon` function in `edgeR`, which makes use of bootstrap information provided by Salmon. This is not available in Degust."
333+
"At the gene level," continues the differential expression enthusiast, becoming animated, "I did try using some purported TPM-abundance-based counts that the `nf-core/rnaseq` pipeline produced with this dataset, such as `salmon.merged.gene_counts_length_scaled.tsv`, and I noticed in the heatmap there were some extremely noisy genes. This artifactual large amount of noise in some genes can make the whole differential expression analysis worse, because it affects the 'Empirical Bayes' part of the analysis. Proper analysis might involve, for example, the `catchSalmon` function in `edgeR`, which makes use of bootstrap information provided by Salmon. This is not available in Degust."
334334

335335

336336
### Other methods
@@ -343,8 +343,6 @@ With the default voom/limma method, all samples are assumed to have the same qua
343343

344344
Voom with sample weights allows that there might be some samples with lower quality. It assigns each sample a weight, i.e. it allows that some samples may have more variation than others. If your data contains poor quality samples, but you don't want to exclude them, this method might be used.
345345

346-
Exercise: Which of the QC plots can show us if there are poor quality samples?
347-
348346

349347
#### edgeR quasi-likelihood
350348

@@ -387,8 +385,9 @@ Degust provides a convenient interface for analysing RNA-Seq data, but some larg
387385
* Long time series.
388386
* A mixture of biological and technical replication.
389387
* Sources of unwanted variation that are not fully known.
388+
* A design with a before-treatment and an after-treatment sample collected from a set of individuals, with the individuals split into groups that each receive different treatments. (This design is surprisingly common, and it is far from obvious how to analyse it effectively. Talk to us if you're doing this!)
390389

391-
Experiments with these features can be analysed in R. Your first step is to understand the R way of specifying models and using models to perform statistical tests. We have some workshop material on this topic available here:
390+
Experiments with these features can be analysed in R. Your first step is to understand the R way of specifying linear models and using models to perform statistical tests. We have some workshop material on this topic available here:
392391

393392
* ["Linear models in R"](https://monashdatafluency.github.io/r-linear/)
394393
