RNA Seq Workflow

In this tutorial we follow a mixture of this excellent tutorial by Michael Love et al. and this more recent introductory tutorial by Tommy Tang. We combine the two because the Bioconductor tutorial does not explain the setup: it assumes you already have the dataset, and it refers to an R script that no longer exists.

Although the Bioconductor tutorial is great, it skips over some key steps that are difficult for beginners, including data acquisition. Please read it up to the following command in section 3.2, Salmon quantification:

salmon index -i gencode.v27_salmon_0.8.2 -t gencode.v27.transcripts.fa.gz

Before running the code above, we need to set up the environment and acquire the files that are needed.

Note that the files are quite large (~40 GB), so we advise using the group's HPC, although lab PCs should be able to handle it.

Setting Up

Downloading required files and data

First, we need to download the sequence data.

The data used in this tutorial is GEO entry GSE52778. We will acquire the data from ENA rather than GEO, as it is more convenient. Also download the metadata as a TSV file and upload it to the HPC; we will use it later to label the samples properly in R.

Create an empty script to hold the download commands:

touch ena-download-read-run-SRP033351-fastq-ftp.sh
# then edit the file with nano
nano ena-download-read-run-SRP033351-fastq-ftp.sh

Paste the code generated by ENA (below) into the file, then save and exit:

#!/bin/bash
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039512/SRR1039512_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039523/SRR1039523_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039512/SRR1039512_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039523/SRR1039523_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
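The ENA-generated list above also fetches seven single-end files (for example SRR1039520.fastq.gz) that the quantification script in this tutorial never reads. As a minimal sketch, assuming the directory layout visible in those URLs holds for every run in the project, the paired-end URLs could be generated rather than pasted. The function names `ena_fastq_url` and `list_paired_urls` are hypothetical helpers, not part of any ENA tooling.

```shell
# Build the ENA FTP URL for one mate of a paired-end run.
# Assumption: 10-character accessions such as SRR1039508 are stored under
#   vol1/fastq/<first 6 characters>/00<last digit>/<accession>/
# which is the layout seen in the ENA-generated script above.
ena_fastq_url() {
    local run="$1"
    local mate="$2"
    local prefix last
    prefix=$(printf '%.6s' "$run")   # e.g. SRR103
    last="${run#"${run%?}"}"         # final character of the accession
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/00%s/%s/%s_%s.fastq.gz\n' \
        "$prefix" "$last" "$run" "$run" "$mate"
}

# Print the paired-end URLs for runs SRR1039508 to SRR1039523 (two per run).
list_paired_urls() {
    local n=1039508
    local mate
    while [ "$n" -le 1039523 ]; do
        for mate in 1 2; do
            ena_fastq_url "SRR${n}" "$mate"
        done
        n=$((n + 1))
    done
}
```

With these functions loaded in the current shell, `list_paired_urls | xargs -n 1 wget -nc` would download only the 32 paired-end files.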

Make the script executable:

chmod u+x ena-download-read-run-SRP033351-fastq-ftp.sh

If you are using an HPC, run the download inside a terminal multiplexer such as tmux or screen so that it keeps running if your connection drops. The EE HPC has tmux, whereas Burgundy has screen.

tmux new -s rna_seq_analysis

./ena-download-read-run-SRP033351-fastq-ftp.sh

Note: this method does not check whether the data were downloaded correctly.
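One lightweight way to catch truncated downloads is gzip's built-in integrity test. The sketch below is an illustration rather than part of either tutorial, and `check_fastq_integrity` is a hypothetical helper name. It only confirms that each archive decompresses cleanly; comparing against the MD5 checksums that ENA publishes for each file would be a stricter check.

```shell
# Report every *.fastq.gz file in a directory that fails `gzip -t`.
# Returns non-zero if at least one archive is corrupt or truncated.
check_fastq_integrity() {
    local dir="${1:-.}"
    local status=0
    local f
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue      # nothing matched: the glob stays literal
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_fastq_integrity .` in the download directory prints one line per file; any file marked CORRUPT can be deleted and fetched again by rerunning the download script, since `wget -nc` skips files that already exist.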

Next, we need to download the reference transcriptome that salmon will map the reads to, along with the matching gene annotation.

mkdir reference
cd reference
# download the required files
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.basic.annotation.gtf.gz

Check the GTF file by printing the attribute column of the first three gene records:

zcat gencode.v45.annotation.gtf.gz | grep -v "^#" | awk '$3=="gene"' | cut -f9 | head -3

You can then count the number of times each gene type appears:

zcat gencode.v45.annotation.gtf.gz | grep -v "^#" | awk '$3=="gene"' | cut -f9 | cut -f2 -d ";" | sort | uniq -c | sort -k1,1nr
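The pipeline above relies on gene_type being the second attribute in column 9. As a sketch of a variant that looks the attribute up by name instead of by position, the hypothetical helper `count_gene_types` below reads GTF text on standard input and prints a count per gene type.

```shell
# Count GTF "gene" records per gene_type. Reads uncompressed GTF on stdin
# and prints "<count><TAB><gene_type>", most frequent first.
count_gene_types() {
    awk -F '\t' '
        $0 !~ /^#/ && $3 == "gene" {
            if (match($9, /gene_type "[^"]+"/)) {
                # Skip the 11-character prefix (gene_type, space, quote)
                # and drop the closing quote.
                type = substr($9, RSTART + 11, RLENGTH - 12)
                n[type]++
            }
        }
        END { for (t in n) print n[t] "\t" t }
    ' | sort -k1,1nr -k2,2
}
```

For example, `zcat gencode.v45.annotation.gtf.gz | count_gene_types | head` would list the most common gene types.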

You may want to run FastQC for quality control of the reads and trim sequencing adapters with fastp. We will skip both steps in this tutorial.

Set Up Conda

We also need to install salmon, which can be done through conda.

The following commands are the conda installation instructions from the salmon website:

conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n salmon salmon

This will install the latest salmon in its own conda environment. The environment can then be activated via:

conda activate salmon

Create the index

salmon index -t gencode.v45.transcripts.fa.gz -i gencode.v45_human_index -k 31 --gencode

Quantification

The authors point to an R script for quantification, but this file is no longer available, so we will use bash for this step.

# Create an empty script
touch quant-files.sh
# Edit the script with nano
nano quant-files.sh

Paste the following, save and exit:

#!/bin/bash

# Relative path from raw-seq-data to the index
salmon_index="../reference/gencode.v45_human_index"

# Loop through all samples
for sample in SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 SRR1039514 SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 SRR1039520 SRR1039521 SRR1039522 SRR1039523
do
    echo "Processing sample: $sample"
    # --libType A lets salmon infer the library type; -p sets the number of threads
    salmon quant -i "$salmon_index" -p 6 --libType A \
      --gcBias \
      -1 "${sample}_1.fastq.gz" -2 "${sample}_2.fastq.gz" \
      -o "$sample"
done

# Make it executable
chmod u+x quant-files.sh
# Run the script
./quant-files.sh
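Because the quantification loop expects both a `_1` and a `_2` file for every sample, it can help to confirm they are all present and non-empty before starting a long run. The following is a minimal sketch; `missing_mates` is a hypothetical helper and is not part of salmon.

```shell
# Print the name of every expected paired-end FASTQ file that is missing
# or empty. Usage: missing_mates DIR SAMPLE [SAMPLE...]
# Returns non-zero if anything is missing.
missing_mates() {
    local dir="$1"
    shift
    local missing=0
    local sample mate
    for sample in "$@"; do
        for mate in 1 2; do
            if [ ! -s "$dir/${sample}_${mate}.fastq.gz" ]; then
                echo "${sample}_${mate}.fastq.gz"
                missing=1
            fi
        done
    done
    return "$missing"
}
```

Calling `missing_mates . SRR1039508 SRR1039509` from the data directory prints nothing and returns success when all four files exist, so a missing download is caught before the loop starts rather than partway through.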

Check that we have the output files

find . -name "*sf"

Check the mapping rate of the output

find . -name "salmon_quant.log" | xargs grep "Mapping rate"
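The grep above prints the raw log lines. As a sketch of a tidier summary, the hypothetical helper `mapping_rates` below prints one tab-separated line per sample. It assumes that each log sits at `<sample>/logs/salmon_quant.log` and that the rate appears on a line of the form `Mapping rate = NN.NN%`; check both against your own output before relying on it.

```shell
# Print "<sample><TAB><mapping rate>" for every salmon_quant.log under DIR.
# Assumptions: the log path is <sample>/logs/salmon_quant.log and the rate
# is written as "Mapping rate = NN.NN%". Prints NA if no rate is found.
mapping_rates() {
    local root="${1:-.}"
    local log sample rate
    find "$root" -name salmon_quant.log | sort | while IFS= read -r log; do
        # The sample name is the directory two levels above the log file.
        sample=$(basename "$(dirname "$(dirname "$log")")")
        rate=$(sed -n 's/.*Mapping rate = \([0-9.]*%\).*/\1/p' "$log" | tail -n 1)
        printf '%s\t%s\n' "$sample" "${rate:-NA}"
    done
}
```

Running `mapping_rates .` in the directory that holds the per-sample output folders gives a table that is easy to scan for samples with an unusually low rate.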

Importing the data to R with tximport

In this section, the code is written in R. On the EE server, we advise using a Jupyter notebook with your conda R environment, as this is currently the only way to run notebooks on the server. We want notebooks so that we can see the output of the code, such as graphs, which gives us a feel for the data. Without a notebook, we would have to save each graph to a file manually before we could view it.

Connect to the server through SSH inside VS Code and run the code in the 'rna-seq-analysis.ipynb' file there.
