In this tutorial we follow a mixture of this excellent tutorial by Michael Love et al. and this more recent introductory tutorial by Tommy Tang. The main reason is that the Bioconductor tutorial explains little of the setup: it assumes you already have the dataset and refers to an R script that no longer exists.
Although the Bioconductor tutorial is great, it skips over some key steps that are difficult for beginners, including data acquisition. Please read it up to the following part of section 3.2, Salmon quantification:
salmon index -i gencode.v27_salmon_0.8.2 -t gencode.v27.transcripts.fa.gz
Before running the code above, we need to set up the environment and acquire the files that are needed.
Note that the files are quite large (~40 GB), so it is advisable to use the group's HPC, although lab PCs should be able to handle it.
First, we need to download the sequence data.
The data used in this tutorial is GEO entry GSE52778. We will acquire the data from ENA rather than GEO, as it is more convenient. Also download the metadata as a TSV file and upload it to the HPC, as we will use it later to label the samples properly in R.
touch ena-download-read-run-SRP033351-fastq-ftp.sh
# then edit the file with nano
nano ena-download-read-run-SRP033351-fastq-ftp.sh
Once in the file, copy the code generated by ENA below and paste it into the file, save and exit:
#!/bin/bash
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/004/SRR1039514/SRR1039514_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039512/SRR1039512_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039523/SRR1039523_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039513/SRR1039513.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/005/SRR1039515/SRR1039515.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039510/SRR1039510_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039519/SRR1039519_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039511/SRR1039511_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039512/SRR1039512_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/009/SRR1039509/SRR1039509_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/007/SRR1039517/SRR1039517_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/003/SRR1039523/SRR1039523_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/000/SRR1039520/SRR1039520_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/006/SRR1039516/SRR1039516.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/001/SRR1039521/SRR1039521_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/002/SRR1039522/SRR1039522_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039518/SRR1039518_1.fastq.gz
Make the script executable:
chmod u+x ena-download-read-run-SRP033351-fastq-ftp.sh
If you are using an HPC, use a terminal multiplexer such as tmux or screen. The EE HPC has tmux whereas Burgundy has screen.
tmux new -s rna_seq_analysis
./ena-download-read-run-SRP033351-fastq-ftp.sh
Note: this method does not check whether the data was downloaded correctly.
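Since the download script does not verify its output, one lightweight check is to test each archive with `gzip -t`, which detects truncated or corrupted gzip streams. The sketch below assumes the FASTQ files sit in a single directory. `check_fastq_integrity` is a hypothetical helper name. The check does not compare files against the MD5 checksums that ENA publishes, so it is best seen as a way to catch interrupted downloads rather than every possible error.

```bash
#!/bin/bash
# Sketch: test every downloaded FASTQ archive for gzip corruption.
# check_fastq_integrity is a hypothetical helper, not part of ENA's script.
check_fastq_integrity() {
    local dir="${1:-.}"   # directory holding the *.fastq.gz files
    local failed=0
    local f
    for f in "$dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when the directory has no matches
        [ -e "$f" ] || continue
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            failed=1
        fi
    done
    # Non-zero status if at least one archive failed the test
    return "$failed"
}
```

Running `check_fastq_integrity .` in the download directory prints one line per file and returns a non-zero status if any archive fails. Because `wget -nc` skips files that already exist, a corrupt file has to be deleted before rerunning the download script.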
Next, we need to download the reference transcriptome that the reads will be mapped to, together with the matching annotation files.
mkdir reference
cd reference
# download the required files
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.basic.annotation.gtf.gz
Check the GTF file:
zless -S gencode.v45.annotation.gtf.gz | grep -v "#" | awk '$3=="gene"' | cut -f9 | head -3
You can then view the number of times each gene type appears:
zless -S gencode.v45.annotation.gtf.gz | grep -v "#" | awk '$3=="gene"' | cut -f9 | cut -f2 -d ";" | sort | uniq -c | sort -k1,1nr
You may want to run FastQC for quality control of the reads and trim the sequencing adapters with fastp. We will skip both in this tutorial.
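For readers who do want the optional QC step, the following is a minimal sketch for one paired-end sample, assuming `fastqc` and `fastp` are installed and on the `PATH`. The function name `qc_and_trim`, the `DRY_RUN` switch, the `qc` output directory and the `.trimmed` file names are all illustrative choices rather than anything prescribed by either tool.

```bash
#!/bin/bash
# Sketch: optional QC and adapter trimming for one paired-end sample.
# qc_and_trim, DRY_RUN and the output names are hypothetical choices.
qc_and_trim() {
    local sample="$1"         # run accession, e.g. SRR1039508
    local outdir="${2:-qc}"   # assumed output directory
    local run=""
    # With DRY_RUN=1 the commands are printed instead of executed
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    $run mkdir -p "$outdir"
    # FastQC writes its reports for each input file into $outdir
    $run fastqc -o "$outdir" "${sample}_1.fastq.gz" "${sample}_2.fastq.gz"
    # fastp trims adapters; --detect_adapter_for_pe turns on
    # adapter auto-detection for paired-end input
    $run fastp --detect_adapter_for_pe \
        -i "${sample}_1.fastq.gz" -I "${sample}_2.fastq.gz" \
        -o "$outdir/${sample}_1.trimmed.fastq.gz" \
        -O "$outdir/${sample}_2.trimmed.fastq.gz" \
        -j "$outdir/${sample}.fastp.json" -h "$outdir/${sample}.fastp.html"
}
```

Setting `DRY_RUN=1` before calling `qc_and_trim SRR1039508` prints the three commands without running them, which is a cheap way to check the file names first. If trimmed reads are used, the later quantification step would need to point at the trimmed files instead of the raw ones.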
We also need to install salmon, which can be done through conda.
The commands below follow the conda installation instructions on the salmon website:
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n salmon salmon
This will install the latest salmon in its own conda environment. The environment can then be activated via:
conda activate salmon
Then, from inside the reference directory, build the salmon index from the transcriptome:
salmon index -t gencode.v45.transcripts.fa.gz -i gencode.v45_human_index -k 31 --gencode
The authors point to an R script for quantification; however, this file is no longer available. Therefore, we will use bash to do this step.
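The quantification script that follows lists the sixteen run accessions by hand. As an alternative, the sample names can be derived from the files on disk, which also confirms that both mate files are present before salmon is run. This is only a sketch: `list_paired_samples` is a hypothetical helper name, and it assumes the `_1.fastq.gz` / `_2.fastq.gz` naming produced by the ENA download script.

```bash
#!/bin/bash
# Sketch: derive paired-end sample IDs from the FASTQ files on disk.
# list_paired_samples is a hypothetical helper name.
list_paired_samples() {
    local dir="${1:-.}"   # directory holding the FASTQ files
    local r1 sample
    for r1 in "$dir"/*_1.fastq.gz; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$r1" ] || continue
        # Strip the directory and the _1.fastq.gz suffix
        sample=$(basename "$r1" _1.fastq.gz)
        if [ -e "$dir/${sample}_2.fastq.gz" ]; then
            echo "$sample"
        else
            # Report incomplete pairs on stderr so stdout stays a clean list
            echo "Missing mate file for $sample" >&2
        fi
    done
}
```

Run from the directory that holds the FASTQ files, `for sample in $(list_paired_samples .)` could then stand in for the hard-coded list of accessions. Unpaired files such as `SRR1039520.fastq.gz` are ignored by this helper.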
# Create an empty script
touch quant-files.sh
# Edit the script with nano
nano quant-files.sh
Paste the following, save and exit:
#!/bin/bash
# Relative path from raw-seq-data to the index
salmon_index="../reference/gencode.v45_human_index"
# Loop through all samples
for sample in SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 SRR1039514 SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 SRR1039520 SRR1039521 SRR1039522 SRR1039523
do
    echo "Processing sample: $sample"
    # Quote variables so paths with spaces do not break the command
    salmon quant -i "$salmon_index" -p 6 --libType A \
        --gcBias \
        -1 "${sample}_1.fastq.gz" -2 "${sample}_2.fastq.gz" \
        -o "$sample"
done

# Make it executable
chmod u+x quant-files.sh
# Run the script
./quant-files.sh
Check that we have the output files:
find . -name "*sf"
Check the mapping rate of the output:
find . -name "salmon_quant.log" | xargs grep "Mapping rate"
In this section, the code is to be completed using R. When using the EE server, it is advised to use a Jupyter notebook with your conda R environment, as this is currently the only way to have notebooks on the server. We want notebooks so that we can see the output of the code, such as graphs, which gives us a feel for the data. Without a notebook, we would have to manually save each graph to our directories before we could view it.
Connect to the server through SSH inside VS Code and run the code in the 'rna-seq-analysis.ipynb' file there: