RNA sequencing (RNA-Seq) employs high-throughput sequencing technologies to unravel the transcriptomic profile of living organisms. The raw sequencing data require a series of elegant and sophisticated bioinformatics tools to extract useful and insightful biological information. Here I'll describe a basic RNA-Seq pipeline that uses the Snakemake workflow language for automation.
The software required to run this analysis is contained in a conda environment. If you don't have Anaconda installed on your machine, a basic tutorial can be found here. Otherwise, you can simply import an environment that I've already made using:
conda env create --file environment.yaml # Create rnaseq environment
conda activate rnaseq # Enter the newly created environment
The first command uses the environment.yaml file to create an environment called rnaseq containing all the programs you'll need; the second activates it.
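Once the environment is active, a quick check that the expected programs are on your PATH can save a failed run later. The sketch below is not part of this repository: the helper name missing_tools is hypothetical, and the tool list is an assumption based only on the commands and names that appear in this README, so adjust it to match environment.yaml.

```shell
# Print every tool from the argument list that is NOT found on PATH.
# Hypothetical helper; it prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool list assumed from the commands used in this README.
missing_tools snakemake fastq-dump parallel kallisto fastqc
```

If this prints any names, the environment is probably not activated or was not created completely.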
The RNA-Seq data that we're going to use are described in Sousa et al., 2019. In this paper, the authors describe the differentially expressed genes (DEGs) of T cells stimulated by different versions of OKT3, an anti-CD3 antibody. The data are publicly available in SRA under the study code SRP139131. For simplicity, we'll use only the control samples and the T cells treated with OKT3. However, feel free to use the entire dataset or any other data that may interest you.
Use the following command to create our (sub)directories:
mkdir -p {trimData,rawData,qcData}
Now, inside the rawData directory, download the working data from SRA using the following command (if you have your own data available, skip this step):
cat SRR_Acc_List.txt | parallel "fastq-dump --gzip --split-files {}"
The reference genome inside the kallisto directory needs to be unzipped (once this file is unzipped, there is no need to unzip it again):
gunzip *.gz
To run our analysis we'll need to execute the snakefile containing all the RNA-Seq steps. In your terminal, use:
snakemake -s snakefile -j 8
The snakemake command will execute our snakefile. The number of cores after -j will depend on the CPU capacity of your machine or server.
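If you are unsure what to pass to -j, one option is to derive it from the number of cores your machine reports. This is only a sketch: the helper name pick_jobs and the policy of leaving one core free are my assumptions, and nproc comes from GNU coreutils (getconf is used as a fallback).

```shell
# Suggest a value for -j: the given core count minus one, but never below 1.
# With no argument, the core count is detected with nproc (GNU coreutils),
# falling back to getconf and finally to 1.
pick_jobs() {
    cores="${1:-$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}"
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

# Preview the planned jobs without running them (-n is snakemake's dry run),
# then launch for real:
#   snakemake -s snakefile -n -j 1
#   snakemake -s snakefile -j "$(pick_jobs)"
```

The dry run is a cheap way to confirm that snakemake finds the input files before any step actually runs.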
- Before executing the snakefile, ensure that all the data and directories needed are set up as described earlier.
- If you use your own data, be aware that the config.yaml file needs to be modified. You should insert only the file names/IDs, not something like {sample_name}_1.fastq.gz. If in doubt, look at the config.yaml in this repository.
- You also need to set the full paths of the directories on your system in the config.yaml file. If they are not set correctly, the input files and their respective outputs won't be found or written to the right place, according to the snakefile logic. If in doubt, look at the config.yaml in this repository.
- If you only want to run a specific step of the snakefile, just specify the name of the rule you want to run. Example:
snakemake --cores 8 fastqc
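The checks described in the first two notes above can be scripted. The following is a minimal sketch, assuming paired-end reads named <ID>_1.fastq.gz and <ID>_2.fastq.gz (the layout that fastq-dump --gzip --split-files produces for paired data). The function names list_sample_ids and missing_fastq are hypothetical and are not part of the snakefile.

```shell
# List the sample IDs present in a directory, i.e. the file names with the
# "_1.fastq.gz" suffix removed (the form the note above asks for in config.yaml).
# The body runs in a subshell, so its variables stay local.
list_sample_ids() (
    for f in "$1"/*_1.fastq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=$(basename "$f")
        printf '%s\n' "${name%_1.fastq.gz}"
    done
)

# Print each expected FASTQ file that is missing for the accessions listed
# (one per line) in an accession file such as SRR_Acc_List.txt.
missing_fastq() (
    list=$1
    dir=$2
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue        # skip blank lines
        for mate in 1 2; do
            file="$dir/${acc}_${mate}.fastq.gz"
            [ -e "$file" ] || printf '%s\n' "$file"
        done
    done < "$list"
)
```

For example, running list_sample_ids rawData prints the IDs to copy into config.yaml, and running missing_fastq SRR_Acc_List.txt . from inside rawData prints nothing when every download completed.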