Skip to content

NICHD-BSPC/aksenova-2025

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview

This repo contains the source code corresponding to the full analysis in Aksenova et al., 2026.

The analyses are run through a Snakemake pipeline. The corresponding python scripts, configuration files and downstream RMarkdown files are included.

The pipelines are broken down into sections:

  • processing of RNAseq samples
  • downstream Differential Expression Analysis of bulk RNA samples
  • downsrteam Differential usage of exons and introns of bulk RNA samples

Requirements

  • conda must be installed
  • This code repository must be downloaded and unpacked. Below, we refer to the unpacked directory as $WORKDIR. Unless otherwise specified, paths provided are assumed to be relative to $WORKDIR.
  • Either use a snakemake profile appropriate for your cluster if running on a cluster, or run on a single machine (specifying the number of cores; here we use the placeholder $CORES).

Create conda environments

Create the conda environments in the top level of $WORKDIR:

# to run the snakemake pipeline
conda env create -y -p env --file env.yml

# to run the downstream differential gene expression analysis
conda env create -y -p env-r --file env-r.yml

# to run the downstream differential usage of exons and introns analysis
conda env create -y -p env-dexseq --file env-dexseq.yml

Run bulk RNAseq workflow

All RNAseq samples are processed through a single snakemake pipeline, based on the lcdb-wf github repository. This Snakefile orchestrates the steps of QC filtering, transcript quantification with Kallisto, alignment to the genome with Hisat2, building the annotation reference for merged exons or merged introns (referred to as flattened annotation), quantification in exons or introns with featureCounts, and conversion of signal to BigWig files for eventual visualization in genome viewer.

The Kallisto counts will then be used for the differential gene expression analysis; the exons and introns featureCounts quantification will be used for differential exon and intron usage analysis.

The workflows/rnaseq/config/config.yaml and workflows/rnaseq/config/sampletable.tsv files have already been configured, but can be edited to match other samples and experimental design. The config.yaml file specifies parameters such as the genome reference to download, the annotation file. See lcdb-wf configuration for details.

To run the workflow, navigate to the workflows/rnaseq directory, then activate the environment and run the Snakefile with the following commands:

# activate the environment
conda activate ../../env

# run the rnaseq workflow
snakemake --cores $CORES --use-conda

Workflow outputs include BigWig files found in workflows/rnaseq/data/rnaseq_samples/{samplename}/{samplename}.cutadapt.bam.{strand}.bigwig, transcript quantification files found in workflows/rnaseq/data/rnaseq_samples/{samplename}/{samplename}.kallisto/abundance.tsv and exon or intron quantification files found in workflows/rnaseq/data/rnaseq_aggregation/featurecounts_exons.txt and workflows/rnaseq/data/rnaseq_aggregation/featurecounts_introns.txt respectively.

Run downstream differential expression analysis

This uses as input the files workflows/rnaseq/data/rnaseq_samples/{samplename}/{samplename}.kallisto/abundance.tsv generated by the bulk RNAseq workflow.

Transcript-level counts are loaded into a DESeq2 object, converted to gene-level counts and differential expression is performed with DESeq2. Various genes metrics are calculated, such as GC content and number of exons. Differentially expressed genes are clustered with the R packages Mclust and kmeans and displayed as clustered heatmaps.

To run the workflow, navigate to the workflows/rnaseq/downstream-gene-expression directory, then activate the environment and run the Rmarkdown script with the following commands:

# activate the environment
conda activate ../../../env-r

# run the differential expression workflow
Rscript -e "rmarkdown::render('rnaseq.Rmd')"

The output files are:

  • summary report workflows/rnaseq/downstream-gene-expression/rnaseq.html

  • TSV table containing the genes metrics, log2FoldChange and adjusted p-values from all contrasts and the cluster assignment workflows/rnaseq/downstream-gene-expressioni/KD/all_samples_all_samples_significant_genes.tsv

  • clustered heatmap of significant DE genes LFCs workflows/rnaseq/downstream-gene-expression/KD/Clustering_all_samples_all_samples_significant_genes.pdf

Run downstream differential exon and intron usage analysis

This uses as input the files workflows/rnaseq/data/rnaseq_aggregation/featurecounts_exons.txt and workflows/rnaseq/data/rnaseq_aggregation/featurecounts_introns.txt generated by the bulk RNAseq workflow.

Flatenned-exon and flattenened-intron counts are loaded into a DEXSeq object, and tested for differential exon - or intron - usage with DEXSeq. Results for exons and introns are aggregated, and ratios of significant exon + introns over the total number of exons and introns are clustered with the R packages Mclust and kmeans and displayed as clustered heatmaps.

To run the workflow, navigate to the workflows/rnaseq/downstream-differential-usage directory, then activate the environment and run the Rmarkdown scripts with the following commands:

# activate the environment
conda activate ../../../env-dezseq

# run the differential exon usage; this produces intermediate .Rds files
Rscript -e "rmarkdown::render('differential-exon-usage.Rmd')"

# run the differential intron usage; this produces intermediate .Rds files
Rscript -e "rmarkdown::render('differential-intron-usage.Rmd')"

# run the aggregation script
Rscript -e "rmarkdown::render('aggregation-exon-intron-results.Rmd')"

The output files are:

  • summary report workflows/rnaseq/downstream-differential-usage/aggregation-exon-intron-results.html

  • TSV table containing the cluster assignment and the ratio of significant exon and intron vs. total number in each strain workflows/rnaseq/downstream-differential-usage/aggregated_results/Clustering_ratio_sig_exon_or_intron_per_gene_concatenated_per_condition_{strains}.tsv

  • clustered heatmap of significant exon and intron ratios workflows/rnaseq/downstream-differential-usage/aggregated_results/Clustering_ratio_sig_exon_or_intron_per_gene_concatenated_per_condition_{strain}.pdf

Note on reproducibility

Despite careful use of seed in R and conda environments to freeze tools versions, we noticed non-deterministic behaviors in some tools.

This was observed in Kallisto quant, HISAT2 when reporting the number of hits for multimapping reads, and in the cluster assignment by the R package kmeans. This may results in variations in results between pipeline runs that cannot be avoided.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors