Project period: December 26–30, 2025
This project presents an end-to-end RNA-seq analysis investigating the host transcriptional response to Influenza A virus (WSN/33) infection.
The workflow integrates experimental design awareness with reproducible computational steps to characterize virus-induced changes in host gene expression, with a particular focus on interferon-mediated antiviral responses.
How does Influenza A virus infection alter host gene expression, and which components of the innate immune response are transcriptionally activated in infected cells compared to mock controls?
Rather than focusing on global differential expression alone, the analysis prioritizes biologically interpretable antiviral pathways, with an emphasis on interferon-stimulated genes (ISGs).
- Host: Human (GENCODE v49, GRCh38)
- Virus: Influenza A virus (WSN/33 strain)
- Conditions:
  - mock: uninfected control
  - virus: Influenza A–infected samples
- Sequencing: RNA-seq (paired-end)
This analysis is based on publicly available bulk RNA-seq data from human samples infected with Influenza A virus, originally published in:
Ashraf U. et al. (2020). Influenza virus infection induces widespread alterations of host cell splicing.
GEO accession: GSE155241
The repository is organized to clearly separate metadata, references, scripts, and results.
- config/
  Contains configuration files used by shell scripts (paths, parameters).
- data/metadata/
  Experimental metadata (samples.csv) describing samples, conditions, and sequencing runs.
- ref/
  Reference files used for analysis, including:
  - a combined host–virus FASTA (human GENCODE v49 + Influenza A WSN/33),
  - the GENCODE v49 annotation (GTF),
  - a transcript-to-gene mapping file (tx2gene).
- scripts/
  Modular shell and R scripts implementing each step of the workflow, from data retrieval to visualization.
- results/
  Processed outputs and final results, including:
  - figures/: PCA, volcano plot, and ISG heatmap,
  - host_gene/: gene-level differential expression results,
  - host_matrix/: host-only expression matrices.
Large intermediate files (raw FASTQ, Salmon indices, quantification outputs) are intentionally excluded.
Raw RNA-seq data were retrieved using the SRA Toolkit and ENA; having two download routes makes retrieval more robust to network instability.
Sequencing quality was assessed using:
- FastQC for individual samples
- MultiQC for aggregated reports
A combined host–virus reference was constructed by merging:
- the human transcriptome (GENCODE v49),
- the Influenza A (WSN/33) genome.
This approach enables simultaneous quantification of host and viral transcripts.
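The merge step can be sketched in shell as below. This is a minimal illustration, not the repository's actual script: the function name `build_combined_fasta` and all file names are hypothetical, and both inputs are assumed to be already decompressed FASTA files.

```shell
#!/usr/bin/env bash
# Sketch: merge a host transcriptome and a viral genome into one FASTA.
# The function name and every file name here are hypothetical placeholders.

build_combined_fasta() {
  local host_fa="$1" virus_fa="$2" out_fa="$3"
  local f dups

  # Refuse to continue if an input is missing or empty.
  for f in "$host_fa" "$virus_fa"; do
    [ -s "$f" ] || { echo "missing or empty input: $f" >&2; return 1; }
  done

  # Host records first, then viral records. awk re-prints every line with a
  # trailing newline, so the last host line cannot fuse with the first
  # viral header when the host file lacks a final newline.
  awk '{ print }' "$host_fa" "$virus_fa" > "$out_fa"

  # Duplicate sequence names would make quantification ambiguous.
  dups=$(grep '^>' "$out_fa" | awk '{ print $1 }' | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate sequence names found:" >&2
    echo "$dups" >&2
    return 1
  fi
}
```

A call such as `build_combined_fasta host.fa virus.fa combined.fa` would produce the merged file that then serves as the input for Salmon indexing.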
Transcript-level quantification was performed using Salmon on the combined reference.
Host and viral transcripts were quantified together, after which viral transcripts were excluded during downstream host gene-level analysis.
Host-only expression matrices were generated by summarizing transcript-level estimates to gene level using a curated tx2gene mapping derived from GENCODE v49.
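To make the idea concrete, the sketch below derives a transcript-to-gene table from a GENCODE-style GTF and sums Salmon read estimates per gene. It is illustrative only: the requirements list tximport, so the actual summarization was presumably done in R, and the function names `make_tx2gene` and `sum_to_gene` are hypothetical. It assumes the standard Salmon `quant.sf` layout (NumReads in column 5) and that viral sequences are simply absent from the mapping, which is what drops them from the host-only output.

```shell
#!/usr/bin/env bash
# Sketch only; function names and file names are hypothetical.

# Build a two-column transcript-to-gene table from a GENCODE-style GTF.
make_tx2gene() {
  local gtf="$1"
  awk -F '\t' '$3 == "transcript" {
    tx = ""; gene = ""
    # Pull the quoted values out of the attribute column (column 9).
    if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
    if (match($9, /gene_id "[^"]+"/)) gene = substr($9, RSTART + 9, RLENGTH - 10)
    if (tx != "" && gene != "") print tx "\t" gene
  }' "$gtf"
}

# Sum NumReads per gene. Transcripts missing from the mapping
# (for example viral segments) are skipped, giving a host-only table.
sum_to_gene() {
  local tx2gene="$1" quant="$2"
  awk -F '\t' '
    NR == FNR { gene[$1] = $2; next }   # first file: load the mapping
    FNR == 1  { next }                  # second file: skip the quant.sf header
    {
      split($1, id, "|")                # GENCODE FASTA names may carry extra "|" fields
      if (id[1] in gene) total[gene[id[1]]] += $5
    }
    END { for (g in total) printf "%s\t%.3f\n", g, total[g] }
  ' "$tx2gene" "$quant" | sort
}
```

Summing estimated counts across the transcripts of each gene mirrors the simplest gene-level summary; the sketch does not reproduce tximport's abundance or length handling.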
Gene-level differential expression was performed with DESeq2, using an explicit contrast:
- virus vs mock
Genes with a positive log2 fold change are more highly expressed in virus than in mock samples (transcriptionally induced upon infection), while negative values indicate lower expression in infected samples (repression).
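As a small illustration of reading the sign convention, the sketch below labels genes in an exported DESeq2 results table by direction of change. Several things here are assumptions rather than details from the project: the function name `classify_de`, a comma-separated export whose first column holds the gene identifier, and an adjusted p-value cutoff of 0.05. The `log2FoldChange` and `padj` column names are the standard DESeq2 ones and are located by header, not by position.

```shell
#!/usr/bin/env bash
# Sketch only: classify_de and the 0.05 default cutoff are assumptions.
# Expects a simple CSV (no commas inside fields) with the gene ID in column 1.

classify_de() {
  local csv="$1" alpha="${2:-0.05}"
  awk -F ',' -v alpha="$alpha" '
    NR == 1 {
      # Locate the standard DESeq2 columns by name.
      for (i = 1; i <= NF; i++) {
        name = $i; gsub(/"/, "", name)
        if (name == "log2FoldChange") lfc = i
        if (name == "padj") padj = i
      }
      if (!lfc || !padj) { print "missing expected columns" > "/dev/stderr"; exit 1 }
      next
    }
    # DESeq2 reports NA for genes it could not test or that were filtered out.
    $padj == "NA" || $lfc == "NA" { next }
    ($padj + 0) < alpha {
      gene = $1; gsub(/"/, "", gene)
      if (($lfc + 0) > 0) print gene "\tinduced"        # higher in virus than mock
      else if (($lfc + 0) < 0) print gene "\trepressed" # lower in virus than mock
    }
  ' "$csv"
}
```

Because the contrast is virus vs mock, "induced" here always means higher expression in infected samples; reversing the contrast would flip every label.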
To explore and interpret host responses:
- A heatmap of interferon-stimulated genes (ISGs) was used to highlight coordinated antiviral responses across samples,
- Volcano plots were used to contextualize the magnitude and direction of differential expression,
- PCA was applied to variance-stabilized counts to assess global transcriptional differences between conditions.
Principal component analysis reveals a clear separation between mock and virus samples, indicating a strong virus-driven transcriptional effect.
The analysis identifies robust induction of classical interferon-stimulated genes (ISGs), including:
- CXCL10
- IFIT family genes (IFIT1, IFIT2, IFIT3)
- ISG15
- IFI27
These genes are hallmarks of early innate immune activation and antiviral defense.
Genes with higher expression in mock samples likely represent baseline cellular processes that are transcriptionally reprogrammed upon infection.
Together, these results illustrate how biologically informed transcriptomic analyses can be used to extract coherent antiviral immune programs from bulk RNA-seq data.
All analytical steps are implemented as modular scripts, allowing the full workflow to be rerun from raw data if needed.
Large intermediate files (raw reads, indices, quantification outputs) are intentionally excluded from version control, while all scripts and final results required for interpretation are provided.
- R (≥ 4.2) with packages: DESeq2, tximport, ggplot2, ggrepel, pheatmap, dplyr, readr
- Salmon for transcript quantification
- FastQC / MultiQC for quality control
Yasmina Soumahoro
Biologist | Bioinformatics | Host–Pathogen Transcriptomics