lcscs12345/HBV_splicing_paper_2025


Decoding the interconnected splicing patterns of hepatitis B virus and host using large language and deep learning models

This repository contains the full pipeline for our manuscript, covering data preprocessing, statistical analysis, embedding extraction, dimensionality reduction, clustering, and splice site prediction. All steps are documented in the provided Jupyter notebooks.

Setup and Installation

The following steps re-create the conda environments used in this study. For simplicity, they install Miniforge3 in the home directory.

  1. Download and install Miniforge3
wget https://github.com/conda-forge/miniforge/releases/download/25.3.0-3/Miniforge3-25.3.0-3-Linux-x86_64.sh
bash Miniforge3-25.3.0-3-Linux-x86_64.sh -b -p $HOME/miniforge3 # can be installed elsewhere
eval "$($HOME/miniforge3/bin/conda shell.bash hook)"
  2. Download this GitHub repository
git clone https://github.com/lcscs12345/HBV_splicing_paper_2025.git
  3. Create conda environments
# conda environment with various utilities installed
conda env create --file environment_files/environment_utils.yml -p $HOME/miniforge3/envs/utils
# Fix pyfasta to make it compatible with python 3.13.5 and numpy 2.3.1
sh environment_files/fix_pyfasta.sh

# conda environment with OpenSpliceAI and dependencies installed
conda env create --file environment_files/environment_openspliceai.yml -p $HOME/miniforge3/envs/openspliceai

# conda environment for SpliceBERT dependencies
conda env create --file environment_files/environment_llm.yml -p $HOME/miniforge3/envs/llm

# NOTE: Alternatively, use micromamba as a faster drop-in replacement for conda to create environments
# Install micromamba
conda install -c conda-forge micromamba

# Example: using micromamba instead of conda
micromamba create -f environment_files/environment_utils.yml -p $HOME/micromamba/envs/utils

# Subsequent steps (activation and usage) remain the same as with conda.
  4. Download project files and SpliceBERT and OpenSpliceAI models from Zenodo and unzip them within this repository. The complete directory structure should look like:
HBV_splicing_paper_2025/
├── data/
├── environment_files/
├── jupyter_notebooks/
├── ref/
├── results/
├── scripts/
└── src/
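
After unpacking the Zenodo archive, a quick check that the expected top-level directories exist can catch an incomplete download before a notebook fails. The sketch below is not part of the repository: the function name `missing_dirs` is hypothetical, and the directory names are taken from the layout shown above.

```python
from pathlib import Path

# Top-level directories listed in the expected repository layout.
EXPECTED_DIRS = (
    "data", "environment_files", "jupyter_notebooks",
    "ref", "results", "scripts", "src",
)

def missing_dirs(repo_root):
    """Return the expected top-level directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

if __name__ == "__main__":
    # Run from the repository root (assumed to be the current directory).
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

An empty list means the layout matches. Any names returned are most likely directories that ship in the Zenodo archive rather than in the Git repository.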

Data Description

The data/processed_files/ directory contains intermediate processed data files generated during the various stages of the analysis pipeline. These files are typically derived from raw data through complex transformations and computations performed by Bash commands and Python scripts executed within the Jupyter notebooks. These directories and files will be present if the Zenodo archive has been successfully unpacked within the root repository directory.

Key output files include:

  • Cleaned Data Files (.csv, .pkl.gz): Data files in results/data.
  • Figure Files (.pdf, .png): Plots in results/figures.

For a comprehensive list and descriptions of all automatically detected project files, please refer to ./PROJECT_FILES.md.
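
To see at a glance what a run has produced, the output files can be tallied by the extensions listed above. This is a minimal standard-library sketch, not the logic of scripts/generate_readme.py; the function name `count_outputs` and the "other" bucket are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# Extensions named in the list above; ".pkl.gz" is matched as a whole.
KNOWN_EXTENSIONS = (".pkl.gz", ".csv", ".pdf", ".png")

def count_outputs(results_dir):
    """Count files under results_dir (recursively) by known extension."""
    counts = Counter()
    for path in Path(results_dir).rglob("*"):
        if not path.is_file():
            continue
        for ext in KNOWN_EXTENSIONS:
            if path.name.endswith(ext):
                counts[ext] += 1
                break
        else:
            # No known extension matched this file.
            counts["other"] += 1
    return counts

if __name__ == "__main__":
    for ext, n in sorted(count_outputs("results").items()):
        print(f"{ext}\t{n}")
```

Matching on the end of the file name rather than on `Path.suffix` keeps `.pkl.gz` intact and copes with dotted names such as umap.logit_acceptors.leiden.png.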

Notebooks

Compares the proportions of spliced HBV RNAs and evaluates splice site-level splicing efficiency across liver biopsy tissues and cultured cell lines. It highlights that the splice site-level measurement may serve as a stronger biomarker than the overall proportions of HBV splicing.

  • Input Files:
    • data/huh7/SraRunTable.csv: Comma-separated values file with tabular data.
    • data/processed_files/cosi.hbv.txt: Tab-delimited text file with processed results.
    • data/processed_files/cosi.pkl.gz: Wide-format coSI scores for HBV splice donor and acceptor pairs.
    • data/processed_files/cosi_long.pkl.gz: Long-format coSI scores for HBV splice donor and acceptor sites.
    • data/processed_files/map.txt: Tab-delimited text file with processed results.
    • data/processed_files/mgen-7-492-s002.xlsx: Supplementary file from our previous study.
    • data/processed_files/tcons.txt: Tab-delimited text file with processed results.
    • data/processed_files/track.lol.txt: Tab-delimited text file with processed results.
  • Output Files:
    • This notebook generates 6 plot or figure files, 2 tab-delimited text files with processed results, 1 file containing the final proportions of spliced HBV RNAs used in the results, and 6 other files. For a comprehensive list, see the Project Files README.

Extracts nucleotide embeddings, applies dimensionality reduction, and performs clustering to uncover splice site sequence patterns in both HBV and host genomes.

  • Input Files:
    • This notebook uses 3 coordinate mapping files in MAFFT mapout format, 3 files of splice site statistics across clusters, 2 files of splice donor/acceptor site coordinates, and 7 other files. For a comprehensive list, see the Project Files README.
  • Output Files:
    • This notebook generates 12 files of splice donor/acceptor site coordinates, 9 BED files of genomic features, 7 plot or figure files, and 20 other files. For a comprehensive list, see the Project Files README.

Extracts nucleotide embeddings, applies dimensionality reduction, and performs clustering to uncover splice site sequence patterns in both HBV and host genomes.

  • Input Files:
    • data/processed_files/consensus_splice_sites.bed.gz: Consensus splice site coordinates.
    • data/processed_files/conss_nonhs.all.stats: Splice site statistics across clusters.
    • data/processed_files/conss_nonhs.cds.stats: Splice site statistics across clusters.
    • data/processed_files/crosstab_acceptors_perc_nonhs.pkl.gz: Cluster-level percentage of true splice site labels (donor/acceptor).
    • data/processed_files/crosstab_donors_perc_nonhs.pkl.gz: Cluster-level percentage of true splice site labels (donor/acceptor).
    • data/processed_files/cstats_nonhs.pkl.gz: Splice site statistics across clusters.
    • data/processed_files/exonicss_nonhs.hbv.txt: Mapped coSI scores to exonic splice sites.
    • data/processed_files/exonicss_nonhs.txt: Mapped coSI scores to exonic splice sites.
  • Output Files:
    • This notebook generates 7 plot or figure files, 4 files of cluster-level percentages of true splice site labels (donor/acceptor), 4 processed files generated via Bash or Python commands in the notebook, and 6 other files. For a comprehensive list, see the Project Files README.
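
One plausible reading of the "cluster-level percentage of true splice site labels" files is a contingency table of cluster assignments against true site labels, normalised within each cluster. The following standard-library sketch illustrates that idea only; it is not the notebook's code, and the function name, cluster IDs, and toy labels are invented for illustration.

```python
from collections import Counter, defaultdict

def cluster_label_percentages(clusters, labels):
    """For each cluster, return the percentage of its members carrying each label."""
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return {
        cluster: {label: 100.0 * n / sum(c.values()) for label, n in c.items()}
        for cluster, c in counts.items()
    }

# Toy data (hypothetical): cluster assignments and true labels for six sites.
clusters = [0, 0, 0, 0, 1, 1]
labels = ["donor", "donor", "donor", "non-site", "non-site", "non-site"]
percentages = cluster_label_percentages(clusters, labels)
print(percentages)
```

A cluster dominated by one true label suggests the embedding separates that site class well; a mixed cluster suggests it does not.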

Analyses splicing propensity and classifies HBV splice donor sites using SpliceBERT, leveraging transformer-based sequence representations to identify predictive features. Includes performance metrics evaluation and splice donor site conservation analysis to assess model accuracy and evolutionary constraints, respectively.

  • Input Files:
    • data/processed_files/cosi_long.pkl.gz: Long-format coSI scores for HBV splice donor and acceptor sites.
    • ref/hbvdb/pgrna/pgrna_flank200.txt: Tab-delimited text file with processed results.
  • Output Files:
    • ref/hbvdb/pgrna/pgrna_flank200.txt: Tab-delimited text file with processed results.
    • ref/hbvdb/pgrna/pgrna_flank200_GT-AA.txt: Tab-delimited text file with processed results.
    • results/figures/fig5/conservation_donors.png: Generated plot or figure output.
    • results/figures/fig5/logit_donors.png: Generated plot or figure output.
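
The description above mentions a performance metrics evaluation without naming the metrics. Precision, recall, and F1 are common choices for binary site classification and are used here as an assumption; the helper below is a standard-library sketch, not the function in scripts/common.py.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = splice site, 0 = non-site)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy labels (hypothetical): four true sites and four non-sites.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case three of four true sites are recovered with one false positive, so precision, recall, and F1 all equal 0.75.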

Analyses splicing propensity and classifies HBV splice acceptor sites using SpliceBERT, leveraging transformer-based sequence representations to identify predictive features. Includes performance metrics evaluation and splice acceptor site conservation analysis to assess model accuracy and evolutionary constraints, respectively.

  • Input Files:
    • data/processed_files/cosi_long.pkl.gz: Long-format coSI scores for HBV splice donor and acceptor sites.
    • ref/hbvdb/pgrna/pgrna_flank200.txt: Tab-delimited text file with processed results.
  • Output Files:
    • ref/hbvdb/pgrna/pgrna_flank200_AG-AA.txt: Tab-delimited text file with processed results.
    • results/figures/fig4/umap.logit_acceptors.leiden.png: Generated plot or figure output.
    • results/figures/fig4/umap.logit_acceptors.png: Generated plot or figure output.
    • results/figures/fig5/conservation_acceptors.png: Generated plot or figure output.
    • results/figures/fig5/logit_acceptors.png: Generated plot or figure output.

Integrates a deep learning-based framework into the study by performing splice site classification with OpenSpliceAI. Includes performance metrics evaluation to assess model accuracy.

  • Input Files:
    • data/processed_files/cosi_long.pkl.gz: Long-format coSI scores for HBV splice donor and acceptor sites.
    • ref/hbvdb/pgrna/pgrna_flank200.txt: Tab-delimited text file with processed results.
  • Output Files:
    • results/figures/fig5/openspliceai_acceptors.png: Generated plot or figure output.
    • results/figures/fig5/openspliceai_donors.png: Generated plot or figure output.

Computes the frequency of splice site motifs across genomic sequences.

  • Input Files:
    • data/processed_files/cosi_long.pkl.gz: Long-format coSI scores for HBV splice donor and acceptor sites.
    • data/processed_files/hbvdb_pgrna_mc3.bed: Genomic features in BED format.
    • data/processed_files/hbvdb_pgrna_mc3.pkl.gz: Processed file generated via Bash or Python command in notebook.
    • data/processed_files/hbvdb_pgrna_mc5.bed: Genomic features in BED format.
    • data/processed_files/hbvdb_pgrna_mc5.pkl.gz: Processed file generated via Bash or Python command in notebook.
  • Output Files:
    • This notebook generates 10 FASTA files containing nucleotide sequences, 6 BED files of genomic features, 6 files of coSI scores mapped to exonic splice sites, and 6 other files. For a comprehensive list, see the Project Files README.
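
Counting splice site motifs amounts to tallying the short sequence found at each annotated position. The sketch below shows the idea with the standard library. It assumes 0-based positions that point at the first base of the motif (for a donor site, the first intronic base). The function name and the toy sequence are hypothetical and do not reproduce the notebook's workflow.

```python
from collections import Counter

def motif_frequencies(sequence, positions, width=2):
    """Return the relative frequency of each width-long motif starting at the given 0-based positions."""
    seq = sequence.upper()
    # Skip positions whose motif would run past either end of the sequence.
    motifs = [seq[p:p + width] for p in positions if 0 <= p <= len(seq) - width]
    counts = Counter(motifs)
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}

# Toy example (hypothetical): four candidate donor positions in a short sequence.
sequence = "ACGTAAGTCCGCAAGTTT"
donor_positions = [2, 6, 10, 14]
frequencies = motif_frequencies(sequence, donor_positions)
print(frequencies)
```

Here three of the four positions carry GT and one carries GC, giving frequencies of 0.75 and 0.25.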

Scripts

  • scripts/common.py: Utility functions for reading FASTA sequences, extracting motifs, and calculating performance metrics for a classifier.
  • scripts/dicts.py: Dictionaries for HBV and human splice sites.
  • scripts/generate_readme.py: Functions to generate README files automatically.
  • scripts/openspliceai_helpers.py: Wrapper for OpenSpliceAI to generate and format splicing predictions from input sequences using PyTorch and pandas.
  • scripts/splicebert_helpers.py: Wrapper for SpliceBERT splice site classification using HuggingFace Transformers with a sliding window approach, alongside utilities for identifying non-splice sites and calculating normalised mutual information (NMI).
  • scripts/track.R: R functions to generate track plots for HBV splice variants and splicing efficiency at each splice site in the viral genome.
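
The sliding window approach mentioned for scripts/splicebert_helpers.py can be pictured as cutting a long sequence into overlapping chunks that each fit the model's input length. The generator below is a generic sketch of that technique. The default window and step sizes are illustrative assumptions, not values taken from the script, and the source does not describe how the wrapper merges predictions from overlapping windows.

```python
def sliding_windows(sequence, window=512, step=256):
    """Yield (start, subsequence) pairs that cover the sequence with overlapping windows."""
    if window <= 0 or not 0 < step <= window:
        raise ValueError("require window > 0 and 0 < step <= window")
    n = len(sequence)
    if n <= window:
        # The whole sequence fits in a single window.
        yield 0, sequence
        return
    start = 0
    while start + window < n:
        yield start, sequence[start:start + window]
        start += step
    # Anchor the final window at the end so the tail is always covered.
    yield n - window, sequence[n - window:]

# Toy example with a small window so the output is easy to inspect.
windows = list(sliding_windows("ABCDEFGHIJ", window=4, step=3))
print(windows)
```

Returning the start offset with each chunk lets per-position predictions be mapped back to coordinates in the full sequence.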
