APA-Net is a deep learning model designed for learning context-specific APA (Alternative Polyadenylation) usage. This guide covers the steps necessary to set up and run APA-Net.
- Python 3.8 or higher
- PyTorch 1.8.0 or higher
- NumPy
- Pandas
- SciPy
- tqdm
- wandb (optional, for experiment tracking)
- Clone this repository to your local machine:
git clone https://github.com/BaderLab/APA-Net.git
cd APA-Net
- Install dependencies manually for better control:
# For CPU-only version (smaller download)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# For GPU version (if you have CUDA)
pip install torch torchvision torchaudio
# Install other dependencies
pip install numpy pandas scipy tqdm wandb
- Install the package:
pip install .
- Alternatively, skip the manual dependency step and install everything in one command:
pip install .
Note: This will install the full PyTorch with CUDA support, which is a large download (~2GB).
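After installation, a quick check like the one below can confirm which of the listed packages are importable in the current environment. It is only a sketch: check_dependencies and missing_required are hypothetical helper names, not part of APA-Net, and the check looks for importability only, not for minimum versions.

```python
from importlib.util import find_spec

# Import names of the packages listed in the requirements above.
REQUIRED = ["torch", "numpy", "pandas", "scipy", "tqdm"]
OPTIONAL = ["wandb"]  # only needed for experiment tracking


def check_dependencies(required=REQUIRED, optional=OPTIONAL):
    """Return a dict mapping each package name to True if it can be found."""
    status = {}
    for name in list(required) + list(optional):
        # find_spec locates a top-level module without importing it.
        status[name] = find_spec(name) is not None
    return status


def missing_required(status, required=REQUIRED):
    """List the required packages that were not found."""
    return [name for name in required if not status.get(name, False)]


if __name__ == "__main__":
    status = check_dependencies()
    for name, found in status.items():
        print(f"{name}: {'ok' if found else 'missing'}")
    missing = missing_required(status)
    if missing:
        print("Missing required packages:", ", ".join(missing))
```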
APA-Net expects input data in .npy format with the following structure:
- Shape: (n_samples, 9), where each row represents one sample
- Columns:
  - Column 0: Float value (sample ID/index)
  - Column 1: String (cell type name)
  - Column 2: String (additional metadata)
  - Column 3: Float value
  - Column 4: String (additional metadata)
  - Column 5: String (genomic coordinates/switch name)
  - Column 6: NumPy array of shape (4, 4000), the one-hot encoded DNA sequence
  - Column 7: Float (target APA usage value)
  - Column 8: NumPy array of shape (327,), the cell type profile features
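To make the layout concrete, the sketch below builds a small synthetic array with the nine columns described above and writes it to a .npy file. All values are placeholders chosen for illustration (the cell type name, metadata strings, coordinates, and target range are assumptions), and make_dummy_dataset and save_and_reload are hypothetical helpers rather than APA-Net functions.

```python
import os
import tempfile

import numpy as np

SEQ_LEN = 4000    # sequence length from the format description
N_PROFILE = 327   # number of cell type profile features


def make_dummy_dataset(n_samples=4, seed=0):
    """Build an object array of shape (n_samples, 9) with placeholder values."""
    rng = np.random.default_rng(seed)
    data = np.empty((n_samples, 9), dtype=object)
    for i in range(n_samples):
        # Random one-hot sequence: exactly one active row per position.
        bases = rng.integers(0, 4, size=SEQ_LEN)
        onehot = np.zeros((4, SEQ_LEN), dtype=np.float32)
        onehot[bases, np.arange(SEQ_LEN)] = 1.0

        data[i, 0] = float(i)                 # sample ID/index
        data[i, 1] = "celltype_A"             # cell type name (placeholder)
        data[i, 2] = "meta"                   # additional metadata (placeholder)
        data[i, 3] = 0.0                      # float value (placeholder)
        data[i, 4] = "meta"                   # additional metadata (placeholder)
        data[i, 5] = "chr1:1000-5000"         # coordinates (placeholder)
        data[i, 6] = onehot                   # one-hot encoded DNA sequence
        data[i, 7] = float(rng.uniform(-1, 1))                 # target (arbitrary range)
        data[i, 8] = rng.random(N_PROFILE).astype(np.float32)  # profile features
    return data


def save_and_reload(data, path):
    """Save the object array and load it back; object arrays require pickling."""
    np.save(path, data, allow_pickle=True)
    return np.load(path, allow_pickle=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        reloaded = save_and_reload(make_dummy_dataset(), os.path.join(tmp, "dummy.npy"))
        print(reloaded.shape, reloaded[0, 6].shape, reloaded[0, 8].shape)
```

Because the columns mix strings, floats, and arrays, the file is an object array, so it has to be loaded with allow_pickle=True.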
To train the APA-Net model, use the train_script.py script:
cd apamodel
python train_script.py \
--train_data "/path/to/train_data.npy" \
--valid_data "/path/to/valid_data.npy" \
--modelfile "/path/to/model_output.pt" \
--batch_size 64 \
--epochs 200 \
--device "cpu" \
--use_wandb "False"
You can test the model with sample data:
# Create a simple test script
python -c "
import sys
sys.path.append('./apamodel')
from model import APANET, APAData
import numpy as np
import torch
# Load your data
data = np.load('your_data.npy', allow_pickle=True)
# Configure model (using CPU)
config = {
'device': 'cpu',
'opt': 'Adam',
'loss': 'mse',
'lr': 2.5e-05,
'adam_weight_decay': 0.09,
'conv1kc': 128,
'conv1ks': 12,
'conv1st': 1,
'pool1ks': 16,
'pool1st': 16,
'cnvpdrop1': 0,
'Matt_heads': 8,
'Matt_drop': 0.2,
'fc1_dims': [8192, 4048, 1024, 512, 256],
'fc1_dropouts': [0.25, 0.25, 0.25, 0, 0],
'fc2_dims': [128, 32, 16, 1],
'fc2_dropouts': [0.2, 0.2, 0, 0],
'psa_query_dim': 128,
'psa_num_layers': 1,
'psa_nhead': 1,
'psa_dim_feedforward': 1024,
'psa_dropout': 0
}
# Create and test model
model = APANET(config)
model.compile()
print('Model created successfully!')
"
- --train_data: Path to the training data file (required)
- --valid_data: Path to the validation data file (required)
- --modelfile: Path where the trained model will be saved (required)
- --batch_size: Batch size for training (default: 64)
- --epochs: Number of training epochs (default: 200)
- --project_name: Name of the project for wandb logging (default: "APA-Net_Training")
- --device: Device to run the training on, either "cpu" or "cuda:0" (default: "cuda:0")
- --use_wandb: Enable wandb logging, either "True" or "False" (default: "True")
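For readers who want to wrap or mirror these options, the flags could be declared with argparse roughly as follows. This is a sketch based only on the names and defaults listed above, not the contents of train_script.py; build_parser and wandb_enabled are hypothetical names.

```python
import argparse


def build_parser():
    """Declare the documented command-line flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Sketch of an APA-Net training CLI")
    parser.add_argument("--train_data", required=True, help="Path to the training data file")
    parser.add_argument("--valid_data", required=True, help="Path to the validation data file")
    parser.add_argument("--modelfile", required=True, help="Where the trained model is saved")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--project_name", default="APA-Net_Training")
    parser.add_argument("--device", default="cuda:0", help='"cpu" or "cuda:0"')
    # The flag is passed as the string "True" or "False", so it is kept as text here.
    parser.add_argument("--use_wandb", default="True", choices=["True", "False"])
    return parser


def wandb_enabled(args):
    """Compare against the literal string: bool("False") would be True."""
    return args.use_wandb == "True"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args), "wandb:", wandb_enabled(args))
```

Comparing the string explicitly matters because any non-empty string, including "False", is truthy in Python.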
APA-Net is a deep neural network that combines:
- Convolutional layers for sequence feature extraction
- Self-attention mechanism for capturing long-range dependencies
- Fully connected layers for prediction
- Cell type profile integration for context-specific modeling
The model has approximately 301M parameters and processes:
- Input: DNA sequences (4×4000) + cell type profiles (327 features)
- Output: APA usage prediction (single value)
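As a rough illustration of how the 4×4000 sequence input shrinks through the first convolution and pooling stage, the arithmetic below applies the standard output-length formula to the conv1ks, conv1st, pool1ks, and pool1st values from the example configuration. It assumes no padding and no dilation, which this guide does not confirm, so the real model's intermediate shapes may differ.

```python
def output_length(input_len, kernel_size, stride, padding=0):
    """Output length of a 1D convolution or pooling layer (dilation of 1 assumed)."""
    return (input_len + 2 * padding - kernel_size) // stride + 1


# Values taken from the example configuration in this guide.
SEQ_LEN = 4000
CONV1_KERNEL, CONV1_STRIDE, CONV1_CHANNELS = 12, 1, 128
POOL1_KERNEL, POOL1_STRIDE = 16, 16

conv_len = output_length(SEQ_LEN, CONV1_KERNEL, CONV1_STRIDE)
pool_len = output_length(conv_len, POOL1_KERNEL, POOL1_STRIDE)

print(conv_len)                    # 3989
print(pool_len)                    # 249
print((CONV1_CHANNELS, pool_len))  # (128, 249)
```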
- CUDA errors: If you encounter CUDA-related errors, install the CPU-only version of PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
- Memory issues: Reduce the batch size if you encounter out-of-memory errors:
--batch_size 32
- Data format errors: Ensure your data has the correct shape (n_samples, 9), with sequences of shape (4, 4000) and cell type profiles of shape (327,).
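When a data format error is suspected, a small pre-flight check along these lines can narrow down which rows are at fault before training starts. It is a sketch that covers only the shape requirements stated in this guide; validate_apanet_array is a hypothetical helper, not part of the APA-Net package.

```python
import numpy as np


def validate_apanet_array(data, seq_shape=(4, 4000), profile_len=327):
    """Return a list of problems found; an empty list means the shape checks passed."""
    if data.ndim != 2 or data.shape[1] != 9:
        return [f"expected shape (n_samples, 9), got {data.shape}"]

    problems = []
    for i, row in enumerate(data):
        seq = np.asarray(row[6])      # column 6: one-hot encoded sequence
        profile = np.asarray(row[8])  # column 8: cell type profile
        if seq.shape != tuple(seq_shape):
            problems.append(f"row {i}: sequence shape {seq.shape}, expected {tuple(seq_shape)}")
        if profile.shape != (profile_len,):
            problems.append(f"row {i}: profile shape {profile.shape}, expected ({profile_len},)")
    return problems


if __name__ == "__main__":
    # 'your_data.npy' is a placeholder path, as in the test script in this guide.
    data = np.load("your_data.npy", allow_pickle=True)
    issues = validate_apanet_array(data)
    print("OK" if not issues else "\n".join(issues))
```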
- CPU: Slower but more compatible. Use --device "cpu"
- GPU: Faster training. Use --device "cuda:0" (requires a CUDA-compatible PyTorch installation)
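If it is unclear whether the installed PyTorch build can see a GPU, a helper such as the following can pick a value for --device. It is a minimal sketch; pick_device is a hypothetical name, and it falls back to "cpu" whenever PyTorch is missing or reports no CUDA support.

```python
def pick_device(preferred="cuda:0"):
    """Return the preferred device string if usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to CPU.
        return "cpu"
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed directly as the --device argument.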
Here's a complete example of training APA-Net:
# Navigate to the model directory
cd APA-Net/apamodel
# Train the model
python train_script.py \
--train_data "../test_fold_0.npy" \
--valid_data "../test_fold_0.npy" \
--modelfile "./trained_model.pt" \
--batch_size 32 \
--epochs 50 \
--device "cpu" \
--use_wandb "False" \
--project_name "APA-Net_Test"
The analysis_and_figures/ directory contains all the code and notebooks used to reproduce the results and figures from our APA-Net research paper. The pipeline covers data processing, model evaluation, comparative analysis, and visualization.
analysis_and_figures/
├── model_performance/ # APA-Net model evaluation and performance analysis
├── data_processing/ # Data preparation and preprocessing for APA-Net
├── comparative_analysis/ # Comparative studies (APA vs DE, correlations)
├── gene_expression/ # Differential gene expression analysis
├── pathway_analysis/ # Gene set enrichment and pathway analysis
├── preprocessing/ # Single-cell RNA-seq data preprocessing pipeline
└── functions/ # Utility functions and helper scripts
- Prerequisites: Make sure you have the following R and Python packages installed:
R packages:
install.packages(c("dplyr", "ggplot2", "tidyr", "viridis", "patchwork",
"readxl", "gridExtra", "ggpubr", "ggrepel", "reshape2",
"corrplot", "pheatmap", "boot", "Seurat", "scCustomize"))
Python packages:
pip install pandas numpy scipy matplotlib seaborn scikit-learn statsmodels
- Data Requirements: The analysis scripts expect data in specific locations. You may need to adjust file paths in the notebooks to match your data directory structure.
- APA-NET_performance_plots.ipynb: Generates correlation plots showing model performance across cell types
- APA-Net_filter_interactions.ipynb: Analyzes convolutional filter interactions and RBP binding patterns
- APA-Net_heatmap_for_filter_interactions.ipynb: Creates heatmaps showing filter-RBP interactions
- Process_inputs_for_APA-Net.ipynb: Main data preprocessing pipeline for APA-Net training data
- Processes RNA sequences and APA usage data
- Generates one-hot encoded sequences
- Creates 5-fold cross-validation splits
- Formats data for model training
- APA_quantification_maaper_apalog_Dec2024.ipynb: APA event quantification using MAAPER
- emprical_fdr_thresholds_maaper_apalog.ipynb: Determines empirical FDR thresholds for significance testing
- APA_vs_DE.ipynb: Compares APA changes with differential expression
- Correlation analysis between APA usage and gene expression changes
- Cell-type-specific comparisons
- Statistical significance testing
- apa_correlation_across_celltypes.ipynb: Cross-cell-type APA correlation analysis
- rbp_co_occurance_dissimilarity.ipynb: RNA-binding protein co-occurrence analysis
- DEG_ALS_genes.R: Analysis of ALS-associated gene expression
- DEG_MAST_analysis.R: MAST-based differential expression analysis
- DEG_pathway_analysis.R: Pathway enrichment analysis for DEGs
- DEG_visualization.R: Visualization of differential expression results
- APA_pathway_analysis.R: Gene set enrichment analysis for APA-affected genes
- GO term enrichment
- Reactome pathway analysis
- Custom gene set analysis
- processing_annotation/: Single-cell RNA-seq processing pipeline
  - 01_snRNA_cellranger_preprocess.sh: Cell Ranger preprocessing
  - 02_snRNA_process_QC.R: Quality control and filtering
  - 03_snRNA_clustering_annotation.R: Cell clustering and annotation
  - 04a_snRNA_NSForest1.ipynb & 04b_snRNA_NSForest2.ipynb: NSForest cell type classification
- independent_datasets/: Processing of additional validation datasets
  - 01_read_matrices.R: Matrix reading and preprocessing
  - 02_harmony_int.R: Harmony integration for batch correction
  - 03_doublet_removal_annotation.R: Doublet detection and removal
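Two of the steps attributed to Process_inputs_for_APA-Net.ipynb above, one-hot encoding and 5-fold splitting, can be sketched in a few lines of NumPy. This is not the notebook's code: the A, C, G, T channel order, the zero-padding of short sequences, and the purely random fold assignment are all assumptions, and one_hot_encode and five_fold_indices are hypothetical names.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; the notebook may use a different one


def one_hot_encode(seq, length=4000):
    """Encode a DNA string as a (4, length) float32 array.

    Unknown bases (such as N) and positions past the end of a short
    sequence are left as all-zero columns; longer sequences are truncated.
    """
    encoded = np.zeros((4, length), dtype=np.float32)
    for pos, base in enumerate(seq.upper()[:length]):
        row = BASES.find(base)
        if row >= 0:
            encoded[row, pos] = 1.0
    return encoded


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return np.array_split(order, n_folds)


if __name__ == "__main__":
    print(one_hot_encode("ACGTN", length=8))
    for k, fold in enumerate(five_fold_indices(23)):
        print(f"fold {k}: {len(fold)} samples")
```

A random split like this ignores any grouping by gene or sample, so a real pipeline may need a grouped split to avoid leakage between folds.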
To reproduce the main figures from the paper:
- Model Performance Plots:
cd analysis_and_figures/model_performance
jupyter notebook APA-NET_performance_plots.ipynb
- APA Usage Analysis:
cd analysis_and_figures/visualization
jupyter notebook maaper_volcanos_barplots_figure6.ipynb
- Comparative Analysis:
cd analysis_and_figures/comparative_analysis
jupyter notebook APA_vs_DE.ipynb
To process your own data through the complete pipeline:
- Start with raw single-cell data:
cd analysis_and_figures/preprocessing/processing_annotation
bash 01_snRNA_cellranger_preprocess.sh
- Process and prepare for APA-Net:
cd analysis_and_figures/data_processing
jupyter notebook Process_inputs_for_APA-Net.ipynb
- Model Performance: APA-Net achieves correlation coefficients of 0.56-0.67 across cell types
- Cell-Type Specificity: Microglia show the highest model performance, indicating stronger APA regulatory patterns
- Condition Comparison: Strong correlations (0.65-0.84) between C9ALS and sALS APA changes across cell types
- Biological Validation: APA changes correlate with known ALS pathways and RBP targets
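Per-cell-type correlations of the kind summarized above can be computed from predictions with a short helper such as the one below. It is a sketch only: it uses the Pearson coefficient, which is an assumption because the correlation type is not stated here, and correlation_by_cell_type is a hypothetical name rather than a function from the analysis notebooks.

```python
import numpy as np


def correlation_by_cell_type(cell_types, y_true, y_pred):
    """Return {cell_type: Pearson r} between observed and predicted APA usage."""
    cell_types = np.asarray(cell_types)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    results = {}
    for cell_type in np.unique(cell_types):
        mask = cell_types == cell_type
        if mask.sum() < 2:
            continue  # a correlation needs at least two points
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
        results[str(cell_type)] = float(np.corrcoef(y_true[mask], y_pred[mask])[0, 1])
    return results


if __name__ == "__main__":
    # Toy values for illustration only, not results from the paper.
    labels = ["a", "a", "a", "b", "b", "b"]
    observed = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
    predicted = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
    print(correlation_by_cell_type(labels, observed, predicted))
```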
The analysis scripts reference several data sources:
- Single-cell RNA-seq count matrices
- APA usage quantification results
- Cell type annotations
- RBP expression profiles
- Reference genome and annotations
Please ensure you have access to the appropriate datasets before running the analysis scripts.
If you use this analysis pipeline, please cite our paper:
[[Paper citation to be added upon publication]](https://www.biorxiv.org/content/10.1101/2023.12.22.573083v2)
For questions about the analysis pipeline, please open an issue in the GitHub repository.