|
1 | 1 | # APA-Net |
2 | 2 |
|
3 | | -APA-Net is a deep learning model designed for learning context specific APA usage. This guide covers the steps necessary to set up and run APA-Net. |
| 3 | +APA-Net is a deep learning model for learning context-specific alternative polyadenylation (APA) usage. This guide covers the steps needed to set up and run APA-Net. |
| 4 | + |
| 5 | +## Requirements |
| 6 | + |
| 7 | +- Python 3.8 or higher |
| 8 | +- PyTorch 1.8.0 or higher |
| 9 | +- NumPy |
| 10 | +- Pandas |
| 11 | +- SciPy |
| 12 | +- tqdm |
| 13 | +- wandb (optional, for experiment tracking) |
4 | 14 |
|
5 | 15 | ## Installation |
6 | 16 |
|
7 | | -Before running APA-Net, ensure you have Python installed on your system. Clone this repository to your local machine: |
| 17 | +### Option 1: Install from source (Recommended) |
8 | 18 |
|
| 19 | +1. Clone this repository to your local machine: |
9 | 20 | ```bash |
10 | 21 | git clone https://github.com/BaderLab/APA-Net.git |
11 | 22 | cd APA-Net |
| 23 | +``` |
| 24 | + |
| 25 | +2. Install dependencies manually for better control: |
| 26 | +```bash |
| 27 | +# For CPU-only version (smaller download) |
| 28 | +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| 29 | + |
| 30 | +# For GPU version (if you have CUDA) |
| 31 | +pip install torch torchvision torchaudio |
12 | 32 |
|
| 33 | +# Install other dependencies |
| 34 | +pip install numpy pandas scipy tqdm wandb |
| 35 | +``` |
| 36 | + |
| 37 | +3. Install the package: |
| 38 | +```bash |
13 | 39 | pip install . |
| 40 | +``` |
| 41 | + |
| 42 | +### Option 2: One-command installation |
| 43 | +```bash |
| 44 | +pip install . |
| 45 | +``` |
| 46 | +*Note: This installs the default PyTorch build with CUDA support, which is a large download (~2 GB).* |
| 47 | + |
| 48 | +## Data Format |
| 49 | + |
| 50 | +APA-Net expects input data in `.npy` format with the following structure: |
| 51 | +- **Shape**: `(n_samples, 9)` where each row represents one sample |
| 52 | +- **Columns**: |
| 53 | + - Column 0: Float value (sample ID/index) |
| 54 | + - Column 1: String (cell type name) |
| 55 | + - Column 2: String (additional metadata) |
| 56 | + - Column 3: Float value |
| 57 | + - Column 4: String (additional metadata) |
| 58 | + - Column 5: String (genomic coordinates/switch name) |
| 59 | + - Column 6: NumPy array of shape `(4, 4000)` - one-hot encoded DNA sequence |
| 60 | + - Column 7: Float (target APA usage value) |
| 61 | + - Column 8: NumPy array of shape `(327,)` - cell type profile features |
| 62 | + |
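As an illustration of the layout above, the following sketch builds a tiny synthetic dataset and round-trips it through `.npy`. It is a minimal example under stated assumptions, not the project's preprocessing code: the cell type name, metadata strings, coordinate string, random values, and file name are all placeholders invented for the example. Because each row mixes strings and arrays, the array needs `dtype=object` and has to be loaded with `allow_pickle=True`.

```python
import os
import tempfile

import numpy as np


def make_synthetic_dataset(n_samples=3, seq_len=4000, n_profile=327, seed=0):
    """Build an object array of shape (n_samples, 9) following the documented columns."""
    rng = np.random.default_rng(seed)
    data = np.empty((n_samples, 9), dtype=object)
    for i in range(n_samples):
        # Random one-hot sequence: exactly one active channel per position.
        seq = np.zeros((4, seq_len), dtype=np.float32)
        seq[rng.integers(0, 4, size=seq_len), np.arange(seq_len)] = 1.0

        data[i, 0] = float(i)                # sample ID/index
        data[i, 1] = "placeholder_celltype"  # cell type name (placeholder)
        data[i, 2] = "meta_a"                # additional metadata (placeholder)
        data[i, 3] = 0.0                     # float value (meaning not documented)
        data[i, 4] = "meta_b"                # additional metadata (placeholder)
        data[i, 5] = "chr1:1000:2000"        # coordinates/switch name (made-up format)
        data[i, 6] = seq                     # one-hot sequence, shape (4, 4000)
        data[i, 7] = float(rng.random())     # target APA usage value
        data[i, 8] = rng.random(n_profile).astype(np.float32)  # cell type profile
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic_fold.npy")  # hypothetical file name
    np.save(path, make_synthetic_dataset())
    loaded = np.load(path, allow_pickle=True)  # required for object arrays

print(loaded.shape, loaded[0, 6].shape, loaded[0, 8].shape)
```

A file produced this way is only useful for checking that the loading code runs; it carries no biological signal.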
| 63 | +## Usage |
| 64 | + |
| 65 | +### Training the Model |
| 66 | + |
| 67 | +To train the APA-Net model, run `train_script.py` from the `apamodel` directory: |
| 68 | + |
| 69 | +```bash |
| 70 | +cd apamodel |
| 71 | +python train_script.py \ |
| 72 | + --train_data "/path/to/train_data.npy" \ |
| 73 | + --valid_data "/path/to/valid_data.npy" \ |
| 74 | + --modelfile "/path/to/model_output.pt" \ |
| 75 | + --batch_size 64 \ |
| 76 | + --epochs 200 \ |
| 77 | + --device "cpu" \ |
| 78 | + --use_wandb "False" |
| 79 | +``` |
| 80 | + |
| 81 | +### Testing the Model |
| 82 | + |
| 83 | +You can check that the model builds correctly with a short inline script (replace `your_data.npy` with your own data file): |
| 84 | + |
| 85 | +```bash |
| 86 | +# Create a simple test script |
| 87 | +python -c " |
| 88 | +import sys |
| 89 | +sys.path.append('./apamodel') |
| 90 | +from model import APANET, APAData |
| 91 | +import numpy as np |
| 92 | +import torch |
14 | 93 |
|
| 94 | +# Load your data |
| 95 | +data = np.load('your_data.npy', allow_pickle=True) |
| 96 | +
|
| 97 | +# Configure model (using CPU) |
| 98 | +config = { |
| 99 | + 'device': 'cpu', |
| 100 | + 'opt': 'Adam', |
| 101 | + 'loss': 'mse', |
| 102 | + 'lr': 2.5e-05, |
| 103 | + 'adam_weight_decay': 0.09, |
| 104 | + 'conv1kc': 128, |
| 105 | + 'conv1ks': 12, |
| 106 | + 'conv1st': 1, |
| 107 | + 'pool1ks': 16, |
| 108 | + 'pool1st': 16, |
| 109 | + 'cnvpdrop1': 0, |
| 110 | + 'Matt_heads': 8, |
| 111 | + 'Matt_drop': 0.2, |
| 112 | + 'fc1_dims': [8192, 4048, 1024, 512, 256], |
| 113 | + 'fc1_dropouts': [0.25, 0.25, 0.25, 0, 0], |
| 114 | + 'fc2_dims': [128, 32, 16, 1], |
| 115 | + 'fc2_dropouts': [0.2, 0.2, 0, 0], |
| 116 | + 'psa_query_dim': 128, |
| 117 | + 'psa_num_layers': 1, |
| 118 | + 'psa_nhead': 1, |
| 119 | + 'psa_dim_feedforward': 1024, |
| 120 | + 'psa_dropout': 0 |
| 121 | +} |
| 122 | +
|
| 123 | +# Create and test model |
| 124 | +model = APANET(config) |
| 125 | +model.compile() |
| 126 | +print('Model created successfully!') |
| 127 | +" |
15 | 128 | ``` |
16 | 129 |
|
17 | | -# Usage |
| 130 | +## Command Line Arguments |
| 131 | + |
| 132 | +- `--train_data`: Path to the training data file (required) |
| 133 | +- `--valid_data`: Path to the validation data file (required) |
| 134 | +- `--modelfile`: Path where the trained model will be saved (required) |
| 135 | +- `--batch_size`: Batch size for training (default: 64) |
| 136 | +- `--epochs`: Number of training epochs (default: 200) |
| 137 | +- `--project_name`: Name of the project for wandb logging (default: "APA-Net_Training") |
| 138 | +- `--device`: Device to run the training on - use "cpu" or "cuda:0" (default: "cuda:0") |
| 139 | +- `--use_wandb`: Enable wandb logging - "True" or "False" (default: "True") |
| 140 | + |
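For readers who want to wrap or script the training command, the arguments listed above could be declared with `argparse` roughly as follows. This is a hypothetical sketch built only from the documented names and defaults, not the contents of `train_script.py`; in particular, the `str_to_bool` helper is an assumption about how the `"True"`/`"False"` strings might be converted.

```python
import argparse


def str_to_bool(value):
    """Convert the documented "True"/"False" strings to a bool (assumed behaviour)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Declare the arguments documented in this README with their stated defaults."""
    parser = argparse.ArgumentParser(description="Train APA-Net (illustrative parser)")
    parser.add_argument("--train_data", required=True, help="training data .npy file")
    parser.add_argument("--valid_data", required=True, help="validation data .npy file")
    parser.add_argument("--modelfile", required=True, help="where to save the model")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--project_name", default="APA-Net_Training")
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--use_wandb", type=str_to_bool, default=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--train_data", "train.npy", "--valid_data", "valid.npy",
     "--modelfile", "model.pt", "--device", "cpu", "--use_wandb", "False"]
)
print(args.batch_size, args.device, args.use_wandb)
```

Note that the documented default device is `"cuda:0"`, so `--device "cpu"` has to be passed explicitly on machines without a GPU.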
| 141 | +## Model Architecture |
| 142 | + |
| 143 | +APA-Net is a deep neural network that combines: |
| 144 | +- **Convolutional layers** for sequence feature extraction |
| 145 | +- **Self-attention mechanism** for capturing long-range dependencies |
| 146 | +- **Fully connected layers** for prediction |
| 147 | +- **Cell type profile integration** for context-specific modeling |
18 | 148 |
|
19 | | -To train the APA-Net model, use the train_script.py script with the necessary command-line arguments: |
| 149 | +The model has approximately 301M parameters and processes: |
| 150 | +- Input: DNA sequences (4×4000) + cell type profiles (327 features) |
| 151 | +- Output: APA usage prediction (single value) |
| 152 | + |
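To get a feel for how the 4000-position input shrinks before the attention and fully connected layers, the arithmetic below applies the standard output-length formula for unpadded 1-D convolution and pooling to the values from the configuration in the test snippet (`conv1ks`, `conv1st`, `pool1ks`, `pool1st`). It assumes that `ks` means kernel size and `st` means stride, and that no padding or dilation is used; the real model may differ, so treat the numbers as a rough guide rather than a description of the implementation.

```python
def output_length(length, kernel_size, stride):
    """Length after a 1-D convolution or pooling layer with no padding and dilation 1."""
    return (length - kernel_size) // stride + 1


seq_len = 4000
# Convolution: assumed kernel size 12 (conv1ks) and stride 1 (conv1st).
after_conv = output_length(seq_len, kernel_size=12, stride=1)
# Pooling: assumed kernel size 16 (pool1ks) and stride 16 (pool1st).
after_pool = output_length(after_conv, kernel_size=16, stride=16)

print(f"after convolution: {after_conv} positions")
print(f"after pooling:     {after_pool} positions")
```

Under these assumptions, each of the `conv1kc` filters would yield one value per remaining position, and that sequence of feature vectors is what the attention block would see.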
| 153 | +## Troubleshooting |
| 154 | + |
| 155 | +### Common Issues |
| 156 | + |
| 157 | +1. **CUDA errors**: If you encounter CUDA-related errors, install the CPU-only version of PyTorch: |
| 158 | + ```bash |
| 159 | + pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| 160 | + ``` |
| 161 | + |
| 162 | +2. **Memory issues**: Reduce batch size if you encounter out-of-memory errors: |
| 163 | + ```bash |
| 164 | + --batch_size 32 |
| 165 | + ``` |
| 166 | + |
| 167 | +3. **Data format errors**: Ensure your data has the correct shape `(n_samples, 9)` with sequences of shape `(4, 4000)` and cell type profiles of shape `(327,)`. |
| 168 | + |
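A quick way to catch data format errors before training is to check every row against the expected shapes. The helper below is a hypothetical sketch (`check_dataset` is not part of APA-Net) that relies only on the layout described in this README: nine columns, a `(4, 4000)` sequence in column 6, and a `(327,)` profile in column 8.

```python
import numpy as np


def check_dataset(data, seq_shape=(4, 4000), profile_len=327):
    """Return a list of problems found; an empty list means the layout looks right."""
    if data.ndim != 2 or data.shape[1] != 9:
        return [f"expected shape (n_samples, 9), got {data.shape}"]

    problems = []
    for i, row in enumerate(data):
        # np.shape works for arrays, lists, and scalars alike.
        if np.shape(row[6]) != seq_shape:
            problems.append(f"row {i}: sequence shape {np.shape(row[6])}, expected {seq_shape}")
        if np.shape(row[8]) != (profile_len,):
            problems.append(f"row {i}: profile shape {np.shape(row[8])}, expected ({profile_len},)")
    return problems
```

Typical usage would be `check_dataset(np.load("your_data.npy", allow_pickle=True))`, printing any returned messages before starting a long training run.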
| 169 | +### CPU vs GPU Usage |
| 170 | + |
| 171 | +- **CPU**: Slower but more compatible. Use `--device "cpu"` |
| 172 | +- **GPU**: Faster training. Use `--device "cuda:0"` (requires CUDA-compatible PyTorch installation) |
| 173 | + |
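When the same command has to run on machines with and without a GPU, the device string can be chosen programmatically. The sketch below is an assumption-labelled helper (`pick_device` is a hypothetical name, not an APA-Net function); it uses only `torch.cuda.is_available()` and falls back to `"cpu"` when PyTorch or CUDA is missing.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if it is usable, otherwise fall back to "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred


device = pick_device()
print(f"Using device: {device}")
```

The resulting string can be passed to `--device` or placed in the `'device'` entry of the model configuration.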
| 174 | +## Example |
| 175 | + |
| 176 | +Here's a complete example of training APA-Net: |
20 | 177 |
|
21 | 178 | ```bash |
| 179 | +# Navigate to the model directory |
| 180 | +cd APA-Net/apamodel |
| 181 | + |
| 182 | +# Train the model |
22 | 183 | python train_script.py \ |
23 | | ---train_data "/path/to/train_data.npy" \ |
24 | | ---train_seq "/path/to/train_seq.npy" \ |
25 | | ---valid_data "/path/to/valid_data.npy" \ |
26 | | ---valid_seq "/path/to/valid_seq.npy" \ |
27 | | ---profiles "/path/to/celltype_profiles.tsv" \ |
28 | | ---modelfile "/path/to/model_output.pt" \ |
29 | | ---batch_size 64 \ |
30 | | ---epochs 200 \ |
31 | | ---project_name "APA-Net_Training" \ |
32 | | ---device "cuda:1" \ |
33 | | ---use_wandb "True" |
34 | | -``` |
35 | | - |
36 | | -# Arguments |
37 | | -- `--train_data`: Path to the training data file. |
38 | | -- `--train_seq`: Path to the training sequence data file. |
39 | | -- `--valid_data`: Path to the validation data file. |
40 | | -- `--valid_seq`: Path to the validation sequence data file. |
41 | | -- `--profiles`: Path to the cell type profiles file. |
42 | | -- `--modelfile`: Path where the trained model will be saved. |
43 | | -- `--batch_size`: Batch size for training (default: 64). |
44 | | -- `--epochs`: Number of training epochs (default: 200). |
45 | | -- `--project_name`: Name of the project for wandb logging. |
46 | | -- `--device`: Device to run the training on (e.g., 'cuda:1'). |
47 | | -- `--use_wandb`: Flag to enable or disable wandb logging ('True' or 'False'). |
| 184 | + --train_data "../test_fold_0.npy" \ |
| 185 | + --valid_data "../test_fold_0.npy" \ |
| 186 | + --modelfile "./trained_model.pt" \ |
| 187 | + --batch_size 32 \ |
| 188 | + --epochs 50 \ |
| 189 | + --device "cpu" \ |
| 190 | + --use_wandb "False" \ |
| 191 | + --project_name "APA-Net_Test" |
| 192 | +``` |
| 193 | + |
| 194 | +## Analysis and Figures |
| 195 | + |
| 196 | +The `analysis_and_figures/` directory contains all the code and notebooks used to reproduce the results and figures from our APA-Net research paper. This comprehensive analysis pipeline covers data processing, model evaluation, comparative analysis, and visualization. |
| 197 | + |
| 198 | +### Directory Structure |
| 199 | + |
| 200 | +``` |
| 201 | +analysis_and_figures/ |
| 202 | +├── model_performance/ # APA-Net model evaluation and performance analysis |
| 203 | +├── data_processing/ # Data preparation and preprocessing for APA-Net |
| 204 | +├── comparative_analysis/ # Comparative studies (APA vs DE, correlations) |
| 205 | +├── visualization/ # Figure generation and plotting scripts |
| 206 | +├── gene_expression/ # Differential gene expression analysis |
| 207 | +├── pathway_analysis/ # Gene set enrichment and pathway analysis |
| 208 | +├── preprocessing/ # Single-cell RNA-seq data preprocessing pipeline |
| 209 | +└── functions/ # Utility functions and helper scripts |
| 210 | +``` |
| 211 | + |
| 212 | +### Getting Started with Analysis |
| 213 | + |
| 214 | +1. **Prerequisites**: Make sure you have the following R and Python packages installed: |
| 215 | + |
| 216 | +**R packages:** |
| 217 | +```r |
| 218 | +install.packages(c("dplyr", "ggplot2", "tidyr", "viridis", "patchwork", |
| 219 | + "readxl", "gridExtra", "ggpubr", "ggrepel", "reshape2", |
| 220 | + "corrplot", "pheatmap", "boot", "Seurat", "scCustomize")) |
| 221 | +``` |
| 222 | + |
| 223 | +**Python packages:** |
| 224 | +```bash |
| 225 | +pip install pandas numpy scipy matplotlib seaborn scikit-learn statsmodels |
| 226 | +``` |
| 227 | + |
| 228 | +2. **Data Requirements**: The analysis scripts expect data in specific locations. You may need to adjust file paths in the notebooks to match your data directory structure. |
| 229 | + |
| 230 | +### Analysis Modules |
| 231 | + |
| 232 | +#### 1. Model Performance (`model_performance/`) |
| 233 | +- **APA-NET_performance_plots.ipynb**: Generates correlation plots showing model performance across cell types |
| 234 | +- **APA-Net_filter_interactions.ipynb**: Analyzes convolutional filter interactions and RBP binding patterns |
| 235 | +- **APA-Net_heatmap_for_filter_interactions.ipynb**: Creates heatmaps showing filter-RBP interactions |
| 236 | + |
| 237 | +#### 2. Data Processing (`data_processing/`) |
| 238 | +- **Process_inputs_for_APA-Net.ipynb**: Main data preprocessing pipeline for APA-Net training data |
| 239 | + - Processes RNA sequences and APA usage data |
| 240 | + - Generates one-hot encoded sequences |
| 241 | + - Creates 5-fold cross-validation splits |
| 242 | + - Formats data for model training |
| 243 | +- **APA_quantification_maaper_apalog_Dec2024.ipynb**: APA event quantification using MAAPER |
| 244 | +- **emprical_fdr_thresholds_maaper_apalog.ipynb**: Determines empirical FDR thresholds for significance testing |
| 245 | + |
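The one-hot encoding and cross-validation steps listed for `Process_inputs_for_APA-Net.ipynb` can be sketched as follows. This is an illustrative approximation, not the notebook's code: the channel order `A, C, G, T`, the right-padding of short sequences with zeros, the all-zero encoding of ambiguous bases, and the purely random fold assignment are all assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; the notebook may use a different one


def one_hot_encode(seq, length=4000):
    """Return a (4, length) float32 array; unknown bases such as N stay all-zero."""
    encoded = np.zeros((4, length), dtype=np.float32)
    for pos, base in enumerate(seq.upper()[:length]):
        row = BASES.find(base)
        if row >= 0:
            encoded[row, pos] = 1.0
    return encoded


def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into 5 roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), 5)


encoded = one_hot_encode("ACGTN")
folds = five_fold_indices(23)
print(encoded.shape, [len(fold) for fold in folds])
```

In practice, folds are often assigned by gene or chromosome rather than at random so that closely related sequences do not end up in both training and validation sets; whether the notebook does so is not stated here.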
| 246 | +#### 3. Comparative Analysis (`comparative_analysis/`) |
| 247 | +- **APA_vs_DE.ipynb**: Compares APA changes with differential expression |
| 248 | + - Correlation analysis between APA usage and gene expression changes |
| 249 | + - Cell-type-specific comparisons |
| 250 | + - Statistical significance testing |
| 251 | +- **apa_correlation_across_celltypes.ipynb**: Cross-cell-type APA correlation analysis |
| 252 | +- **rbp_co_occurance_dissimilarity.ipynb**: RNA-binding protein co-occurrence analysis |
| 253 | + |
| 254 | +#### 4. Visualization (`visualization/`) |
| 255 | +- **maaper_volcanos_barplots_figure6.ipynb**: Creates volcano plots and bar plots for Figure 6 |
| 256 | + - APA usage changes across conditions |
| 257 | + - Cell-type-specific visualizations |
| 258 | + - Statistical significance visualization |
| 259 | + |
| 260 | +#### 5. Gene Expression (`gene_expression/`) |
| 261 | +- **DEG_ALS_genes.R**: Analysis of ALS-associated gene expression |
| 262 | +- **DEG_MAST_analysis.R**: MAST-based differential expression analysis |
| 263 | +- **DEG_pathway_analysis.R**: Pathway enrichment analysis for DEGs |
| 264 | +- **DEG_visualization.R**: Visualization of differential expression results |
| 265 | + |
| 266 | +#### 6. Pathway Analysis (`pathway_analysis/`) |
| 267 | +- **APA_pathway_analysis.R**: Gene set enrichment analysis for APA-affected genes |
| 268 | + - GO term enrichment |
| 269 | + - Reactome pathway analysis |
| 270 | + - Custom gene set analysis |
| 271 | + |
| 272 | +#### 7. Preprocessing (`preprocessing/`) |
| 273 | +- **processing_annotation/**: Single-cell RNA-seq processing pipeline |
| 274 | + - `01_snRNA_cellranger_preprocess.sh`: Cell Ranger preprocessing |
| 275 | + - `02_snRNA_process_QC.R`: Quality control and filtering |
| 276 | + - `03_snRNA_clustering_annotation.R`: Cell clustering and annotation |
| 277 | + - `04a_snRNA_NSForest1.ipynb` & `04b_snRNA_NSForest2.ipynb`: NSForest cell type classification |
| 278 | +- **independent_datasets/**: Processing of additional validation datasets |
| 279 | + - `01_read_matrices.R`: Matrix reading and preprocessing |
| 280 | + - `02_harmony_int.R`: Harmony integration for batch correction |
| 281 | + - `03_doublet_removal_annotation.R`: Doublet detection and removal |
| 282 | + |
| 283 | +### Reproducing Key Results |
| 284 | + |
| 285 | +#### Figure Generation |
| 286 | +To reproduce the main figures from the paper: |
| 287 | + |
| 288 | +1. **Model Performance Plots**: |
| 289 | + ```bash |
| 290 | + cd analysis_and_figures/model_performance |
| 291 | + jupyter notebook APA-NET_performance_plots.ipynb |
| 292 | + ``` |
| 293 | + |
| 294 | +2. **APA Usage Analysis**: |
| 295 | + ```bash |
| 296 | + cd analysis_and_figures/visualization |
| 297 | + jupyter notebook maaper_volcanos_barplots_figure6.ipynb |
| 298 | + ``` |
| 299 | + |
| 300 | +3. **Comparative Analysis**: |
| 301 | + ```bash |
| 302 | + cd analysis_and_figures/comparative_analysis |
| 303 | + jupyter notebook APA_vs_DE.ipynb |
| 304 | + ``` |
| 305 | + |
| 306 | +#### Data Processing Pipeline |
| 307 | +To process your own data through the complete pipeline: |
| 308 | + |
| 309 | +1. **Start with raw single-cell data**: |
| 310 | + ```bash |
| 311 | + cd analysis_and_figures/preprocessing/processing_annotation |
| 312 | + bash 01_snRNA_cellranger_preprocess.sh |
| 313 | + ``` |
| 314 | + |
| 315 | +2. **Process and prepare for APA-Net**: |
| 316 | + ```bash |
| 317 | + cd analysis_and_figures/data_processing |
| 318 | + jupyter notebook Process_inputs_for_APA-Net.ipynb |
| 319 | + ``` |
| 320 | + |
| 321 | +### Key Results and Interpretations |
| 322 | + |
| 323 | +- **Model Performance**: APA-Net achieves correlation coefficients of 0.56-0.67 across cell types |
| 324 | +- **Cell-Type Specificity**: Microglia show the highest model performance, indicating stronger APA regulatory patterns |
| 325 | +- **Condition Comparison**: Strong correlations (0.65-0.84) between C9ALS and sALS APA changes across cell types |
| 326 | +- **Biological Validation**: APA changes correlate with known ALS pathways and RBP targets |
| 327 | + |
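Per-cell-type correlations like those summarised above can be computed by grouping predictions by cell type. The sketch below assumes Pearson correlation (the list above does not say which coefficient was used) and uses made-up cell type labels and values; `per_cell_type_correlation` is a hypothetical helper, not a function from the notebooks.

```python
import numpy as np


def per_cell_type_correlation(cell_types, y_true, y_pred):
    """Return {cell_type: Pearson r between observed and predicted APA usage}."""
    cell_types = np.asarray(cell_types)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    result = {}
    for cell_type in np.unique(cell_types):
        mask = cell_types == cell_type
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
        result[str(cell_type)] = float(np.corrcoef(y_true[mask], y_pred[mask])[0, 1])
    return result


# Toy data with invented labels and values, for illustration only.
labels = ["type_a"] * 4 + ["type_b"] * 4
observed = [0.1, 0.4, 0.6, 0.9, 0.2, 0.3, 0.7, 0.8]
predicted = [0.2, 0.5, 0.7, 1.0, 0.8, 0.7, 0.3, 0.2]
correlations = per_cell_type_correlation(labels, observed, predicted)
print(correlations)
```

With real data, the labels would come from column 1 of the dataset and the observed values from column 7.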
| 328 | +### Data Availability |
| 329 | + |
| 330 | +The analysis scripts reference several data sources: |
| 331 | +- Single-cell RNA-seq count matrices |
| 332 | +- APA usage quantification results |
| 333 | +- Cell type annotations |
| 334 | +- RBP expression profiles |
| 335 | +- Reference genome and annotations |
| 336 | + |
| 337 | +Please ensure you have access to the appropriate datasets before running the analysis scripts. |
| 338 | + |
| 339 | +### Citation |
| 340 | + |
| 341 | +If you use this analysis pipeline, please cite our paper: |
| 342 | +``` |
| 343 | +[Paper citation to be added upon publication] |
| 344 | +``` |
| 345 | + |
| 346 | +For questions about the analysis pipeline, please open an issue in the GitHub repository. |
48 | 347 |
|