Commit ea317b8

committed
add analysis and figures codes
1 parent e2ec6d8 commit ea317b8

32 files changed: +37038 -29 lines changed

README.md

Lines changed: 328 additions & 29 deletions
@@ -1,48 +1,347 @@
11
# APA-Net
22

3-
APA-Net is a deep learning model designed for learning context specific APA usage. This guide covers the steps necessary to set up and run APA-Net.
3+
APA-Net is a deep learning model designed for learning context-specific APA (Alternative Polyadenylation) usage. This guide covers the steps necessary to set up and run APA-Net.
4+
5+
## Requirements
6+
7+
- Python 3.8 or higher
8+
- PyTorch 1.8.0 or higher
9+
- NumPy
10+
- Pandas
11+
- SciPy
12+
- tqdm
13+
- wandb (optional, for experiment tracking)
414

515
## Installation
616

7-
Before running APA-Net, ensure you have Python installed on your system. Clone this repository to your local machine:
17+
### Option 1: Install from source (Recommended)
818

19+
1. Clone this repository to your local machine:
920
```bash
1021
git clone https://github.com/BaderLab/APA-Net.git
1122
cd APA-Net
23+
```
24+
25+
2. Install dependencies manually for better control:
26+
```bash
27+
# For CPU-only version (smaller download)
28+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
29+
30+
# For GPU version (if you have CUDA)
31+
pip install torch torchvision torchaudio
1232

33+
# Install other dependencies
34+
pip install numpy pandas scipy tqdm wandb
35+
```
36+
37+
3. Install the package:
38+
```bash
1339
pip install .
40+
```
41+
42+
### Option 2: One-command installation
43+
```bash
44+
pip install .
45+
```
46+
*Note: Run this from the repository root. It installs the default PyTorch build with CUDA support, which is a large download (~2 GB).*
47+
48+
## Data Format
49+
50+
APA-Net expects input data as a NumPy object array saved in `.npy` format (load it with `allow_pickle=True`), with the following structure:
51+
- **Shape**: `(n_samples, 9)` where each row represents one sample
52+
- **Columns**:
53+
- Column 0: Float value (sample ID/index)
54+
- Column 1: String (cell type name)
55+
- Column 2: String (additional metadata)
56+
- Column 3: Float value
57+
- Column 4: String (additional metadata)
58+
- Column 5: String (genomic coordinates/switch name)
59+
- Column 6: NumPy array of shape `(4, 4000)` - one-hot encoded DNA sequence
60+
- Column 7: Float (target APA usage value)
61+
- Column 8: NumPy array of shape `(327,)` - cell type profile features
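
As a rough illustration of the layout above, the following sketch builds a few synthetic rows and round-trips them through a `.npy` file. The helper name `make_example_rows`, the placeholder strings, and the A/C/G/T channel order are assumptions made for this example; they are not taken from the APA-Net code.

```python
import os
import tempfile

import numpy as np


def make_example_rows(n_samples, seed=0):
    """Build synthetic rows matching the 9-column layout (illustrative only)."""
    rng = np.random.default_rng(seed)
    rows = np.empty((n_samples, 9), dtype=object)
    for i in range(n_samples):
        # Random bases, one-hot encoded into a (4, 4000) array.
        bases = rng.integers(0, 4, size=4000)
        seq = np.zeros((4, 4000), dtype=np.float32)
        seq[bases, np.arange(4000)] = 1.0

        rows[i, 0] = float(i)              # sample ID/index
        rows[i, 1] = "Microglia"           # cell type name (placeholder)
        rows[i, 2] = "meta_a"              # additional metadata (placeholder)
        rows[i, 3] = 0.0                   # float value
        rows[i, 4] = "meta_b"              # additional metadata (placeholder)
        rows[i, 5] = "chr1:1000-5000"      # coordinates/switch name (placeholder)
        rows[i, 6] = seq                   # one-hot DNA sequence
        rows[i, 7] = float(rng.uniform(-1.0, 1.0))        # target APA usage
        rows[i, 8] = rng.random(327).astype(np.float32)   # cell type profile
    return rows


rows = make_example_rows(3)

# Object arrays are pickled on save, so loading needs allow_pickle=True.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_data.npy")
    np.save(path, rows)
    loaded = np.load(path, allow_pickle=True)

print(loaded.shape, loaded[0, 6].shape, loaded[0, 8].shape)
```

Synthetic rows like these are only useful for checking that the loading code runs; they carry no biological signal.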
62+
63+
## Usage
64+
65+
### Training the Model
66+
67+
To train the APA-Net model, run `train_script.py` from the `apamodel` directory:
68+
69+
```bash
70+
cd apamodel
71+
python train_script.py \
72+
--train_data "/path/to/train_data.npy" \
73+
--valid_data "/path/to/valid_data.npy" \
74+
--modelfile "/path/to/model_output.pt" \
75+
--batch_size 64 \
76+
--epochs 200 \
77+
--device "cpu" \
78+
--use_wandb "False"
79+
```
80+
81+
### Testing the Model
82+
83+
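You can check that the model builds correctly with a short script. The loaded data is not passed to the model here; replace `your_data.npy` with the path to your own file: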
84+
85+
```bash
86+
# Create a simple test script
87+
python -c "
88+
import sys
89+
sys.path.append('./apamodel')
90+
from model import APANET, APAData
91+
import numpy as np
92+
import torch
1493
94+
# Load your data
95+
data = np.load('your_data.npy', allow_pickle=True)
96+
97+
# Configure model (using CPU)
98+
config = {
99+
'device': 'cpu',
100+
'opt': 'Adam',
101+
'loss': 'mse',
102+
'lr': 2.5e-05,
103+
'adam_weight_decay': 0.09,
104+
'conv1kc': 128,
105+
'conv1ks': 12,
106+
'conv1st': 1,
107+
'pool1ks': 16,
108+
'pool1st': 16,
109+
'cnvpdrop1': 0,
110+
'Matt_heads': 8,
111+
'Matt_drop': 0.2,
112+
'fc1_dims': [8192, 4048, 1024, 512, 256],
113+
'fc1_dropouts': [0.25, 0.25, 0.25, 0, 0],
114+
'fc2_dims': [128, 32, 16, 1],
115+
'fc2_dropouts': [0.2, 0.2, 0, 0],
116+
'psa_query_dim': 128,
117+
'psa_num_layers': 1,
118+
'psa_nhead': 1,
119+
'psa_dim_feedforward': 1024,
120+
'psa_dropout': 0
121+
}
122+
123+
# Create and test model
124+
model = APANET(config)
125+
model.compile()
126+
print('Model created successfully!')
127+
"
15128
```
16129

17-
# Usage
130+
## Command Line Arguments
131+
132+
- `--train_data`: Path to the training data file (required)
133+
- `--valid_data`: Path to the validation data file (required)
134+
- `--modelfile`: Path where the trained model will be saved (required)
135+
- `--batch_size`: Batch size for training (default: 64)
136+
- `--epochs`: Number of training epochs (default: 200)
137+
- `--project_name`: Name of the project for wandb logging (default: "APA-Net_Training")
138+
- `--device`: Device to run the training on - use "cpu" or "cuda:0" (default: "cuda:0")
139+
- `--use_wandb`: Enable wandb logging - "True" or "False" (default: "True")
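
Because `--use_wandb` is passed as the string "True" or "False", a parser needs to convert it explicitly (a non-empty string such as "False" is truthy in Python). The sketch below shows one way the arguments listed above could be declared with `argparse`. It is not the actual `train_script.py`; the helper names `str_to_bool` and `build_parser` are hypothetical, and only the flag names and defaults come from the list above.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to a real boolean."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected 'True' or 'False', got {value!r}")


def build_parser():
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Illustrative APA-Net argument parser")
    parser.add_argument("--train_data", required=True)
    parser.add_argument("--valid_data", required=True)
    parser.add_argument("--modelfile", required=True)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--project_name", default="APA-Net_Training")
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--use_wandb", type=str_to_bool, default=True)
    return parser


args = build_parser().parse_args([
    "--train_data", "train.npy",
    "--valid_data", "valid.npy",
    "--modelfile", "model.pt",
    "--device", "cpu",
    "--use_wandb", "False",
])
print(args.device, args.use_wandb, args.batch_size)
```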
140+
141+
## Model Architecture
142+
143+
APA-Net is a deep neural network that combines:
144+
- **Convolutional layers** for sequence feature extraction
145+
- **Self-attention mechanism** for capturing long-range dependencies
146+
- **Fully connected layers** for prediction
147+
- **Cell type profile integration** for context-specific modeling
18148

19-
To train the APA-Net model, use the train_script.py script with the necessary command-line arguments:
149+
The model has approximately 301M parameters and processes:
150+
- Input: DNA sequences (4×4000) + cell type profiles (327 features)
151+
- Output: APA usage prediction (single value)
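
To make the sequence dimensions concrete, the sketch below works through the output length of one convolution followed by one pooling step, using the kernel and stride values from the example configuration in this README (`conv1ks`, `conv1st`, `pool1ks`, `pool1st`, `conv1kc`). It assumes no padding and no dilation, which this README does not state, so the real model may produce different lengths.

```python
def output_length(input_length, kernel_size, stride):
    """Length after a 1D conv or pool with no padding and dilation 1."""
    return (input_length - kernel_size) // stride + 1


sequence_length = 4000   # input is 4 x 4000
conv_channels = 128      # conv1kc
conv_len = output_length(sequence_length, kernel_size=12, stride=1)   # conv1ks, conv1st
pooled_len = output_length(conv_len, kernel_size=16, stride=16)       # pool1ks, pool1st

print("conv output length:", conv_len)      # 3989
print("pooled length:", pooled_len)         # 249
print("feature map shape:", (conv_channels, pooled_len))
```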
152+
153+
## Troubleshooting
154+
155+
### Common Issues
156+
157+
1. **CUDA errors**: If you encounter CUDA-related errors, install the CPU-only version of PyTorch and train with `--device "cpu"`:
158+
```bash
159+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
160+
```
161+
162+
2. **Memory issues**: Reduce batch size if you encounter out-of-memory errors:
163+
```bash
164+
--batch_size 32
165+
```
166+
167+
3. **Data format errors**: Ensure your data has the correct shape `(n_samples, 9)` with sequences of shape `(4, 4000)` and cell type profiles of shape `(327,)`.
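
A quick way to catch data format errors before training is to check the shapes directly. The following is a minimal sketch, assuming the data has been loaded as an object array; the function name `find_format_problems` is hypothetical and not part of APA-Net.

```python
import numpy as np

SEQ_SHAPE = (4, 4000)
PROFILE_SHAPE = (327,)


def find_format_problems(data):
    """Return a list of human-readable problems; an empty list means OK."""
    if data.ndim != 2 or data.shape[1] != 9:
        return [f"expected shape (n_samples, 9), got {data.shape}"]
    problems = []
    for i, row in enumerate(data):
        seq_shape = getattr(row[6], "shape", None)
        profile_shape = getattr(row[8], "shape", None)
        if seq_shape != SEQ_SHAPE:
            problems.append(f"row {i}: sequence has shape {seq_shape}, expected {SEQ_SHAPE}")
        if profile_shape != PROFILE_SHAPE:
            problems.append(f"row {i}: profile has shape {profile_shape}, expected {PROFILE_SHAPE}")
    return problems


# Demo: two rows, the second with a sequence that is too short.
demo = np.empty((2, 9), dtype=object)
for i in range(2):
    demo[i, 6] = np.zeros(SEQ_SHAPE, dtype=np.float32)
    demo[i, 8] = np.zeros(PROFILE_SHAPE, dtype=np.float32)
demo[1, 6] = np.zeros((4, 100), dtype=np.float32)

problems = find_format_problems(demo)
for message in problems:
    print(message)
```

With real data, the array would come from `np.load(path, allow_pickle=True)` instead of the demo rows.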
168+
169+
### CPU vs GPU Usage
170+
171+
- **CPU**: Slower but more compatible. Use `--device "cpu"`
172+
- **GPU**: Faster training. Use `--device "cuda:0"` (requires CUDA-compatible PyTorch installation)
173+
174+
## Example
175+
176+
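Here's a complete example of a short CPU training run. It reuses `test_fold_0.npy` for both training and validation, so treat it as a smoke test rather than a real evaluation: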
20177

21178
```bash
179+
# Navigate to the model directory
180+
cd APA-Net/apamodel
181+
182+
# Train the model
22183
python train_script.py \
23-
--train_data "/path/to/train_data.npy" \
24-
--train_seq "/path/to/train_seq.npy" \
25-
--valid_data "/path/to/valid_data.npy" \
26-
--valid_seq "/path/to/valid_seq.npy" \
27-
--profiles "/path/to/celltype_profiles.tsv" \
28-
--modelfile "/path/to/model_output.pt" \
29-
--batch_size 64 \
30-
--epochs 200 \
31-
--project_name "APA-Net_Training" \
32-
--device "cuda:1" \
33-
--use_wandb "True"
34-
```
35-
36-
# Arguments
37-
- `--train_data`: Path to the training data file.
38-
- `--train_seq`: Path to the training sequence data file.
39-
- `--valid_data`: Path to the validation data file.
40-
- `--valid_seq`: Path to the validation sequence data file.
41-
- `--profiles`: Path to the cell type profiles file.
42-
- `--modelfile`: Path where the trained model will be saved.
43-
- `--batch_size`: Batch size for training (default: 64).
44-
- `--epochs`: Number of training epochs (default: 200).
45-
- `--project_name`: Name of the project for wandb logging.
46-
- `--device`: Device to run the training on (e.g., 'cuda:1').
47-
- `--use_wandb`: Flag to enable or disable wandb logging ('True' or 'False').
184+
--train_data "../test_fold_0.npy" \
185+
--valid_data "../test_fold_0.npy" \
186+
--modelfile "./trained_model.pt" \
187+
--batch_size 32 \
188+
--epochs 50 \
189+
--device "cpu" \
190+
--use_wandb "False" \
191+
--project_name "APA-Net_Test"
192+
```
193+
194+
## Analysis and Figures
195+
196+
The `analysis_and_figures/` directory contains the code and notebooks used to reproduce the results and figures from the APA-Net paper. It covers data processing, model evaluation, comparative analysis, and visualization.
197+
198+
### Directory Structure
199+
200+
```
201+
analysis_and_figures/
202+
├── model_performance/ # APA-Net model evaluation and performance analysis
203+
├── data_processing/ # Data preparation and preprocessing for APA-Net
204+
├── comparative_analysis/ # Comparative studies (APA vs DE, correlations)
205+
├── visualization/ # Figure generation and plotting scripts
206+
├── gene_expression/ # Differential gene expression analysis
207+
├── pathway_analysis/ # Gene set enrichment and pathway analysis
208+
├── preprocessing/ # Single-cell RNA-seq data preprocessing pipeline
209+
└── functions/ # Utility functions and helper scripts
210+
```
211+
212+
### Getting Started with Analysis
213+
214+
1. **Prerequisites**: Make sure you have the following R and Python packages installed:
215+
216+
**R packages:**
217+
```r
218+
install.packages(c("dplyr", "ggplot2", "tidyr", "viridis", "patchwork",
219+
"readxl", "gridExtra", "ggpubr", "ggrepel", "reshape2",
220+
"corrplot", "pheatmap", "boot", "Seurat", "scCustomize"))
221+
```
222+
223+
**Python packages:**
224+
```bash
225+
pip install pandas numpy scipy matplotlib seaborn scikit-learn statsmodels
226+
```
227+
228+
2. **Data Requirements**: The analysis scripts expect data in specific locations. You may need to adjust file paths in the notebooks to match your data directory structure.
229+
230+
### Analysis Modules
231+
232+
#### 1. Model Performance (`model_performance/`)
233+
- **APA-NET_performance_plots.ipynb**: Generates correlation plots showing model performance across cell types
234+
- **APA-Net_filter_interactions.ipynb**: Analyzes convolutional filter interactions and RBP binding patterns
235+
- **APA-Net_heatmap_for_filter_interactions.ipynb**: Creates heatmaps showing filter-RBP interactions
236+
237+
#### 2. Data Processing (`data_processing/`)
238+
- **Process_inputs_for_APA-Net.ipynb**: Main data preprocessing pipeline for APA-Net training data
239+
- Processes RNA sequences and APA usage data
240+
- Generates one-hot encoded sequences
241+
- Creates 5-fold cross-validation splits
242+
- Formats data for model training
243+
- **APA_quantification_maaper_apalog_Dec2024.ipynb**: APA event quantification using MAAPER
244+
- **emprical_fdr_thresholds_maaper_apalog.ipynb**: Determines empirical FDR thresholds for significance testing
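
The one-hot encoding and 5-fold split steps listed for `Process_inputs_for_APA-Net.ipynb` can be sketched roughly as follows. This is not the notebook's code: the A/C/G/T row order, the treatment of `U` as `T`, the all-zero encoding for unknown bases, and the function names are all assumptions made for illustration.

```python
import numpy as np

# Assumed channel order; the notebook may use a different one.
BASE_TO_ROW = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def one_hot_encode(sequence):
    """Encode a nucleotide string as a (4, length) float array."""
    encoded = np.zeros((4, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        row = BASE_TO_ROW.get(base)
        if row is not None:  # unknown bases such as N stay all-zero
            encoded[row, position] = 1.0
    return encoded


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    return [np.sort(fold) for fold in np.array_split(shuffled, n_folds)]


encoded = one_hot_encode("ACGTN")
folds = k_fold_indices(23, n_folds=5)

print(encoded.shape)
print([len(fold) for fold in folds])
```

Each fold can serve once as the held-out set while the remaining four are concatenated for training.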
245+
246+
#### 3. Comparative Analysis (`comparative_analysis/`)
247+
- **APA_vs_DE.ipynb**: Compares APA changes with differential expression
248+
- Correlation analysis between APA usage and gene expression changes
249+
- Cell-type-specific comparisons
250+
- Statistical significance testing
251+
- **apa_correlation_across_celltypes.ipynb**: Cross-cell-type APA correlation analysis
252+
- **rbp_co_occurance_dissimilarity.ipynb**: RNA-binding protein co-occurrence analysis
253+
254+
#### 4. Visualization (`visualization/`)
255+
- **maaper_volcanos_barplots_figure6.ipynb**: Creates volcano plots and bar plots for Figure 6
256+
- APA usage changes across conditions
257+
- Cell-type-specific visualizations
258+
- Statistical significance visualization
259+
260+
#### 5. Gene Expression (`gene_expression/`)
261+
- **DEG_ALS_genes.R**: Analysis of ALS-associated gene expression
262+
- **DEG_MAST_analysis.R**: MAST-based differential expression analysis
263+
- **DEG_pathway_analysis.R**: Pathway enrichment analysis for DEGs
264+
- **DEG_visualization.R**: Visualization of differential expression results
265+
266+
#### 6. Pathway Analysis (`pathway_analysis/`)
267+
- **APA_pathway_analysis.R**: Gene set enrichment analysis for APA-affected genes
268+
- GO term enrichment
269+
- Reactome pathway analysis
270+
- Custom gene set analysis
271+
272+
#### 7. Preprocessing (`preprocessing/`)
273+
- **processing_annotation/**: Single-cell RNA-seq processing pipeline
274+
- `01_snRNA_cellranger_preprocess.sh`: Cell Ranger preprocessing
275+
- `02_snRNA_process_QC.R`: Quality control and filtering
276+
- `03_snRNA_clustering_annotation.R`: Cell clustering and annotation
277+
- `04a_snRNA_NSForest1.ipynb` & `04b_snRNA_NSForest2.ipynb`: NSForest cell type classification
278+
- **independent_datasets/**: Processing of additional validation datasets
279+
- `01_read_matrices.R`: Matrix reading and preprocessing
280+
- `02_harmony_int.R`: Harmony integration for batch correction
281+
- `03_doublet_removal_annotation.R`: Doublet detection and removal
282+
283+
### Reproducing Key Results
284+
285+
#### Figure Generation
286+
To reproduce the main figures from the paper:
287+
288+
1. **Model Performance Plots**:
289+
```bash
290+
cd analysis_and_figures/model_performance
291+
jupyter notebook APA-NET_performance_plots.ipynb
292+
```
293+
294+
2. **APA Usage Analysis**:
295+
```bash
296+
cd analysis_and_figures/visualization
297+
jupyter notebook maaper_volcanos_barplots_figure6.ipynb
298+
```
299+
300+
3. **Comparative Analysis**:
301+
```bash
302+
cd analysis_and_figures/comparative_analysis
303+
jupyter notebook APA_vs_DE.ipynb
304+
```
305+
306+
#### Data Processing Pipeline
307+
To process your own data through the complete pipeline:
308+
309+
1. **Start with raw single-cell data**:
310+
```bash
311+
cd analysis_and_figures/preprocessing/processing_annotation
312+
bash 01_snRNA_cellranger_preprocess.sh
313+
```
314+
315+
2. **Process and prepare for APA-Net**:
316+
```bash
317+
cd analysis_and_figures/data_processing
318+
jupyter notebook Process_inputs_for_APA-Net.ipynb
319+
```
320+
321+
### Key Results and Interpretations
322+
323+
- **Model Performance**: APA-Net achieves correlation coefficients of 0.56-0.67 across cell types
324+
- **Cell-Type Specificity**: Microglia show the highest model performance, suggesting stronger APA regulatory patterns
325+
- **Condition Comparison**: Strong correlations (0.65-0.84) between C9ALS and sALS APA changes across cell types
326+
- **Biological Validation**: APA changes correlate with known ALS pathways and RBP targets
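
Per-cell-type correlations like those quoted above can be computed by grouping predictions by cell type. The sketch below uses the Pearson coefficient, which is an assumption, since this README does not say which correlation measure was used; the function name and the toy values are also illustrative only.

```python
import numpy as np


def correlation_by_cell_type(cell_types, targets, predictions):
    """Pearson correlation between targets and predictions for each cell type."""
    cell_types = np.asarray(cell_types)
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    result = {}
    for name in np.unique(cell_types):
        mask = cell_types == name
        if mask.sum() < 2:
            continue  # correlation is undefined for a single point
        result[str(name)] = float(np.corrcoef(targets[mask], predictions[mask])[0, 1])
    return result


# Toy values chosen so the expected correlations are obvious.
scores = correlation_by_cell_type(
    cell_types=["Microglia"] * 3 + ["Astrocytes"] * 3,
    targets=[1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    predictions=[2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
)
print(scores)
```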
327+
328+
### Data Availability
329+
330+
The analysis scripts reference several data sources:
331+
- Single-cell RNA-seq count matrices
332+
- APA usage quantification results
333+
- Cell type annotations
334+
- RBP expression profiles
335+
- Reference genome and annotations
336+
337+
Please ensure you have access to the appropriate datasets before running the analysis scripts.
338+
339+
### Citation
340+
341+
If you use this analysis pipeline, please cite our paper:
342+
```
343+
[Paper citation to be added upon publication]
344+
```
345+
346+
For questions about the analysis pipeline, please open an issue in the GitHub repository.
48347

analysis_and_figures/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
MIT License
2+
3+
Copyright (c) 2024 Aiden M Sababi
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.
