This project demonstrates how to use diffusion models for generating nanobody (single-domain antibody) amino acid sequences. It is designed as a research resource, providing a full pipeline from data preparation to model training, prediction, and visualization. Key features in this project include:
- Preparing and processing amino acid sequence data for machine learning
- Training a diffusion model to generate nanobody sequences
- Generating new nanobody amino acid sequences from random data using the trained model
- Visualizing sequence data and model outputs with the included scripts
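To make the idea of generating sequences "from random data" more concrete, here is a minimal sketch of the forward (noising) step used in standard DDPM-style Gaussian diffusion, applied to a one-hot-encoded sequence. It is a generic illustration, not this project's implementation: the linear schedule, its default values, the toy 4-letter alphabet, and all function names are assumptions made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule beta_1..beta_T (default values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).

    x0 is a list of rows (one per residue); t is a 0-based step index.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [[signal * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x0]


# Toy one-hot input: 3 residues over a 4-letter alphabet (real data would use 20).
x0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x_early = add_noise(x0, 0, alpha_bars, rng)    # almost identical to x0
x_late = add_noise(x0, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

Under this scheme a model is trained to undo the noising, so that sampling can start from random noise and be denoised step by step into a sequence-like output.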
Figure: AlphaFold-predicted structure of a generated nanobody sequence
Figure: Amino acid frequency comparison between generated and test sequences
Figure: Comprehensive frequency analysis across all amino acids
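The amino acid frequency comparisons captioned above can be approximated with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the project's own visualization script: the function names are hypothetical, and the sequences are made-up placeholders, not real model output.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(sequences):
    """Relative frequency of each standard amino acid across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore anything that is not a standard residue (gaps, X, whitespace).
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def frequency_differences(generated, reference):
    """Per-residue difference (generated minus reference) in relative frequency."""
    gen = amino_acid_frequencies(generated)
    ref = amino_acid_frequencies(reference)
    return {aa: gen[aa] - ref[aa] for aa in AMINO_ACIDS}


# Placeholder sequences for illustration only (not real model output).
generated = ["QVQLVESGGG", "EVQLVESGGG"]
test_set = ["QVQLQESGGG", "QVQLVESGGA"]

diffs = frequency_differences(generated, test_set)
for aa in AMINO_ACIDS:
    if abs(diffs[aa]) > 1e-12:
        print(f"{aa}: {diffs[aa]:+.3f}")
```

The resulting dictionaries can be passed to any plotting library to draw side-by-side bars for generated and test sequences.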
The easiest way to get started is to follow these steps:
- Clone this project
  git clone https://github.com/nathangendler/Nanobody_Diffusion
  cd Nanobody_Diffusion
- Set up your Python environment
- Install dependencies
  - Dependencies are listed in the requirements.txt file
    pip install -r requirements.txt
- Prepare your data
  - Place your amino acid sequence data (e.g., FASTA files) in the data/ directory and edit the path in diffusion/diffusion_train.py
- Train the diffusion model
  - Run the training script:
    python diffusion/diffusion_train.py
- Predict new sequences
  - Use the prediction script:
    python diffusion/diffusion_predict.py
- Visualize results
  - Explore the output/ and diffusion/visual/ directories for visualization tools and results.
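For the data preparation step above, it may help to see what turning FASTA text into model-ready input can look like. The sketch below parses FASTA records and one-hot encodes each sequence to a fixed length. It is only an illustration: the helper names, the padding scheme, the `max_len` value, and the example records are assumptions, and the project's own preprocessing in diffusion/diffusion_train.py may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def one_hot_encode(sequence, max_len):
    """Encode a sequence as max_len rows of 20 values, padding with all-zero rows."""
    rows = []
    for ch in sequence[:max_len]:
        row = [0.0] * len(AMINO_ACIDS)
        if ch in AA_INDEX:  # non-standard residues (e.g. X) stay all-zero
            row[AA_INDEX[ch]] = 1.0
        rows.append(row)
    rows.extend([0.0] * len(AMINO_ACIDS) for _ in range(max_len - len(rows)))
    return rows


# Hypothetical records for illustration; real input would be read from a file in data/.
example = """>nb_example_1 hypothetical record
QVQLVESGGG
LVQAGG
>nb_example_2
EVQLVESG
"""

records = parse_fasta(example)
encoded = [one_hot_encode(seq, max_len=20) for _, seq in records]
```

Padding every sequence to the same length keeps the encoded inputs rectangular, which is what most training loops expect; the right `max_len` depends on the length distribution of your own data.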
This project is intended as a starting point for amino acid sequence generation research. You are encouraged to fork, clone, and modify the codebase for your own experiments. The modular structure makes it easy to:
- Adjust model architectures in diffusion/src/
- Extend training and prediction scripts
- Integrate new visualization tools