CNN ensemble model for protein activity prediction using ESM-2 embeddings.
- Python 3.12.1
- pip
Note: All versions of Python and packages are pinned to the exact versions used during analysis for reproducibility.
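Because the versions are pinned, it can help to confirm the interpreter before installing anything. The following is a minimal sketch, not part of the repository: the helper name `matches_pinned` is hypothetical, and the only value taken from this README is the pinned version 3.12.1.

```python
import sys

# Pinned interpreter version, taken from the requirements listed above.
PINNED_PYTHON = (3, 12, 1)


def matches_pinned(version_info=None, pinned=PINNED_PYTHON):
    """Return True when major.minor.micro equals the pinned version.

    `matches_pinned` is a hypothetical helper written for this sketch.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) == tuple(pinned)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_pinned():
        print(f"Python {current} matches the pinned version.")
    else:
        print(f"Python {current} differs from the pinned 3.12.1; "
              "results may not reproduce exactly.")
```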
- Clone the repository:
git clone https://github.com/Arcadia-Science/2025-GFP-variant-design.git
cd 2025-GFP-variant-design
- Install dependencies:
pip install -e .
- Download the ESM-2 embeddings from https://zenodo.org/records/17088257, and place the file in the data directory.
Trains a CNN ensemble model on protein sequence embeddings and activity scores.
Generates protein sequence variants and predicts their activities using the trained model.
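The generate-and-predict idea described above can be illustrated with a toy sketch. It rests on assumptions this README does not confirm: that variants are single amino acid substitutions, and that the ensemble combines its members by averaging. The member functions here are hypothetical stand-ins that score raw sequences; the actual notebook uses trained CNNs operating on ESM-2 embeddings.

```python
from statistics import mean, stdev

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitution_variants(sequence):
    """Yield (label, variant) for every single amino acid substitution.

    Assumption: the real notebook may generate variants differently.
    """
    for pos, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                label = f"{original}{pos + 1}{aa}"  # e.g. "M1A", 1-based position
                yield label, sequence[:pos] + aa + sequence[pos + 1:]


def ensemble_predict(members, variant):
    """Return (mean, standard deviation) of the members' predictions.

    Assumption: averaging is one common way to combine an ensemble;
    the repository's combination rule is not stated in this README.
    """
    scores = [member(variant) for member in members]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Hypothetical toy members standing in for trained CNN models.
toy_members = [
    lambda seq: seq.count("A") / len(seq),
    lambda seq: seq.count("G") / len(seq),
    lambda seq: 0.5,
]

# Score every variant and keep the one with the highest mean prediction.
ranked = sorted(
    ((label, *ensemble_predict(toy_members, variant))
     for label, variant in single_substitution_variants("MSK")),
    key=lambda row: row[1],
    reverse=True,
)
print(ranked[0])
```

The spread across members gives a rough sense of how much the ensemble agrees on a given variant, which is one reason to train several models rather than one.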
Creates visualizations from the results.
Extracts fluorescence readings from raw plate reader data found in the experimental_data directory, analyzes the data, and creates visualizations for the protein variants we analyzed in the lab.
Generates figures showing dual-channel microscopy images of E. coli expressing GFP variants, providing a representative view of fluorescence properties within the broader dataset.
Generates embeddings and metadata from seq_and_score.csv with progress tracking.
Run notebooks in order: 01_train_model.ipynb → 02_generate_sequences.ipynb → 03_create_figures.ipynb
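For unattended runs, the same order can be scripted. This is a minimal sketch that assumes `jupyter` with `nbconvert` is installed and on the PATH (this README does not list it as a dependency) and that the notebooks live in the directory you pass in; `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path

# Notebook names and their order are taken from the instruction above.
NOTEBOOKS = [
    "01_train_model.ipynb",
    "02_generate_sequences.ipynb",
    "03_create_figures.ipynb",
]


def build_commands(notebook_dir="."):
    """Build one `jupyter nbconvert` command per notebook, in run order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",   # run all cells
            "--inplace",   # write outputs back into the same file
            str(Path(notebook_dir) / name),
        ]
        for name in NOTEBOOKS
    ]


def run_all(notebook_dir="."):
    """Execute the notebooks sequentially, stopping at the first failure."""
    for command in build_commands(notebook_dir):
        subprocess.run(command, check=True)


# Example (the directory is an assumption; adjust to where the notebooks are):
# run_all(".")
```

Using `check=True` stops the sequence as soon as one notebook fails, so a later notebook never runs against missing outputs from an earlier one.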
Converts seq_and_score.csv to full embeddings dataset with ESM-2 embeddings and metadata columns.
Some systems need certificates for embedding generation:
apt update && apt install -y ca-certificates && update-ca-certificates
Usage:
python create_embeddings_from_seq_score.py
Performance Notes:
- Original embeddings (~50k sequences) were generated on an H100 GPU in ~9 hours
- Uses GPU-to-CPU memory offloading, allowing the ESM-2 15B model to run on GPUs with limited VRAM by storing model parameters in CPU memory and transferring them to GPU as needed during computation
- Can run on smaller GPUs, but takes significantly longer (e.g., ~35 hours on an A10G)
- Requires significant computational resources and a CUDA-compatible GPU
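A back-of-the-envelope estimate shows why offloading matters for a model of this size. The sketch below counts weight storage only and ignores activations and framework overhead. The numeric precision used by the script is not stated in this README, so both common options are shown, and the 15e9 parameter count is an approximation implied by the model name.

```python
# Approximate parameter count implied by the "15B" model name (assumption).
N_PARAMS = 15e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(n_params, dtype):
    """Memory needed to hold the model weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    gib = weight_memory_gib(N_PARAMS, dtype)
    print(f"{dtype}: ~{gib:.0f} GiB of weights")
```

Even in half precision the weights come to roughly 28 GiB, which is more than the 24 GB of VRAM on an A10G. Keeping parameters in CPU memory and moving them to the GPU as needed trades speed for the ability to run at all, which is consistent with the longer runtime noted above.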