Boyi Wei1,2*†, Zora Che1,3*†, Nathaniel Li1†, Jasper Götting4, Samira Nedungadi4, Julian Michael1†, Summer Yue1†, Dan Hendrycks5, Peter Henderson2, Zifan Wang1†, Seth Donoughe4, Mantas Mazeika5
*Equal Contribution †Work done while at Scale AI
1Scale AI 2Princeton University 3University of Maryland 4SecureBio 5Center for AI Safety
This repository is built on top of the BioNeMo Framework, with two additional modules:
- `bioriskeval/`: an evaluation framework for assessing the dual-use risk of bio-foundation models.
- `attack/`: scripts for fine-tuning. Scripts for probing are included in `bioriskeval/vir` and `bioriskeval/mut`.
Following the same steps as in Getting Started with BioNeMo Framework, you can run the following script to clone the repository and build the docker image:

```shell
git clone --recursive [email protected]:boyiwei/BioRiskEval.git
cd BioRiskEval
```
Use the following script to download the BioRiskEval dataset:

```shell
cd bioriskeval
bash download_data.sh
```
You may need to request access to the Hugging Face dataset before downloading. The script downloads BioRiskEval-Gen into `bioriskeval/gen/data/`, BioRiskEval-Mut into `bioriskeval/mut/data/`, and BioRiskEval-Vir into `bioriskeval/vir/data/`. For BioRiskEval-Mut, we provide two sets of data: `DMS_ProteinGym_substitutions` and `DMS_Probe`. `DMS_ProteinGym_substitutions` contains 16 DMS datasets collected from ProteinGym and is used for log-likelihood-based evaluation. `DMS_Probe` is used for probe-based evaluation; you can also generate it by running `dms/probe/create_dms_probe_dataset.py`.
With a locally cloned repository and initialized submodules, build the container using:

```shell
docker buildx build . -t my-container-tag
```
We distribute a development container configuration for VSCode (`.devcontainer/devcontainer.json`) that simplifies local testing and development. Opening the BioRiskEval folder in VSCode should prompt you to re-open the folder inside the devcontainer environment.
Note
The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally (using the command shown above) will ensure that most of the layers are present in the local docker cache.
We highly recommend running the experiments on H100 GPUs.
After installation, the BioNeMo Framework first needs to convert the Evo2-Vortex checkpoint to a NeMo2 checkpoint. This can be done by running the following scripts:
7b-1M:

```shell
evo2_convert_to_nemo2 \
  --model-path hf://arcinstitute/savanna_evo2_7b \
  --model-size 7b_arc_longcontext --output-dir /your/checkpoint/dir/nemo2_evo2_7b_1m
```

40b-1M:

```shell
evo2_convert_to_nemo2 \
  --model-path hf://arcinstitute/savanna_evo2_40b \
  --model-size 40b_arc_longcontext --output-dir /your/checkpoint/dir/nemo2_evo2_40b_1m
```
The hierarchy of BioRiskEval is:
- BioRiskEval-Gen (`bioriskeval/gen`): sequence modeling evaluation. Metric: perplexity.
- BioRiskEval-Mut (`bioriskeval/mut`): mutational effect prediction evaluation. Metric: |Spearman correlation $\rho$|.
- BioRiskEval-Vir (`bioriskeval/vir`): virulence prediction evaluation. Metrics: Pearson correlation, $R^2$.
The workflow of BioRiskEval-Gen is:
- Sample examples from `human_host_df.csv`, specifying the species name, genus name, or family name.
- The script gathers the sampled accession IDs and downloads the sequences from NCBI.
- `eval_ppl.py` computes the perplexity on the downloaded sequences and saves the results in `results/`.
We provide an example script `bioriskeval/gen/bioriskeval_gen.sh` for a quick start.
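Conceptually, the perplexity computed in the last step is the exponentiated mean negative log-likelihood over nucleotide tokens. A minimal sketch (the function name and toy log-probabilities below are illustrative, not part of the repo):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: a model assigning probability 0.25 to each of the 4
# nucleotides (uniform over A/C/G/T) has perplexity exactly 4.
uniform = [math.log(0.25)] * 100
print(perplexity(uniform))  # 4.0 (up to float rounding)
```

Lower perplexity means the model assigns higher likelihood to the evaluated sequences.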
The workflow of BioRiskEval-Mut under the zero-shot/log-likelihood setting is:
- Process protein sequences for DMS (Deep Mutational Scanning) into nucleotides with `nucleotide_data_pipeline.py`.
- `eval_fitness.py` calculates log-likelihood-based scores for autoregressive genomic models on mutated sequences; the Spearman correlation with the ground-truth experimental fitness is reported for each DMS.
- `eval_fitness_esm2.py` calculates scores with masked marginals for ESM2 protein models.
We provide an example script `bioriskeval/mut/bioriskeval_mut_logprob.sh` for a quick start.
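The reported metric correlates per-mutant model scores with experimental fitness. A minimal sketch with toy stand-in numbers (the score and fitness values below are made up for illustration; `scipy.stats.spearmanr` is the standard tool):

```python
from scipy.stats import spearmanr

# Toy stand-ins: per-mutant model log-likelihood scores and the
# corresponding experimental fitness measurements from a DMS.
model_scores = [-1.2, -0.4, -2.5, -0.9, -3.1]
exp_fitness  = [ 0.8,  1.5,  0.1,  1.0,  0.05]

rho, _ = spearmanr(model_scores, exp_fitness)
print(abs(rho))  # 1.0 here, since the two rankings agree perfectly
```

Spearman correlation is rank-based, so it rewards a model that orders mutants correctly even if its scores are not calibrated to fitness units.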
The workflow of BioRiskEval-Mut under the probe setting is:
- Pick $k$ mutations from each DMS to fit linear probes. Within the $k$ mutations, 80% are used for fitting and 20% as the validation split; the rest of the data is used as the test split. `create_dms_probe_dataset.py` creates the splits and saves representations for the train and validation splits.
- Sweep over all layers with `sweep_dms_probe.py` to find the best layer for fitting the linear probe. The best probes based on train RMSE or validation-split Spearman correlation are saved.
- Save test representations for the best layer with `probe_layer_utils.py` and `create_dms_probe_dataset.py`. Evaluate saved probes with `test_dms_probe.py`.
We provide an example script `bioriskeval/mut/bioriskeval_mut_probe.sh` for a quick start.
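The probe setting above fits a linear model on frozen hidden representations. A minimal sketch with synthetic data standing in for per-mutant representations and fitness (the split sizes and `Ridge` regressor are illustrative choices, not necessarily what the repo scripts use):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 mutants with 32-dim hidden representations,
# and fitness that is (noisily) linear in the representation.
X = rng.normal(size=(200, 32))
y = X @ rng.normal(size=32) + 0.1 * rng.normal(size=200)

# k = 100 picked mutations: 80% fit, 20% validation; the rest is test.
train, val, test = np.arange(0, 80), np.arange(80, 100), np.arange(100, 200)

probe = Ridge(alpha=1.0).fit(X[train], y[train])
val_rho, _ = spearmanr(probe.predict(X[val]), y[val])
test_rho, _ = spearmanr(probe.predict(X[test]), y[test])
print(val_rho, test_rho)  # both high on this easy synthetic task
```

In the actual pipeline, this fit is repeated per layer and the layer with the best train RMSE or validation Spearman is kept.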
The workflow of BioRiskEval-Vir is:
- Extract hidden-layer representations and create a train-test split (7:3) for probing.
- Train a linear probe on the train set, and evaluate on the test set. `train_probe_continuous.py` trains the linear probe and evaluates its performance on the test set. It also uploads the results to Weights & Biases and dumps them to a CSV file.
We provide an example script `bioriskeval/vir/bioriskeval_vir.sh` for a quick start.
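The virulence probe reports Pearson correlation and $R^2$ on the held-out 30%. A minimal sketch with synthetic stand-in data (the regressor choice and data shapes are illustrative assumptions):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: hidden-layer representations and continuous
# virulence labels that are (noisily) linear in the representation.
X = rng.normal(size=(300, 64))
y = X @ rng.normal(size=64) + 0.1 * rng.normal(size=300)

# 7:3 train-test split, as in the workflow above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LinearRegression().fit(X_tr, y_tr)
pred = probe.predict(X_te)

r, _ = pearsonr(pred, y_te)
print(r, r2_score(y_te, pred))  # both near 1 on this easy synthetic task
```

Pearson $r$ measures linear association between predictions and labels, while $R^2$ additionally penalizes miscalibrated scale and offset.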
Inside `attack/`, we have the scripts for fine-tuning.
The workflow of fine-tuning is:
1. Prepare a CSV file with accession IDs in the column `#Accession`.
2. Convert the CSV file to an FNA file using `convert_csv_to_fna.py`.
3. Create the train-validation split and tokenize the data.
4. Create the dataset config for fine-tuning.
5. Fine-tune the model.
We provide example scripts in `attack/data/preprocess_ft_data.sh` (data preprocessing, steps 1-4) and `attack/ft/launch_ft_7b_1m.sh` (fine-tuning the model, step 5) for a quick start; you can modify the CSV file path and change the preprocessing config there.
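The first two preprocessing steps amount to reading the `#Accession` column and emitting one FASTA record per accession. A minimal sketch of that shape (the `csv_to_fna` helper, toy accession IDs, and in-memory `sequences` mapping are all hypothetical; the repo's `convert_csv_to_fna.py` fetches real sequences):

```python
import csv
import io

def csv_to_fna(csv_text, sequences):
    """Emit a FASTA (.fna) record per ID in the '#Accession' column.
    `sequences` maps accession ID -> nucleotide sequence."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(csv_text)):
        acc = row["#Accession"]
        out.write(f">{acc}\n{sequences[acc]}\n")
    return out.getvalue()

# Toy inputs: two made-up accession IDs with short dummy sequences.
csv_text = "#Accession,host\nNC_000001,human\nNC_000002,human\n"
seqs = {"NC_000001": "ATGC", "NC_000002": "GGTA"}
print(csv_to_fna(csv_text, seqs), end="")
```

The resulting `.fna` file is what the later split/tokenize steps consume.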
The workflow of probing is:
- Extract the hidden-layer representations from the model
- Train a linear probe on the train set
- Evaluate the probe on the test set
Refer to the example script `bioriskeval/vir/bioriskeval_vir.sh` for a quick start.
We document the results in `attack/analysis/`, which contains the raw results and scripts for analysis and plotting.