Boyi Wei1,2*†, Zora Che1,3*†, Nathaniel Li1†, Jasper Götting4, Samira Nedungadi4, Julian Michael1†, Summer Yue1†, Dan Hendrycks5, Peter Henderson2, Zifan Wang1†, Seth Donoughe4, Mantas Mazeika5
*Equal Contribution †Work done while at Scale AI
1Scale AI 2Princeton University 3University of Maryland 4SecureBio 5Center for AI Safety
This repository is built on top of the BioNeMo Framework, with two additional modules:
- `bioriskeval/`: an evaluation framework for assessing the dual-use risk of bio-foundation models.
- `attack/`: scripts for fine-tuning. Scripts for probing are included in `bioriskeval/vir` and `bioriskeval/mut`.
Following the same steps as in Getting Started with BioNeMo Framework, you can run the following script to clone the repository and build the docker image:

```shell
git clone --recursive [email protected]:boyiwei/BioRiskEval.git
cd BioRiskEval
```
Use the following script to download the BioRiskEval dataset:

```shell
cd bioriskeval
bash download_data.sh
```
You may need to request access to the Hugging Face dataset before downloading. The script downloads BioRiskEval-Gen into `bioriskeval/gen/data/`, BioRiskEval-Mut into `bioriskeval/mut/data/`, and BioRiskEval-Vir into `bioriskeval/vir/data/`. For BioRiskEval-Mut, we provide two sets of data: `DMS_ProteinGym_substitutions` and `DMS_Probe`. `DMS_ProteinGym_substitutions` contains 16 DMS datasets collected from ProteinGym and is used for log-likelihood-based evaluation. `DMS_Probe` is used for probe-based evaluation; you can also generate it by running `dms/probe/create_dms_probe_dataset.py`.
With a locally cloned repository and initialized submodules, build the container using:

```shell
docker buildx build . -t my-container-tag
```
We distribute a development container configuration for VSCode (`.devcontainer/devcontainer.json`) that simplifies local testing and development. Opening the BioRiskEval folder in VSCode should prompt you to re-open the folder inside the devcontainer environment.
Note
The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally (using the command shown above) will ensure that most of the layers are present in the local docker cache.
We highly recommend running the experiments on H100 GPUs.
After installation, the BioNeMo Framework first needs to convert the Evo2-Vortex checkpoint to a NeMo2 checkpoint. This can be done by running the following scripts:
7b-1M:

```shell
evo2_convert_to_nemo2 \
  --model-path hf://arcinstitute/savanna_evo2_7b \
  --model-size 7b_arc_longcontext --output-dir /your/checkpoint/dir/nemo2_evo2_7b_1m
```

40b-1M:

```shell
evo2_convert_to_nemo2 \
  --model-path hf://arcinstitute/savanna_evo2_40b \
  --model-size 40b_arc_longcontext --output-dir /your/checkpoint/dir/nemo2_evo2_40b_1m
```
The hierarchy of BioRiskEval is:
- BioRiskEval-Gen (`bioriskeval/gen`): sequence modeling evaluation. Metric: perplexity.
- BioRiskEval-Mut (`bioriskeval/mut`): mutational effect prediction evaluation. Metric: |Spearman correlation $\rho$|.
- BioRiskEval-Vir (`bioriskeval/vir`): virulence prediction evaluation. Metrics: Pearson correlation, $R^2$.
The workflow of BioRiskEval-Gen is:
- Sample examples from `human_host_df.csv`, specifying the species name, genus name, or family name.
- The script gathers the sampled accession IDs and downloads the sequences from NCBI.
- `eval_ppl.py` computes the perplexity on the downloaded sequences and saves the results in `results/`.
We provide an example script `bioriskeval/gen/bioriskeval_gen.sh` for a quick start.
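Conceptually, the perplexity computed in the last step is the exponentiated mean negative log-likelihood over nucleotide tokens. A minimal sketch (the function name and toy log-probabilities below are illustrative, not part of the repo):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: a model assigning probability 0.25 to each of the 4
# nucleotides (uniform over A/C/G/T) has perplexity exactly 4.
uniform = [math.log(0.25)] * 100
print(perplexity(uniform))  # 4.0 (up to float rounding)
```

Lower perplexity means the model assigns higher likelihood to the evaluated sequences.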
The workflow of BioRiskEval-Mut under the zero-shot/log-likelihood setting is:
- Process protein sequences for DMS (Deep Mutational Scanning) into nucleotides with `nucleotide_data_pipeline.py`.
- `eval_fitness.py` calculates log-likelihood-based scores for autoregressive genomic models on mutated sequences; the Spearman correlation with the ground-truth experimental fitness is reported for each DMS.
- `eval_fitness_esm2.py` calculates scores with masked marginals for ESM2 protein models.
We provide an example script `bioriskeval/mut/bioriskeval_mut_logprob.sh` for a quick start.
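The reported metric correlates per-mutant model scores with experimental fitness. A minimal sketch with toy stand-in numbers (the score and fitness values below are made up for illustration; `scipy.stats.spearmanr` is the standard tool):

```python
from scipy.stats import spearmanr

# Toy stand-ins: per-mutant model log-likelihood scores and the
# corresponding experimental fitness measurements from a DMS.
model_scores = [-1.2, -0.4, -2.5, -0.9, -3.1]
exp_fitness  = [ 0.8,  1.5,  0.1,  1.0,  0.05]

rho, _ = spearmanr(model_scores, exp_fitness)
print(abs(rho))  # 1.0 here, since the two rankings agree perfectly
```

Spearman correlation is rank-based, so it rewards a model that orders mutants correctly even if its scores are not calibrated to fitness units.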
The workflow of BioRiskEval-Mut under the probe setting is:
- Pick $k$ mutations from each DMS to fit linear probes. Within the $k$ mutations, 80% are used for fitting and 20% as the validation split; the rest of the data is used as the test split. `create_dms_probe_dataset.py` creates the splits and saves representations for the train and validation splits.
- Sweep over all layers with `sweep_dms_probe.py` to find the best layer for fitting the linear probe. The best probes based on train RMSE or validation-split Spearman correlation are saved.
- Save test representations for the best layer with `probe_layer_utils.py` and `create_dms_probe_dataset.py`. Evaluate saved probes with `test_dms_probe.py`.
We provide an example script `bioriskeval/mut/bioriskeval_mut_probe.sh` for a quick start.
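The probe setting above fits a linear model on frozen hidden representations. A minimal sketch with synthetic data standing in for per-mutant representations and fitness (the split sizes and `Ridge` regressor are illustrative choices, not necessarily what the repo scripts use):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 mutants with 32-dim hidden representations,
# and fitness that is (noisily) linear in the representation.
X = rng.normal(size=(200, 32))
y = X @ rng.normal(size=32) + 0.1 * rng.normal(size=200)

# k = 100 picked mutations: 80% fit, 20% validation; the rest is test.
train, val, test = np.arange(0, 80), np.arange(80, 100), np.arange(100, 200)

probe = Ridge(alpha=1.0).fit(X[train], y[train])
val_rho, _ = spearmanr(probe.predict(X[val]), y[val])
test_rho, _ = spearmanr(probe.predict(X[test]), y[test])
print(val_rho, test_rho)  # both high on this easy synthetic task
```

In the actual pipeline, this fit is repeated per layer and the layer with the best train RMSE or validation Spearman is kept.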
The workflow of BioRiskEval-Vir is:
- Extract hidden-layer representations and create a train-test split (7:3) for probing.
- Train a linear probe on the train set, and evaluate on the test set. `train_probe_continuous.py` trains the linear probe and evaluates its performance on the test set. It also uploads the results to Weights & Biases and dumps them to a CSV file.
We provide an example script `bioriskeval/vir/bioriskeval_vir.sh` for a quick start.
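The virulence probe reports Pearson correlation and $R^2$ on the held-out 30%. A minimal sketch with synthetic stand-in data (the regressor choice and data shapes are illustrative assumptions):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: hidden-layer representations and continuous
# virulence labels that are (noisily) linear in the representation.
X = rng.normal(size=(300, 64))
y = X @ rng.normal(size=64) + 0.1 * rng.normal(size=300)

# 7:3 train-test split, as in the workflow above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LinearRegression().fit(X_tr, y_tr)
pred = probe.predict(X_te)

r, _ = pearsonr(pred, y_te)
print(r, r2_score(y_te, pred))  # both near 1 on this easy synthetic task
```

Pearson $r$ measures linear association between predictions and labels, while $R^2$ additionally penalizes miscalibrated scale and offset.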
Inside `attack/`, we have the scripts for fine-tuning.
The workflow of fine-tuning is:
1. Prepare a CSV file with accession IDs in the column `#Accession`.
2. Convert the CSV file to an FNA file using `convert_csv_to_fna.py`.
3. Create the train-validation split and tokenize the data.
4. Create the dataset config for fine-tuning.
5. Fine-tune the model.
We provide example scripts in `attack/data/preprocess_ft_data.sh` (data preprocessing, steps 1-4) and `attack/ft/launch_ft_7b_1m.sh` (fine-tuning the model, step 5) for a quick start; you can modify the CSV file path and change the preprocessing config there.
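The first two preprocessing steps amount to reading the `#Accession` column and emitting one FASTA record per accession. A minimal sketch of that shape (the `csv_to_fna` helper, toy accession IDs, and in-memory `sequences` mapping are all hypothetical; the repo's `convert_csv_to_fna.py` fetches real sequences):

```python
import csv
import io

def csv_to_fna(csv_text, sequences):
    """Emit a FASTA (.fna) record per ID in the '#Accession' column.
    `sequences` maps accession ID -> nucleotide sequence."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(csv_text)):
        acc = row["#Accession"]
        out.write(f">{acc}\n{sequences[acc]}\n")
    return out.getvalue()

# Toy inputs: two made-up accession IDs with short dummy sequences.
csv_text = "#Accession,host\nNC_000001,human\nNC_000002,human\n"
seqs = {"NC_000001": "ATGC", "NC_000002": "GGTA"}
print(csv_to_fna(csv_text, seqs), end="")
```

The resulting `.fna` file is what the later split/tokenize steps consume.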
The workflow of probing is:
- Extract the hidden-layer representations from the model
- Train a linear probe on the train set
- Evaluate the probe on the test set
Refer to the example script `bioriskeval/vir/bioriskeval_vir.sh` for a quick start.
We document the results in `attack/analysis/`, which contains the raw results and scripts for analysis and plotting.