Skip to content
Merged
Show file tree
Hide file tree
Changes from 42 commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
93b78e3
Add summary to contributing.md
Jul 28, 2025
fdc3431
fix the tone of the contributing summary
Jul 28, 2025
ec21e17
Fix conflict
Jul 28, 2025
0095ccc
Merge branch 'main' of https://github.com/nvidia/bionemo-framework in…
Aug 1, 2025
c563659
Initial scdl schema implementation
Aug 1, 2025
56e6998
Merge branch 'main' of https://github.com/nvidia/bionemo-framework in…
Aug 6, 2025
ca0bf25
Update the readme with developer instructions
Aug 6, 2025
e8f40e7
Refactor memmap dataset tests so that testing of neighbor functions l…
Aug 6, 2025
5877457
Add a magic number
Aug 6, 2025
6b8d1a8
Add the draft schema
Aug 6, 2025
4667ee9
Add the version module
Aug 6, 2025
0a7a7e0
Fix dependencies
Aug 6, 2025
8ca871e
Fix a minor typo in docstring.
Aug 6, 2025
c1bcda7
Add the header implementation, the backing support module, and relate…
Aug 6, 2025
a6b87c5
Add header API docs
Aug 6, 2025
777def5
Update version, tests and docs
Aug 14, 2025
a57b905
Update the version and integrate the header
Aug 14, 2025
ebca4d7
Merge branch 'main' of https://github.com/nvidia/bionemo-framework in…
Aug 14, 2025
35af2b2
Unstage changes to contribtuting.md
Aug 14, 2025
58a0323
Unstage changes to contribtuting.md
Aug 14, 2025
fee36d3
Move bionemo core to dev dep
Aug 14, 2025
ecd5f14
Sync docs on endianness to NETWORK, not little
Aug 14, 2025
7c5084f
Add header to load
Aug 14, 2025
6f16b25
Move the changelog and schema doc to docs.
Aug 14, 2025
76e0d55
Add license + init.py
Aug 15, 2025
e768f60
Merge branch 'main' into edawson/scdl-schema
yzhang123 Aug 15, 2025
cb53ac7
Address linting errors.
Aug 15, 2025
9ecc929
Address test failures by moving paths to strs and making sure header …
Aug 16, 2025
73982f1
Merge branch 'main' of https://github.com/nvidia/bionemo-framework in…
Aug 16, 2025
cac64e7
Address precommit failures
Aug 16, 2025
83e2233
skip flip download dataset for now
yzhang123 Aug 18, 2025
dd1934b
Fix a bug in scdl header saving. Files were saved to data_path, but t…
Aug 18, 2025
1282f93
Change README text reflecting updated dev->test dep configuration.
Aug 18, 2025
a59ce63
Precommit found lint errors so here are the formatted files.
Aug 19, 2025
4622e74
Small Changes for SCDL release (#1051)
polinabinder1 Aug 19, 2025
f942e3c
fix scdl miscrepancy after code update in geneformer notebook (#1054)
yzhang123 Aug 19, 2025
2cbd696
fix precommit
yzhang123 Aug 19, 2025
a720862
fix precommit
yzhang123 Aug 19, 2025
9719cee
Merge branch 'main' into edawson/scdl-schema
polinabinder1 Aug 19, 2025
0f8f1f2
added release notes
yzhang123 Aug 19, 2025
6698414
Merge branch 'edawson/scdl-schema' of https://github.com/NVIDIA/bione…
yzhang123 Aug 19, 2025
33d6fa8
added release notes
yzhang123 Aug 19, 2025
74af02e
Update scdl-schema.md
polinabinder1 Aug 19, 2025
e4acd0f
notebook output
polinabinder1 Aug 20, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 41 additions & 0 deletions docs/docs/main/about/releasenotes-fw.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,46 @@
# Release Notes

## BioNeMo Framework v2.7

### Updates & Improvements

- Adds a header to SCDL archives, providing improved provenance tracking and supporting future releases. Also adds tracking of the AnnData API coverage in SCDL tests.
This header stores metadata about the archive and its composite arrays, including a version, the array lengths and data types, and information about the RowFeatureIndexes. This adds the features necessary to fix https://github.com/NVIDIA/bionemo-framework/issues/999 as well as implement simple bit-packing of the rowptr, colptr, and data arrays. It also should make SCDL more secure, enable strict compatibility checking, and open the door to more performance improvements. https://github.com/NVIDIA/bionemo-framework/pull/1030

## BioNeMo Framework v2.6.3

### Updates & Improvements

- Fixes numerous issues with Evo2 model:
1. Inference/Generation issues resolved. https://github.com/NVIDIA/bionemo-framework/issues/890
2. FP8 training resumption issues resolved. https://github.com/NVIDIA/bionemo-framework/issues/973
3. Bug in inference script that concerns checkpoint loading is fixed. https://github.com/NVIDIA/bionemo-framework/pull/950
- ESM2 LoRA model inference issue resolved. https://github.com/NVIDIA/bionemo-framework/pull/996
- Added experimental evo2-mamba model. https://github.com/NVIDIA/bionemo-framework/pull/888
- Updated base Docker image to [nvidia-pytorch 25.06-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
- NCCL issue in ESM2 pretraing resolved. https://github.com/NVIDIA/bionemo-framework/issues/970

## What's Changed

- Fix test_train_evo2_stops test by @balvisio in https://github.com/NVIDIA/bionemo-framework/pull/965
- Enable test_train_evo2_stop_at_max_steps_and_continue. by @balvisio in https://github.com/NVIDIA/bionemo-framework/pull/966
- automated benchmarks: esm2 650M training analogous to bionemo-recipes by @dorotat-nv in https://github.com/NVIDIA/bionemo-framework/pull/975
- Fix database path in esm2_pretrain_recipes by @pstjohn in https://github.com/NVIDIA/bionemo-framework/pull/978
- Add fp8 stop and go test for evo2 by @jwilber in https://github.com/NVIDIA/bionemo-framework/pull/974
- Update Docs Banner for GitHub Pages-hosted Docs by @tshimko-nv in https://github.com/NVIDIA/bionemo-framework/pull/981
- Add release notes for v2.6.2 (25.06) by @trvachov in https://github.com/NVIDIA/bionemo-framework/pull/971
- Evo2 Generation fixes and necessary base dependency and container updates. Large change. by @jwilber in https://github.com/NVIDIA/bionemo-framework/pull/949
- Point NeMo submodule back to main repo by @trvachov in https://github.com/NVIDIA/bionemo-framework/pull/984
- Use new b2b kernels in evo2 jet tests by @jwilber in https://github.com/NVIDIA/bionemo-framework/pull/985
- change where dtype is found in checkpoint export by @pstjohn in https://github.com/NVIDIA/bionemo-framework/pull/989
- Evo2 Mamba by @jstjohn in https://github.com/NVIDIA/bionemo-framework/pull/888
- Adding inference CDS length tests by @jstjohn in https://github.com/NVIDIA/bionemo-framework/pull/991
- Fix PIL CVE by @trvachov in https://github.com/NVIDIA/bionemo-framework/pull/992
- (BIONEMO-2334) Patch TE to fix Evo2 stop and go training by @balvisio in https://github.com/NVIDIA/bionemo-framework/pull/987
- Fix bug in evo2-mamba train and add test by @jstjohn in https://github.com/NVIDIA/bionemo-framework/pull/994
- Fix esm2 lora inference by @yzhang123 in https://github.com/NVIDIA/bionemo-framework/pull/996
- Reset parameters for the ESM-2 contact head on HF export by @pstjohn in https://github.com/NVIDIA/bionemo-framework/pull/983

## BioNeMo Framework v2.6.2

### Updates & Improvements
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@
import os
from pathlib import Path

import pytest

from bionemo.esm2.model.finetune.flip_preprocess import FLIPPreprocess


Expand All @@ -30,6 +32,7 @@ def test_flip_preprocess_initialization(tmpdir):
assert flip.root_directory == Path(tmpdir)


@pytest.mark.skip(reason="Need to fix the test")
def test_prepare_all_datasets(tmpdir):
"""Test prepare_all_datasets method."""
flip = FLIPPreprocess(root_directory=tmpdir)
Expand All @@ -56,6 +59,7 @@ def test_prepare_all_datasets(tmpdir):
assert os.path.exists(csv_file), f"x000.csv not found in {task}/{split} directory"


@pytest.mark.skip(reason="Need to fix the test")
def test_download_flip_data(tmpdir):
"""Test download_FLIP_data method with slow marker."""
flip = FLIPPreprocess(root_directory=tmpdir)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,7 @@
"['col_ptr.npy',\n",
" 'data.npy',\n",
" 'features',\n",
" 'header.sch',\n",
" 'metadata.json',\n",
" 'row_ptr.npy',\n",
" 'version.json']"
Expand Down Expand Up @@ -1459,7 +1460,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -44,21 +44,21 @@ def test_load_sc_datasets(tmp_path, test_directory_feat_ids):
tokenizer = MagicMock()
sc_memmap_dataset_path0 = tmp_path / "test_data_0"
ds_0 = SingleCellMemMapDataset(
sc_memmap_dataset_path0, h5ad_path=test_directory_feat_ids / "adata_sample0.h5ad"
str(sc_memmap_dataset_path0), h5ad_path=str(test_directory_feat_ids / "adata_sample0.h5ad")
) # create the memmap dataset format from h5ad for testing purposes
dataset0 = SingleCellDataset(sc_memmap_dataset_path0, tokenizer)
dataset0 = SingleCellDataset(str(sc_memmap_dataset_path0), tokenizer)
assert len(dataset0) == len(ds_0) == 8
sc_memmap_dataset_path1 = tmp_path / "test_data_1"
ds_1 = SingleCellMemMapDataset(
sc_memmap_dataset_path1, h5ad_path=test_directory_feat_ids / "adata_sample1.h5ad"
str(sc_memmap_dataset_path1), h5ad_path=str(test_directory_feat_ids / "adata_sample1.h5ad")
) # create the memmap dataset format from h5ad for testing purposes
dataset1 = SingleCellDataset(sc_memmap_dataset_path1, tokenizer)
dataset1 = SingleCellDataset(str(sc_memmap_dataset_path1), tokenizer)
assert len(dataset1) == len(ds_1) == 6
sc_memmap_dataset_path2 = tmp_path / "test_data_2"
ds_2 = SingleCellMemMapDataset(
sc_memmap_dataset_path2, h5ad_path=test_directory_feat_ids / "adata_sample2.h5ad"
str(sc_memmap_dataset_path2), h5ad_path=str(test_directory_feat_ids / "adata_sample2.h5ad")
) # create the memmap dataset format from h5ad for testing purposes
dataset2 = SingleCellDataset(sc_memmap_dataset_path2, tokenizer)
dataset2 = SingleCellDataset(str(sc_memmap_dataset_path2), tokenizer)
assert len(dataset2) == len(ds_2) == 100


Expand All @@ -82,12 +82,12 @@ def test_gene_not_in_tok_vocab(tmp_path, test_directory_feat_ids):
adata.var["feature_id"] = synthetic_ids
adata.write(sc_h5ad_dataset_path0)
SingleCellMemMapDataset(
sc_memmap_dataset_path0, h5ad_path=sc_h5ad_dataset_path0
str(sc_memmap_dataset_path0), h5ad_path=str(sc_h5ad_dataset_path0)
) # create the memmap dataset format from h5ad for testing purposes
preprocessor = GeneformerPreprocess(
download_directory=sc_memmap_dataset_path0,
medians_file_path=sc_memmap_dataset_path0 / "medians.json",
tokenizer_vocab_path=sc_memmap_dataset_path0 / "geneformer.vocab",
download_directory=str(sc_memmap_dataset_path0),
medians_file_path=str(sc_memmap_dataset_path0 / "medians.json"),
tokenizer_vocab_path=str(sc_memmap_dataset_path0 / "geneformer.vocab"),
)
match preprocessor.preprocess():
case {"tokenizer": tokenizer, "median_dict": median_dict}:
Expand All @@ -96,14 +96,14 @@ def test_gene_not_in_tok_vocab(tmp_path, test_directory_feat_ids):
logging.error("Preprocessing failed.")

dataset0 = SingleCellDataset(
sc_memmap_dataset_path0, tokenizer, median_dict=median_dict, include_unrecognized_vocab_in_dataset=True
str(sc_memmap_dataset_path0), tokenizer, median_dict=median_dict, include_unrecognized_vocab_in_dataset=True
) # type: ignore
index = EpochIndex(epoch=0, idx=3)
with pytest.raises(ValueError) as error_info:
dataset0.__getitem__(index)
assert "not in the tokenizer vocab." in str(error_info.value)
dataset0 = SingleCellDataset(
sc_memmap_dataset_path0,
str(sc_memmap_dataset_path0),
tokenizer,
median_dict=median_dict,
) # type: ignore
Expand All @@ -115,12 +115,12 @@ def test_gene_not_in_tok_vocab(tmp_path, test_directory_feat_ids):
def test_empty_gene_data_input(tmp_path, test_directory_feat_ids):
sc_memmap_dataset_path0 = tmp_path / "test_data_0"
SingleCellMemMapDataset(
sc_memmap_dataset_path0, h5ad_path=test_directory_feat_ids / "adata_sample0.h5ad"
str(sc_memmap_dataset_path0), h5ad_path=str(test_directory_feat_ids / "adata_sample0.h5ad")
) # create the memmap dataset format from h5ad for testing purposes
preprocessor = GeneformerPreprocess(
download_directory=sc_memmap_dataset_path0,
medians_file_path=sc_memmap_dataset_path0 / "medians.json",
tokenizer_vocab_path=sc_memmap_dataset_path0 / "geneformer.vocab",
download_directory=str(sc_memmap_dataset_path0),
medians_file_path=str(sc_memmap_dataset_path0 / "medians.json"),
tokenizer_vocab_path=str(sc_memmap_dataset_path0 / "geneformer.vocab"),
)
match preprocessor.preprocess():
case {"tokenizer": tokenizer, "median_dict": median_dict}:
Expand All @@ -139,7 +139,7 @@ def test_empty_gene_data_input(tmp_path, test_directory_feat_ids):

def test_lookup_row(tmp_path, cellx_small_directory):
tokenizer = MagicMock()
dataset = SingleCellDataset(tmp_path / cellx_small_directory / "val", tokenizer)
dataset = SingleCellDataset(str(tmp_path / cellx_small_directory / "val"), tokenizer)
values, feature_ids = dataset.scdl.get_row(0, return_features=True, feature_vars=["feature_id"])
gene_data, col_idxs = values[0], values[1]
assert len(gene_data) == 440
Expand Down Expand Up @@ -169,7 +169,7 @@ def test_get_item_synthetic(tmp_path, test_directory_feat_ids):
case _:
logging.error("Preprocessing failed.")
dataset0 = SingleCellDataset(
sc_memmap_dataset_path0,
str(sc_memmap_dataset_path0),
tokenizer,
median_dict=median_dict,
mask_token_prob=0,
Expand All @@ -188,17 +188,17 @@ def test_get_item_synthetic(tmp_path, test_directory_feat_ids):

def test_GeneformerDataset_changes_with_epoch(tmp_path, cellx_small_directory):
preprocessor = GeneformerPreprocess(
download_directory=tmp_path / cellx_small_directory / "val",
medians_file_path=tmp_path / cellx_small_directory / "val" / "medians.json",
tokenizer_vocab_path=tmp_path / cellx_small_directory / "val" / "geneformer.vocab",
download_directory=str(tmp_path / cellx_small_directory / "val"),
medians_file_path=str(tmp_path / cellx_small_directory / "val" / "medians.json"),
tokenizer_vocab_path=str(tmp_path / cellx_small_directory / "val" / "geneformer.vocab"),
)
match preprocessor.preprocess():
case {"tokenizer": tokenizer, "median_dict": median_dict}:
logging.info("*************** Preprocessing Finished ************")
case _:
logging.error("Preprocessing failed.")
genformer_ds = SingleCellDataset(
tmp_path / cellx_small_directory / "val",
str(tmp_path / cellx_small_directory / "val"),
tokenizer, # type: ignore
median_dict=median_dict, # type: ignore
) # type: ignore
Expand All @@ -212,17 +212,17 @@ def test_GeneformerDataset_changes_with_epoch(tmp_path, cellx_small_directory):

def test_get_item_cellx(tmp_path, cellx_small_directory):
preprocessor = GeneformerPreprocess(
download_directory=tmp_path / cellx_small_directory / "val",
medians_file_path=tmp_path / cellx_small_directory / "val" / "medians.json",
tokenizer_vocab_path=tmp_path / cellx_small_directory / "val" / "geneformer.vocab",
download_directory=str(tmp_path / cellx_small_directory / "val"),
medians_file_path=str(tmp_path / cellx_small_directory / "val" / "medians.json"),
tokenizer_vocab_path=str(tmp_path / cellx_small_directory / "val" / "geneformer.vocab"),
)
match preprocessor.preprocess():
case {"tokenizer": tokenizer, "median_dict": median_dict}:
logging.info("*************** Preprocessing Finished ************")
case _:
logging.error("Preprocessing failed.")
ds = SingleCellDataset(
tmp_path / cellx_small_directory / "val",
str(tmp_path / cellx_small_directory / "val"),
tokenizer, # type: ignore
median_dict=median_dict, # type: ignore
mask_prob=0,
Expand Down
35 changes: 29 additions & 6 deletions sub-packages/bionemo-scdl/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -163,13 +163,9 @@ convert_h5ad_to_scdl --data-path hdf5s --save-path example_dataset

## Runtimes with SCDL

The runtime and memory usage are examined on a CellXGene Dataset with ~1.5 million rows and a size of 24 GB. On this dataset, there is a 4.9x memory speed up.
The runtime is examined on the Tahoe 100M dataset, which containes over 100 million rows. On this dataset, there is either a 12x or 53x speed up depending on the machine used.

![Throughput Image](https://raw.githubusercontent.com/NVIDIA/bionemo-framework/main/sub-packages/bionemo-scdl/assets/throughput.png)

Additionally, the peak memory usage when iterating over the datasets with the SCDL dataloader is only 36.5 MB, since the whole dataset is never loaded into memory due to the numpy memomory-mapped backing.

![Memory Image](https://raw.githubusercontent.com/NVIDIA/bionemo-framework/main/sub-packages/bionemo-scdl/assets/disk_space.png)
![Throughput](https://raw.githubusercontent.com/NVIDIA/bionemo-framework/pbinder/scdl_add_to_edawson/sub-packages/bionemo-scdl/assets/tahoe_throughput.png)

### Using Neighbor Information in Single Cell Datasets

Expand Down Expand Up @@ -260,3 +256,30 @@ and data loading performance.
## LICENSE

BioNeMo-SCDL has an Apache 2.0 license, as found in the LICENSE file.

## Contributing

Please follow the guidelines for contributions to the BioNeMo Framework.

To contribute to SCDL, we recommend installing additional dependencies for development and
installing the SCDL package from source.

```bash
git clone https://github.com/NVIDIA/bionemo-framework.git
cd bionemo-framework/sub-packages/bionemo-scdl
pip install -e ".[test]"
```

### Tests

SCDL has its own tests. To run these tests, assuming you have pytest installed:

```
python -m pytest
```

To run a specific test:

```bash
python -m pytest tests/test_<test name>.py
```
2 changes: 1 addition & 1 deletion sub-packages/bionemo-scdl/VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0.0.7
0.1.0
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading