Commit 69b6433: Update README.md
Authored and signed off by: savitha-eng <savithas@nvidia.com>
Parent: 57f30e2

File tree: 1 file changed, +16 −10 lines

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md (16 additions, 10 deletions)
```diff
@@ -2,10 +2,13 @@
 
 This folder demonstrates how to train TE-accelerated Llama 3 with a native PyTorch training loop for autoregressive DNA token prediction on the metagenome subset of the OpenGenome2 genomic dataset. It uses fully sharded data parallel (FSDP2), THD sequence packing, a custom nucleotide tokenizer, and supports FP32 master weights. Convergence has been validated against the Megatron/ShardedEden OpenGenome2 (OG2) baseline.
 
+### How to deploy this recipe on cloud providers
+
+🚧 Under development
+
 ## How to use this recipe
 
-This folder is a self-contained training example. You can download a zipped directory of this recipe
-alone by clicking
+This folder contains an independent, minimal training example. It does not depend on any other code in the top-level bionemo-framework repository. You can download a zipped directory of this folder alone by clicking
 [here](https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te).
 
 ## Supported Models and Training Features
```
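The THD sequence packing mentioned above concatenates variable-length genomic sequences into one token stream per batch slot and tracks boundaries with cumulative sequence lengths (`cu_seqlens`), which TransformerEngine-style attention kernels use to keep sequences from attending across pack boundaries. A minimal greedy sketch of the idea; the function and all names are illustrative, not this recipe's actual collator:

```python
# Illustrative sketch of THD-style sequence packing (not the recipe's collator).
# Variable-length tokenized sequences are concatenated into a fixed token budget,
# and cu_seqlens records the cumulative sequence boundaries within each pack.

def pack_sequences(seqs, max_tokens):
    """Greedily pack sequences into bins of at most max_tokens tokens each.

    Returns a list of (flat_tokens, cu_seqlens) pairs.
    """
    packs = []
    current, cu = [], [0]
    for seq in seqs:
        seq = seq[:max_tokens]  # truncate anything longer than the budget
        if current and len(current) + len(seq) > max_tokens:
            packs.append((current, cu))
            current, cu = [], [0]
        current.extend(seq)
        cu.append(len(current))
    if current:
        packs.append((current, cu))
    return packs


seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
for tokens, cu_seqlens in pack_sequences(seqs, max_tokens=8):
    print(tokens, cu_seqlens)
```

With a real collator the flat token list becomes the "T" dimension of the THD layout, and `cu_seqlens` is passed to the attention kernel in place of an attention mask.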
```diff
@@ -16,6 +19,7 @@ alone by clicking
 
 ✅: Supported <br/>
 🚧: Under development <br/>
+❌: Not supported <br/>
 
 \[1\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 9.0 and above (Hopper+) <br/>
 \[2\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 10.0 and 10.3 (Blackwell), 12.0 support pending <br/>
```
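The two capability footnotes can be checked mechanically before launching a run. A hedged sketch (the helper name is ours; on real hardware the `(major, minor)` tuple would come from `torch.cuda.get_device_capability()`):

```python
# Illustrative helper, not part of the recipe: map a CUDA compute capability
# (major, minor) tuple onto the two footnote requirements above.
# On real hardware, obtain the tuple with torch.cuda.get_device_capability().

def feature_support(major: int, minor: int) -> dict:
    capability = (major, minor)
    return {
        # Footnote [1]: compute capability 9.0 and above (Hopper+)
        "footnote_1": capability >= (9, 0),
        # Footnote [2]: exactly 10.0 or 10.3 (Blackwell); 12.0 support pending
        "footnote_2": capability in ((10, 0), (10, 3)),
    }

print(feature_support(9, 0))   # Hopper satisfies footnote [1] only
print(feature_support(10, 3))  # Blackwell satisfies both
```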
````diff
@@ -26,35 +30,37 @@ weight decay grouping, and the genomic data collator.
 ## Installing Dependencies
 
 The easiest way to get started is to use the provided Dockerfile, which uses an NVIDIA PyTorch base
-image with optimized PyTorch and TransformerEngine. To build and run:
+image to provide optimized versions of PyTorch and TransformerEngine. To build the container, run:
 
 ```bash
 docker build -t og2_llama_te .
 docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bionemo og2_llama_te /bin/bash
 ```
 
-Alternatively, install dependencies manually in an environment with CUDA support. See
-`requirements.txt` for the list of dependencies.
+Alternatively, the dependencies can be installed manually in an environment with CUDA support. See `requirements.txt`
+for the list of dependencies.
 
````
```diff
 ## Convergence Benchmarks (vs Megatron Baseline)
 
 Our baseline is the Megatron/NeMo Llama 3 model trained with the BCR ShardedEden dataloader. To
-improve convergence and training stability, we adopted features used in the Megatron stack:
+improve convergence and training stability for the OG2 recipe, we adopted features used in the Megatron stack:
 Spike-No-More embeddings, scaled initialization of output projections (proj/fc2), and BF16 compute
 with FP32 master weights.
 
-This recipe uses THD sequence packing, whereas the Megatron baseline uses a standard BSHD dataloader.
+However, this recipe uses THD sequence packing for training, whereas the Megatron baseline uses a standard BSHD dataloader.
 On the metagenome dataset, the median sequence length is ~2.2k and the average is ~4k, so with THD we
 process roughly 2–3× more tokens per training step (less padding waste). As a result, this recipe
-achieves significantly better convergence than the Megatron baseline at a matched global batch size.
+achieves significantly better convergence [TODO: add %] than the Megatron baseline at a matched global batch size.
 Both runs use FP32 master weights; the Megatron baseline uses FP8 training and we use BF16. Reported
-results use GBS 384 on 6× H100 nodes (48 GPUs).
+results use GBS 384 on 6× H100 nodes (48 GPUs). This precision difference (our BF16/FP32 vs the baseline's
+FP8/FP32) may also contribute to the baseline's lower test performance.
```

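The 2–3× throughput claim above can be sanity-checked with back-of-envelope arithmetic: under BSHD each batch slot carries one sequence padded to the context length, so useful tokens per step scale with the mean sequence length, while under THD nearly the whole token budget is real tokens. A sketch under an assumed 8192-token context window (our assumption; the context length is not stated in this section):

```python
# Back-of-envelope comparison of useful tokens per step, BSHD vs THD packing.
# CONTEXT_LEN is an assumption for illustration; the other numbers are from the text.
CONTEXT_LEN = 8192     # assumed context window (not stated in this README section)
MEAN_SEQ_LEN = 4000    # "~4k" average sequence length on the metagenome subset
GBS = 384              # global batch size used for the reported runs

bshd_tokens = GBS * MEAN_SEQ_LEN   # one padded sequence per slot: only real tokens count
thd_tokens = GBS * CONTEXT_LEN     # packed slots are almost entirely real tokens

ratio = thd_tokens / bshd_tokens
print(f"THD processes ~{ratio:.2f}x the useful tokens per step")
```

Shorter assumed context windows push the ratio toward 1×, longer ones toward 3× and beyond, which is consistent with the 2–3× range quoted above.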
```diff
+TODO: Fill with final results/replace with final image
 <p align="center">
 <img src="assets/og2_convergence_vs_megatron.png" alt="OpenGenome2 7B convergence vs Megatron" width="80%" />
 </p>
 
-| Model                      | Step / checkpoint | Perplexity | Train loss | Test loss  |
+| Model                      | Step / checkpoint | Train loss | Test loss  | Perplexity |
 | -------------------------- | ----------------- | ---------- | ---------- | ---------- |
 | This recipe (OG2 7B)       | -                 | -          |            |            |
 | Megatron baseline (OG2 7B) | -                 |            |            |            |
```
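Once losses land in the table, the perplexity column follows directly from the test loss, assuming the loss is mean per-token cross-entropy in nats: perplexity = exp(loss). A minimal helper with an illustrative input (the real values are still TBD above):

```python
import math

def perplexity(mean_ce_loss_nats: float) -> float:
    """Perplexity from mean per-token cross-entropy loss (natural-log base)."""
    return math.exp(mean_ce_loss_nats)

# Illustrative input only; not a measured result from this recipe.
print(round(perplexity(1.10), 3))
```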
