
Commit 2214c45

NeMo2 to Megatron Bridge Checkpoint Conversion and Fine-tuning (#1411)
### Description

Example scripts demonstrating fine-tuning from a starting checkpoint, and a checkpoint conversion script for migrating nemo2 checkpoints to megatron bridge.

#### Usage

See README.md in the `evo2_megatron` recipe in this PR for usage.

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebook execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single-GPU integration tests marked as `@pytest.mark.slow` for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running all bionemo2 tests.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see [CONTRIBUTING](CONTRIBUTING.md).

> [!NOTE]
> By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---------

Signed-off-by: John St. John <jstjohn@nvidia.com>
1 parent 29241e4 · commit 2214c45

File tree

8 files changed: +685 additions, -12 deletions


bionemo-recipes/recipes/evo2_megatron/README.md

Lines changed: 114 additions & 7 deletions
@@ -19,35 +19,142 @@ uv pip install -c pip-constraints.txt -e . --no-build-isolation
 
 ## Usage
 
+### Example job
+
 ```
 # 3. Run an example job
 ## 2. if on a6000s, you may need to disable p2p to avoid crashing
 export NCCL_P2P_DISABLE=1
 ## 3. Run the job:
-torchrun --nproc-per-node 8 --no-python \
+torchrun --nproc-per-node 2 --no-python \
 train_evo2 \
 --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
 --model-size striped_hyena_1b_nv_parallel --max-steps 12 --eval-interval 10 \
 --eval-iters 3 --mock-data \
---micro-batch-size 32 --global-batch-size 256 --seq-length 1024 \
+--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
 --tensor-model-parallel 1 \
 --use-precision-aware-optimizer --dataset-seed 33 \
---seed 41 --ckpt-async-save --spike-no-more-embedding-init \
+--seed 41 --spike-no-more-embedding-init \
 --no-weight-decay-embeddings --cross-entropy-loss-fusion \
 --align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
 --decay-steps 100 --warmup-steps 10 \
---mixed-precision-recipe bf16-mixed \
+--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
 --no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
 --attention-dropout 0.001 --hidden-dropout 0.001 \
 --eod-pad-in-loss-mask --enable-preemption \
 --log-interval 5 --debug-ddp-parity-freq 10 \
---wandb-project evo2-recipes-verification-tmp \
---wandb-run-name tmp_workstation_run_mock_data \
---result-dir tmpbf16 --no-renormalize-loss
+--result-dir tmpfp8 --no-renormalize-loss
+```
+
+### Example fine-tune from an existing checkpoint
+
+First, convert the checkpoint from nemo2 format (a temporary step until we upload the new files).
+
+Good checkpoint names to try are:
+
+- evo2/1b-8k-bf16:1.0 (model_size: 1b)
+- evo2/7b-1m:1.0 (model_size: 7b_arc_longcontext)
+- evo2/40b-1m-fp8-bf16:1.0 (model_size: 40b_arc_longcontext)
+
+Other than the 7b version, the other two are checkpoints fine-tuned by the BioNeMo team to support both FP8 and BF16
+precision. The 7b version worked well in both FP8 and BF16 out of the box, so it was not fine-tuned further. If you do
+want to use one of the FP8-sensitive checkpoints, like `evo2/40b-1m`, then be sure to add the `--vortex-style-fp8`
+option to the checkpoint conversion step below. Also note that although 8k versions of the 7b and 40b checkpoints exist,
+it is advisable to use the longer-context versions, since they were trained further and still run on shorter inputs.
+
+See `download_bionemo_data --list-resources` for other checkpoint options and a list of available
+downloadable resources.
+
 ```
+CKPT_NAME=evo2/1b-8k-bf16:1.0
+CKPT_OUT_DIR=evo2_1b_8k_bf16_mbridge
+evo2_convert_nemo2_to_mbridge \
+--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
+--tokenizer-path tokenizers/nucleotide_fast_tokenizer_512 \
+--model-size 1b \
+--seq-length 8192 \
+--nemo2-ckpt-dir $(download_bionemo_data $CKPT_NAME) \
+--mbridge-ckpt-dir $CKPT_OUT_DIR
+
+```
+
+Now run like before, but include the fine-tuned checkpoint directory you converted in the previous step with
+`--finetune-ckpt-dir $CKPT_OUT_DIR`. Also, if you have problems with `bf16_with_fp8_current_scaling_mixed`, try
+`bf16_mixed`.
+
+```
+torchrun --nproc-per-node 2 --no-python \
+train_evo2 \
+--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_512 \
+--model-size 1b --max-steps 12 --eval-interval 10 \
+--eval-iters 3 --mock-data \
+--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
+--tensor-model-parallel 1 \
+--use-precision-aware-optimizer --dataset-seed 33 \
+--seed 41 \
+--cross-entropy-loss-fusion \
+--align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
+--decay-steps 100 --warmup-steps 10 \
+--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
+--no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
+--attention-dropout 0.001 --hidden-dropout 0.001 \
+--eod-pad-in-loss-mask --enable-preemption \
+--log-interval 5 --debug-ddp-parity-freq 10 \
+--result-dir tmpfp8-ft-example --no-renormalize-loss \
+--finetune-ckpt-dir $CKPT_OUT_DIR
+```
+
+## Where do the custom command line programs come from?
+
+See `pyproject.toml` for where runnable programs like `train_evo2` and `evo2_convert_nemo2_to_mbridge` are implemented
+in code.
 
 ## Docker build
 
 ```
 docker build -t evo2_megatron_recipe-$(git rev-parse --short HEAD) .
 ```
+
+## Performance and accuracy comparisons
+
+NOTE: this section is largely a work in progress. It reflects the most up-to-date information, but may not reflect the
+current state of the code base at any given time.
+
+### Training accuracy convergence
+
+We ran a 12-hour, 48 H100 GPU training run to compare megatron bridge with nemo2. We found that FP8 current scaling
+converges to the bf16 curves by around step 5,000, and that bf16 is comparable with nemo2. Interestingly, in nemo2,
+bf16 and fp8 followed nearly identical trajectories for the first 5k steps as well. Note that in a typical training run
+we perform over 100k steps, so different behavior in the first 5k steps is less worrisome if the endpoints are
+comparable.
+
+![Training Convergence Comparison](assets/mbridge_to_nemo_training_convergence_7ksteps.png)
+
+### Training performance comparisons
+
+FP8 current scaling, which is supposed to have better convergence properties than delayed scaling, performs nearly as
+well as delayed scaling in mbridge. Even leaving multiple transformer layers in bf16 precision trains faster than fp8
+delayed scaling in nemo2.
+
+| Evo2 1B Run | Seconds per step (lower is better) | Tokens/sec/GPU | Global Batch Size | Number of GPUs | Vocab Size |
+| :----------------------------------------------: | :--------------------------------: | :------------: | :---------------: | :------------: | :--------: |
+| MBridge BF16 | 6.10 | 26,859 | 960 | 48 | 256 |
+| MBridge FP8 (delayed) | 5.38 | 30,453 | 960 | 48 | 256 |
+| MBridge FP8 (current) | 5.44 | 28,755 | 960 | 48 | 512 |
+| MBridge FP8 (current, first/last two layers bf16) | 5.47 | 28,598 | 960 | 48 | 512 |
+| Nemo2 FP8 (delayed) | 6.18 | 26,511 | 960 | 48 | 512 |
+
+Activation memory optimizations have enabled context parallelism to work better with evo2-style models in our mbridge
+implementation than in the previous nemo2 implementation. Since TP requires more node-to-node communication, you generally
+want to limit TP to your fastest interconnects, which are typically configured in nodes of 8 GPUs. Evo2 would previously
+OOM with these more ideal configurations, requiring much larger than typical levels of TP to handle long-context
+training. With our latest changes to the evo2 forward pass, we can now handle more typical TP vs. CP configurations.
+This enables significantly faster step timing at long context and has let us demonstrate up to 2M context length. We
+have currently demonstrated small training runs at 2M context on only 512 H100 GPUs for the 40b parameter model.
+
+| Configuration | Precision | TP | CP | Number of Nodes | Number of GPUs | Context Length | Global Batch Size | Seconds per Step |
+| :---------------: | :---------: | :-: | :-: | :-------------: | :------------: | :------------: | :---------------: | :--------------: |
+| NeMo2 | fp8-delayed | 64 | 2 | 32 | 256 | 1M | 2 | 44 |
+| NeMo2 | fp8-delayed | 8 | 16 | 32 | 256 | 1M | 2 | OOM |
+| MBridge Optimized | bf16 | 8 | 16 | 32 | 256 | 1M | 2 | 30 |
+| 2M Stress Test | bf16 | 8 | 32 | 64 | 512 | 2M | 2 | 48 |
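
The Tokens/sec/GPU column in the first table above follows directly from global batch size, sequence length, step time, and GPU count. A minimal sketch of that arithmetic, assuming these runs used a sequence length of 8192 (the sequence length for the convergence runs is not stated in the README):

```
# Sanity check of the Tokens/sec/GPU column: tokens per step divided by
# wall-clock step time and GPU count.
# Assumption: the convergence runs used a sequence length of 8192 tokens.

def tokens_per_sec_per_gpu(global_batch_size, seq_length, seconds_per_step, num_gpus):
    """Tokens processed per step, divided by step time and GPU count."""
    return global_batch_size * seq_length / (seconds_per_step * num_gpus)

print(round(tokens_per_sec_per_gpu(960, 8192, 6.10, 48)))  # 26859 -> MBridge BF16 row
print(round(tokens_per_sec_per_gpu(960, 8192, 5.38, 48)))  # ~30454 -> MBridge FP8 (delayed) row, 30,453 in the table
```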
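
The long-context table can be cross-checked the same way against the GPU counts. A rough accounting sketch, assuming pipeline parallelism of 1 and a micro batch size of 1 (neither is stated in the README):

```
# Back-of-the-envelope parallelism accounting for the long-context table.
# Assumptions (not stated in the README): pipeline parallelism = 1, micro batch size = 1.

def data_parallel_size(num_gpus, tp, cp, pp=1):
    """GPUs per model replica is tp * cp * pp; the remaining factor is the data-parallel size."""
    gpus_per_replica = tp * cp * pp
    assert num_gpus % gpus_per_replica == 0
    return num_gpus // gpus_per_replica

print(data_parallel_size(256, tp=64, cp=2))   # 2 -> NeMo2 baseline row
print(data_parallel_size(256, tp=8, cp=16))   # 2 -> MBridge Optimized row
print(data_parallel_size(512, tp=8, cp=32))   # 2 -> 2M stress test row
# All three give 2 data-parallel replicas, consistent with the global batch size of 2.
```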

bionemo-recipes/recipes/evo2_megatron/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ train_evo2 = "bionemo.evo2.run.train:main"
 #predict_evo2 = "bionemo.evo2.run.predict:main"
 preprocess_evo2 = "bionemo.evo2.data.preprocess:main"
 splice_evo2 = "bionemo.evo2.data.transcript_extraction:main"
+evo2_convert_nemo2_to_mbridge = "bionemo.evo2.utils.checkpoint.nemo2_to_mbridge:main"
 #evo2_convert_to_nemo2 = "bionemo.evo2.utils.checkpoint.convert_to_nemo:main"
 #evo2_nemo2_to_hf = "bionemo.evo2.utils.checkpoint.nemo2_to_hf:main"
 #evo2_remove_optimizer = "bionemo.evo2.utils.checkpoint.evo2_remove_optimizer:main"
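
The new `evo2_convert_nemo2_to_mbridge` entry above is the console script that the README's conversion step calls. A minimal sketch for confirming how these entry points resolve, assuming Python 3.10+ and that the recipe package is installed in the current environment:

```
# Lists the console scripts declared by this recipe and where they resolve.
from importlib.metadata import entry_points

for ep in entry_points(group="console_scripts"):
    if ep.name in {"train_evo2", "evo2_convert_nemo2_to_mbridge"}:
        print(f"{ep.name} -> {ep.value}")
# Expected, per the pyproject.toml entries above:
# train_evo2 -> bionemo.evo2.run.train:main
# evo2_convert_nemo2_to_mbridge -> bionemo.evo2.utils.checkpoint.nemo2_to_mbridge:main
```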

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/recipes/evo2.py

Lines changed: 0 additions & 1 deletion
@@ -267,7 +267,6 @@ def _evo2_common(
         ),
         tokenizer=TokenizerConfig(
             tokenizer_type="HuggingFaceTokenizer",
-            hf_tokenizer_kwargs={"trust_remote_code": True},
             tokenizer_model=hf_tokenizer_model_or_path or "EleutherAI/gpt-neox-20b",
         ),
         checkpoint=CheckpointConfig(

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/train.py

Lines changed: 3 additions & 2 deletions
@@ -710,9 +710,9 @@ def train(args: argparse.Namespace) -> None:
         recipe_kwargs["stride"] = args.stride
         recipe_kwargs["window_min_length_threshold"] = args.window_min_length_threshold
         recipe_kwargs["rc_aug"] = args.rc_aug
-    elif args.dataset_config_path:
+    elif args.dataset_config:
         recipe_kwargs["dataset_dir"] = args.dataset_dir
-        recipe_kwargs["dataset_config_path"] = args.dataset_config_path
+        recipe_kwargs["dataset_config_path"] = args.dataset_config
 
     recipe_kwargs["pad_eod_loss_mask"] = args.eod_pad_in_loss_mask
 
@@ -918,6 +918,7 @@ def train(args: argparse.Namespace) -> None:
     if args.finetune_ckpt_dir:
         cfg.checkpoint.finetune = True
         cfg.checkpoint.pretrained_checkpoint = args.finetune_ckpt_dir
+        cfg.checkpoint.dist_ckpt_strictness = "ignore_all"  # necessary unfortunately to avoid extra_state issues.
     if args.nvidia_fault_tolerance:
         cfg.ft = FaultToleranceConfig(
             enable_ft_package=True,
