bionemo-recipes/recipes/evo2_megatron/README.md: 121 changes (114 additions, 7 deletions)

## Usage

### Example job

```
# If running on A6000s, you may need to disable P2P to avoid crashes:
export NCCL_P2P_DISABLE=1
# Run the job:
torchrun --nproc-per-node 2 --no-python \
train_evo2 \
--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
--model-size striped_hyena_1b_nv_parallel --max-steps 12 --eval-interval 10 \
--eval-iters 3 --mock-data \
--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
--tensor-model-parallel 1 \
--use-precision-aware-optimizer --dataset-seed 33 \
--seed 41 --spike-no-more-embedding-init \
--no-weight-decay-embeddings --cross-entropy-loss-fusion \
--align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
--decay-steps 100 --warmup-steps 10 \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
--attention-dropout 0.001 --hidden-dropout 0.001 \
--eod-pad-in-loss-mask --enable-preemption \
--log-interval 5 --debug-ddp-parity-freq 10 \
--wandb-project evo2-recipes-verification-tmp \
--wandb-run-name tmp_workstation_run_mock_data \
--result-dir tmpfp8 --no-renormalize-loss
```

### Example fine-tune from an existing checkpoint

First, convert the checkpoint from NeMo2 format (a temporary step until we upload the converted files).

Good checkpoint names to try are:

- `evo2/1b-8k-bf16:1.0` (model_size: `1b`)
- `evo2/7b-1m:1.0` (model_size: `7b_arc_longcontext`)
- `evo2/40b-1m-fp8-bf16:1.0` (model_size: `40b_arc_longcontext`)

Other than the 7b version, the other two are checkpoints fine-tuned by the BioNeMo team to support both FP8 and BF16
precision. The 7b version worked well in both FP8 and BF16 out of the box, so it was not fine-tuned further. If you do
want to use one of the FP8-sensitive checkpoints, like `evo2/40b-1m`, be sure to add the `--vortex-style-fp8` option to
the checkpoint conversion step below (a sketch follows the conversion example). Also note that although 8k versions of
the 7b and 40b checkpoints exist, it is advisable to use the longer-context versions, since they were trained further
and still run on shorter inputs.

See `download_bionemo_data --list-resources` for other checkpoint options and the full list of downloadable resources.
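
`download_bionemo_data` prints the local path of the downloaded artifact, so it can be used via command substitution,
as the conversion example below does. A minimal sketch:

```
# List the downloadable resources, then fetch one by name and capture its local path.
download_bionemo_data --list-resources
CKPT_PATH=$(download_bionemo_data evo2/1b-8k-bf16:1.0)
echo "$CKPT_PATH"
```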

```
CKPT_NAME=evo2/1b-8k-bf16:1.0
CKPT_OUT_DIR=evo2_1b_8k_bf16_mbridge
evo2_convert_nemo2_to_mbridge \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size 1b \
--seq-length 8192 \
--nemo2-ckpt-dir $(download_bionemo_data $CKPT_NAME) \
--mbridge-ckpt-dir $CKPT_OUT_DIR

```
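
If you are converting one of the FP8-sensitive checkpoints, such as `evo2/40b-1m`, add `--vortex-style-fp8` as noted
above. A minimal sketch under assumptions: the version tag, tokenizer, and `--seq-length` below are guesses, so confirm
them against `download_bionemo_data --list-resources` and your intended training setup.

```
CKPT_NAME=evo2/40b-1m:1.0  # assumed version tag; check --list-resources for the exact name
CKPT_OUT_DIR=evo2_40b_1m_mbridge
evo2_convert_nemo2_to_mbridge \
--vortex-style-fp8 \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size 40b_arc_longcontext \
--seq-length 8192 \
--nemo2-ckpt-dir $(download_bionemo_data $CKPT_NAME) \
--mbridge-ckpt-dir $CKPT_OUT_DIR
```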

Now run as before, but include the fine-tuned checkpoint directory you converted in the previous step via
`--finetune-ckpt-dir $CKPT_OUT_DIR`. If you run into problems with `bf16_with_fp8_current_scaling_mixed`, try
`bf16_mixed` instead.

```
torchrun --nproc-per-node 2 --no-python \
train_evo2 \
--hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_512 \
--model-size 1b --max-steps 12 --eval-interval 10 \
--eval-iters 3 --mock-data \
--micro-batch-size 16 --global-batch-size 32 --seq-length 1024 \
--tensor-model-parallel 1 \
--use-precision-aware-optimizer --dataset-seed 33 \
--seed 41 \
--cross-entropy-loss-fusion \
--align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
--decay-steps 100 --warmup-steps 10 \
--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed \
--no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
--attention-dropout 0.001 --hidden-dropout 0.001 \
--eod-pad-in-loss-mask --enable-preemption \
--log-interval 5 --debug-ddp-parity-freq 10 \
--result-dir tmpfp8-ft-example --no-renormalize-loss \
--finetune-ckpt-dir $CKPT_OUT_DIR
```

## Where do the custom command line programs come from?

See `pyproject.toml` for the console-script entries that map runnable programs like `train_evo2` and
`evo2_convert_nemo2_to_mbridge` to the code that implements them.
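
After installation, these entry points are available on your `PATH`. A quick sanity check, assuming each program
exposes the usual argparse `--help`:

```
# Confirm the console scripts declared in pyproject.toml are installed.
train_evo2 --help
evo2_convert_nemo2_to_mbridge --help
```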

## Docker build

```
docker build -t evo2_megatron_recipe-$(git rev-parse --short HEAD) .
```
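
To work inside the built image, a launch along these lines should work; the GPU, IPC, and mount options are
assumptions to adapt to your environment:

```
# Hypothetical interactive launch: expose all GPUs, share host IPC for NCCL, and mount the recipe directory.
docker run --rm -it --gpus all --ipc=host \
-v "$PWD":/workspace/evo2_megatron -w /workspace/evo2_megatron \
evo2_megatron_recipe-$(git rev-parse --short HEAD) bash
```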

## Performance and accuracy comparisons

NOTE: this section is largely a work in progress. It reflects the most up-to-date information we have, but may not
match the current state of the code base at any given time.

### Training accuracy convergence

We ran a 12-hour training run on 48 H100 GPUs to compare Megatron Bridge (mbridge) with NeMo2. We found that FP8
current scaling converges to the BF16 curves by around step 5,000, and that BF16 is comparable with NeMo2.
Interestingly, in NeMo2, BF16 and FP8 also followed nearly identical trajectories for the first 5k steps. Note that a
typical training run performs over 100k steps, so different behavior in the first 5k steps is less worrisome if the
endpoints are comparable.

![Training Convergence Comparison](assets/mbridge_to_nemo_training_convergence_7ksteps.png)

### Training performance comparisons

FP8 current scaling, which is expected to have better convergence properties than delayed scaling, performs nearly as
well as delayed scaling in mbridge. Even leaving multiple transformer layers in BF16 precision trains faster than FP8
delayed scaling in NeMo2.

| Evo2 1B Run | Seconds per step (lower is better) | Tokens/sec/GPU | Global Batch Size | Number of GPUs | Vocab Size |
| :----------------------------------------------: | :--------------------------------: | :------------: | :---------------: | :------------: | :--------: |
| MBridge BF16 | 6.10 | 26,859 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.38 | 30,453 | 960 | 48 | 256 |
| MBridge FP8 (current) | 5.44 | 28,755 | 960 | 48 | 512 |
| MBridge FP8 (current first/last two layers bf16) | 5.47 | 28,598 | 960 | 48 | 512 |
| Nemo2 FP8 (delayed) | 6.18 | 26,511 | 960 | 48 | 512 |

Activation memory optimizations have enabled context parallelism (CP) to work better with Evo2-style models in our
mbridge implementation than in the previous NeMo2 implementation. Since tensor parallelism (TP) requires more
node-to-node communication, you generally want to limit TP to your fastest interconnects, which typically means
keeping TP within a node of 8 GPUs. Evo2 would previously OOM with these more ideal configurations, requiring much
larger than typical degrees of TP to handle long-context training. With our latest changes to the Evo2 forward pass,
we can now handle more typical TP vs. CP configurations. This enables significantly faster step timing at long context
and scales up to a 2M context length: we have demonstrated small training runs at 2M context on only 512 H100 GPUs for
the 40b parameter model.

| Configuration | Precision | TP | CP | Number of Nodes | Number of GPUs | Context Length | Global Batch Size | Seconds per Step |
| :---------------: | :---------: | :-: | :-: | :-------------: | :------------: | :------------: | :---------------: | :--------------: |
| NeMo2 | fp8-delayed | 64 | 2 | 32 | 256 | 1M | 2 | 44 |
| NeMo2 | fp8-delayed | 8 | 16 | 32 | 256 | 1M | 2 | OOM |
| MBridge Optimized | bf16 | 8 | 16 | 32 | 256 | 1M | 2 | 30 |
| 2M Stress Test | bf16 | 8 | 32 | 64 | 512 | 2M | 2 | 48 |
bionemo-recipes/recipes/evo2_megatron/pyproject.toml: 1 change (1 addition, 0 deletions)

@@ -40,6 +40,7 @@ train_evo2 = "bionemo.evo2.run.train:main"
#predict_evo2 = "bionemo.evo2.run.predict:main"
preprocess_evo2 = "bionemo.evo2.data.preprocess:main"
splice_evo2 = "bionemo.evo2.data.transcript_extraction:main"
+evo2_convert_nemo2_to_mbridge = "bionemo.evo2.utils.checkpoint.nemo2_to_mbridge:main"
#evo2_convert_to_nemo2 = "bionemo.evo2.utils.checkpoint.convert_to_nemo:main"
#evo2_nemo2_to_hf = "bionemo.evo2.utils.checkpoint.nemo2_to_hf:main"
#evo2_remove_optimizer = "bionemo.evo2.utils.checkpoint.evo2_remove_optimizer:main"

In another changed file:

@@ -267,7 +267,6 @@ def _evo2_common(
        ),
        tokenizer=TokenizerConfig(
            tokenizer_type="HuggingFaceTokenizer",
-           hf_tokenizer_kwargs={"trust_remote_code": True},
            tokenizer_model=hf_tokenizer_model_or_path or "EleutherAI/gpt-neox-20b",
        ),
        checkpoint=CheckpointConfig(

In a third changed file:

@@ -710,9 +710,9 @@ def train(args: argparse.Namespace) -> None:
        recipe_kwargs["stride"] = args.stride
        recipe_kwargs["window_min_length_threshold"] = args.window_min_length_threshold
        recipe_kwargs["rc_aug"] = args.rc_aug
-   elif args.dataset_config_path:
+   elif args.dataset_config:
        recipe_kwargs["dataset_dir"] = args.dataset_dir
-       recipe_kwargs["dataset_config_path"] = args.dataset_config_path
+       recipe_kwargs["dataset_config_path"] = args.dataset_config

    recipe_kwargs["pad_eod_loss_mask"] = args.eod_pad_in_loss_mask

@@ -918,6 +918,7 @@ def train(args: argparse.Namespace) -> None:
    if args.finetune_ckpt_dir:
        cfg.checkpoint.finetune = True
        cfg.checkpoint.pretrained_checkpoint = args.finetune_ckpt_dir
+       cfg.checkpoint.dist_ckpt_strictness = "ignore_all"  # necessary unfortunately to avoid extra_state issues.
    if args.nvidia_fault_tolerance:
        cfg.ft = FaultToleranceConfig(
            enable_ft_package=True,