Commit 1a1edf0

nvmvle and jwilber authored
Add ESM2 Finetuning Benchmark Configuration (#964)
### Description

This PR adds comprehensive benchmark configurations for ESM2 finetuning to support performance testing and validation. The changes introduce two new benchmark configurations (partial-conv and perf) along with enhanced finetuning capabilities, including checkpointing control, TensorBoard logging, and a TFLOPS measurement callback.

Key enhancements include:

- Added ESM2 finetuning YAML configurations for partial-conv and performance benchmarks
- Implemented checkpointing control with a `--disable-checkpointing` option for faster benchmark runs
- Added TensorBoard logging support for training-metrics visualization
- Introduced a TFLOPS callback option to measure and log computational performance
- Enhanced training control parameters, including max_steps, early stopping, and batch-size configuration

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebook validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a `pull-request/`-prefixed branch in the source repository (e.g. `pull-request/123`).
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This must be done for each new commit.

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [ ] All existing tests pass successfully

Signed-off-by: My Le <mvle@nvidia.com>

---------

Co-authored-by: Jared Wilber <jwilber@nvidia.com>
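The `--disable-checkpointing` option mentioned above follows the standard argparse flag-inversion pattern: the user-facing flag is negative, but the stored value is a positively named boolean that defaults to on. A minimal, self-contained sketch of that pattern (not the actual `finetune_esm2.py` parser):

```python
import argparse

# Flag-inversion pattern: passing --disable-checkpointing stores False into
# the positively named destination "create_checkpoint_callback"; omitting
# the flag leaves the default of True.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--disable-checkpointing",
    action="store_false",  # the flag's presence stores False
    default=True,          # absence keeps checkpointing enabled
    dest="create_checkpoint_callback",
    help="Disable creating a ModelCheckpoint callback.",
)

default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--disable-checkpointing"])
print(default_args.create_checkpoint_callback)   # True
print(disabled_args.create_checkpoint_callback)  # False
```

This keeps downstream code free of double negatives: callers test `if args.create_checkpoint_callback:` rather than `if not args.disable_checkpointing:`.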
1 parent 612ea21 commit 1a1edf0

File tree

4 files changed

+654
-154
lines changed
Lines changed: 76 additions & 0 deletions
```yaml
scope: partial-conv
time_limit: 14400
key_segments:
  # Modify keys to be renamed (str) or excluded (False) from run identifier. By default, all args under script_args are included.
  train_data_path: False
  valid_data_path: False
  data_base_path: False
  num_workers: False
  limit_val_batches: False
  limit_test_batches: False
  val_check_interval: False
  dataset_class: False
  task_type: False
  config_class: False
  experiment_name: False
  workspace: False
  restore_from_checkpoint_path: False
script_args:
  # All arguments referenced in the script string must be specified here.
  # Arguments not referenced in the script string must have the 'arg' field specified.
  # See jet/core/configs.py for the specification of the configuration class
  workspace: /workspace/bionemo2
  data_base_path: /data/FLIP
  restore_from_checkpoint_path: /data/esm2_650M_nemo2
  nodes: [1]
  gpus: 8
  model: esm2
  variant: finetune
  config_name: 650M
  precision: [bf16-mixed]
  num_workers: 8
  limit_val_batches: 100  # original 1000, 100 is enough for validation and produces good enough curves
  limit_test_batches: 100
  task: seq_classification
  train_data_path: scl/train/x000.csv
  valid_data_path: scl/val/x000.csv
  task_type: classification
  config_class: ESM2FineTuneSeqConfig
  dataset_class: InMemorySingleValueDataset
  max_steps: 30000
  stop_steps: 3000
  experiment_name: seq-level-classification
  batch_size: 64
  val_check_interval: 100
script: |-
  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
    --train-data-path=${data_base_path}/${train_data_path} \
    --valid-data-path=${data_base_path}/${valid_data_path} \
    --restore-from-checkpoint-path=${restore_from_checkpoint_path} \
    --task-type=${task_type} \
    --config-class=${config_class} \
    --dataset-class=${dataset_class} \
    --num-steps=${max_steps} \
    --experiment-name=${experiment_name}_${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
    --lr=0.0005 \
    --result-dir=${tensorboard_dir} \
    --micro-batch-size=${batch_size} \
    --limit-val-batches=${limit_val_batches} \
    --limit-test-batches=${limit_test_batches} \
    --precision=${precision} \
    --label-column=scl_label \
    --num-gpus=${gpus} \
    --num-nodes=${nodes} \
    --accumulate-grad-batches=2 \
    --val-check-interval=${val_check_interval} \
    --num-dataset-workers=${num_workers} \
    --wandb-project=${wandb_project_name} \
    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
    --create-tensorboard-logger \
    --encoder-frozen \
    --mlp-ft-dropout=0.25 \
    --mlp-hidden-size=256 \
    --mlp-target-size=10 \
    --disable-checkpointing \
    --early-stop-on-step=${stop_steps} \
    --create-tflops-callback;
```
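The `script` block is a shell template whose `${...}` placeholders are filled in from `script_args` by the benchmark harness (the actual mechanics live in jet/core/configs.py, which is not part of this diff). As a rough illustration only, assuming simple `${var}` expansion semantics:

```python
from string import Template

# Hypothetical stand-in for the harness's substitution step; the real
# implementation in jet/core/configs.py is not shown in this commit.
script_args = {"variant": "finetune", "model": "esm2", "batch_size": 64, "max_steps": 30000}

# A shortened version of the script template above.
command = Template("${variant}_${model} --micro-batch-size=${batch_size} --num-steps=${max_steps}")
rendered = command.substitute(script_args)
print(rendered)  # finetune_esm2 --micro-batch-size=64 --num-steps=30000
```

Note that `${variant}_${model}` expands to the entrypoint name (`finetune_esm2`), so the same template serves multiple model/variant combinations.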
Lines changed: 93 additions & 0 deletions
```yaml
scope: perf
time_limit: 3600
key_segments:
  # Modify keys to be renamed (str) or excluded (False) from run identifier. By default, all args under script_args are included.
  train_data_path: False
  valid_data_path: False
  data_base_path: False
  limit_val_batches: False
  limit_test_batches: False
  val_check_interval: False
  dataset_class: False
  task_type: False
  config_class: False
  num_workers: False
  experiment_name: False
  workspace: False
  restore_from_checkpoint_path: False
script_args:
  # All arguments referenced in the script string must be specified here.
  # Arguments not referenced in the script string must have the 'arg' field specified.
  # See jet/core/configs.py for the specification of the configuration class
  workspace: /workspace/bionemo2
  data_base_path: /data/FLIP
  restore_from_checkpoint_path: /data/esm2_650M_nemo2
  gpus: 8
  model: esm2
  variant: finetune
  config_name: 650M
  precision: [bf16-mixed]
  num_workers: 8
  limit_val_batches: 1
  limit_test_batches: 1
  task: seq_classification
  train_data_path: scl/train/x000.csv
  valid_data_path: scl/val/x000.csv
  task_type: classification
  config_class: ESM2FineTuneSeqConfig
  dataset_class: InMemorySingleValueDataset
  max_steps: 30000
  stop_steps: 300
  experiment_name: seq-level-classification
  val_check_interval: 100
products:
  - nodes: 1
    batch_size: 16
    pp: 1
    tp: 1
  - nodes: 1
    batch_size: 64
    pp: 1
    tp: 1
  - nodes: 2
    batch_size: 16
    pp: 1
    tp: 1
  - nodes: 2
    batch_size: 64
    pp: 1
    tp: 1
script: |-
  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
    --train-data-path=${data_base_path}/${train_data_path} \
    --valid-data-path=${data_base_path}/${valid_data_path} \
    --restore-from-checkpoint-path=${restore_from_checkpoint_path} \
    --task-type=${task_type} \
    --config-class=${config_class} \
    --dataset-class=${dataset_class} \
    --num-steps=${max_steps} \
    --experiment-name=${experiment_name}_${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec_tp${tp}_pp_${pp} \
    --lr=0.0005 \
    --result-dir=${tensorboard_dir} \
    --micro-batch-size=${batch_size} \
    --limit-val-batches=${limit_val_batches} \
    --limit-test-batches=${limit_test_batches} \
    --precision=${precision} \
    --label-column=scl_label \
    --num-gpus=${gpus} \
    --num-nodes=${nodes} \
    --accumulate-grad-batches=1 \
    --val-check-interval=${val_check_interval} \
    --num-dataset-workers=${num_workers} \
    --wandb-project=${wandb_project_name} \
    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
    --create-tensorboard-logger \
    --encoder-frozen \
    --mlp-ft-dropout=0.25 \
    --mlp-hidden-size=256 \
    --mlp-target-size=10 \
    --disable-checkpointing \
    --pipeline-model-parallel-size=${pp} \
    --tensor-model-parallel-size=${tp} \
    --early-stop-on-step=${stop_steps} \
    --create-tflops-callback;
```
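Each `products` entry defines one run in the performance sweep. Under the usual Megatron accounting (an assumption here, not something this config states explicitly), the data-parallel size is `nodes * gpus / (tp * pp)` and the effective global batch size is the micro-batch size times the data-parallel size times gradient accumulation (1 in this config). A small sketch of that arithmetic for the four entries above:

```python
# Sketch of the effective global batch size per sweep entry, assuming the
# standard Megatron formula: global = micro * (nodes*gpus // (tp*pp)) * grad_accum.
products = [
    {"nodes": 1, "batch_size": 16, "pp": 1, "tp": 1},
    {"nodes": 1, "batch_size": 64, "pp": 1, "tp": 1},
    {"nodes": 2, "batch_size": 16, "pp": 1, "tp": 1},
    {"nodes": 2, "batch_size": 64, "pp": 1, "tp": 1},
]
GPUS_PER_NODE = 8  # matches `gpus: 8` in the config above
GRAD_ACCUM = 1     # matches --accumulate-grad-batches=1

def global_batch_size(entry: dict) -> int:
    data_parallel = entry["nodes"] * GPUS_PER_NODE // (entry["tp"] * entry["pp"])
    return entry["batch_size"] * data_parallel * GRAD_ACCUM

sizes = [global_batch_size(e) for e in products]
print(sizes)  # [128, 512, 256, 1024]
```

With `tp` and `pp` both 1, every GPU is a data-parallel replica, so doubling the node count doubles the global batch size at a fixed micro-batch size.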

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py

Lines changed: 68 additions & 24 deletions
```diff
@@ -26,7 +26,9 @@
 from nemo.collections import llm
 from nemo.lightning import resume
 from nemo.lightning.pytorch import callbacks as nl_callbacks
+from nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback
 from nemo.lightning.pytorch.optim import MegatronOptimizerModule
+from nemo.utils.exp_manager import TimingCallback

 from bionemo.core.utils.dtypes import PrecisionTypes, get_autocast_dtype
 from bionemo.esm2.data.tokenizer import get_tokenizer
@@ -127,7 +129,7 @@ def get_parser():

     # Checkpoint parameters
     parser.add_argument("--create-tensorboard-logger", action="store_true", help="Create tensorboard logger")
-    parser.add_argument("--restore-from-checkpoint-path", type=Path, default=None, help="Restore from checkpoint")
+    parser.add_argument("--restore-from-checkpoint-path", type=Path, required=True, help="Restore from checkpoint")
     parser.add_argument("--save-last-checkpoint", action="store_true", default=True, help="Save last checkpoint")
     parser.add_argument(
         "--metric-to-monitor-for-checkpoints", type=str, default="val_loss", help="Metric to monitor for checkpoints"
@@ -166,6 +168,28 @@ def get_parser():
     parser.add_argument("--lora-checkpoint-path", type=Path, default=None, help="LoRA checkpoint path")
     parser.add_argument("--lora-finetune", action="store_true", help="Use LoRA fine-tuning")

+    parser.add_argument(
+        "--disable-checkpointing",
+        action="store_false",
+        default=True,
+        dest="create_checkpoint_callback",
+        help="Disable creating a ModelCheckpoint callback.",
+    )
+
+    parser.add_argument(
+        "--early-stop-on-step",
+        type=int,
+        default=None,
+        help="Stop training on this step, if set. This may be useful for testing or debugging purposes.",
+    )
+
+    parser.add_argument(
+        "--create-tflops-callback",
+        action="store_true",
+        default=False,
+        help="Enable tflops calculation callback. Default is False.",
+    )
+
     return parser


@@ -233,7 +257,10 @@ def train_model(
     labels_mask_column: Optional[str] = None,
     lora_checkpoint_path: Optional[Path] = None,
     lora_finetune: bool = False,
-) -> Tuple[Path, Callback | None, nl.Trainer]:
+    create_checkpoint_callback: bool = True,
+    early_stop_on_step: Optional[int] = None,
+    create_tflops_callback: bool = False,
+) -> Tuple[Optional[Path], Callback | None, nl.Trainer]:
     config_class = SUPPORTED_CONFIGS[config_class]
     dataset_class = SUPPORTED_DATASETS[dataset_class]

@@ -298,6 +325,7 @@ def train_model(
         RichModelSummary(max_depth=4),
         LearningRateMonitor(),
         nl_callbacks.PreemptionCallback(),
+        TimingCallback(),
     ]
     if metric_tracker is not None:
         callbacks.append(metric_tracker)
@@ -411,24 +439,44 @@ def train_model(
         initialize_tensorboard_logger=create_tensorboard_logger,
         wandb_config=wandb_config,
     )
-    # Configure our custom Checkpointer
-    checkpoint_path = str(Path(nemo_logger.save_dir) / "checkpoints")
-    checkpoint_callback = nl_callbacks.ModelCheckpoint(
-        dirpath=checkpoint_path,
-        save_last=save_last_checkpoint,
-        monitor=metric_to_monitor_for_checkpoints,  # "val_loss",
-        save_top_k=save_top_k,
-        every_n_train_steps=val_check_interval,
-        always_save_context=True,  # Enables the .nemo file-like checkpointing where all IOMixins are under SerDe
-        filename="checkpoint-{step}-{consumed_samples}",  # Including step and consumed_samples in the checkpoint filename prevents duplicate filenames and bugs related to this.
-        save_weights_only=False,
-        save_optim_on_train_end=True,
-    )
-    callbacks.append(checkpoint_callback)
+
+    if create_checkpoint_callback:
+        # Configure our custom Checkpointer
+        checkpoint_path = str(Path(nemo_logger.save_dir) / "checkpoints")
+        checkpoint_callback = nl_callbacks.ModelCheckpoint(
+            dirpath=checkpoint_path,
+            save_last=save_last_checkpoint,
+            monitor=metric_to_monitor_for_checkpoints,  # "val_loss",
+            save_top_k=save_top_k,
+            every_n_train_steps=val_check_interval,
+            always_save_context=True,  # Enables the .nemo file-like checkpointing where all IOMixins are under SerDe
+            filename="checkpoint-{step}-{consumed_samples}",  # Including step and consumed_samples in the checkpoint filename prevents duplicate filenames and bugs related to this.
+            save_weights_only=False,
+            save_optim_on_train_end=True,
+        )
+        callbacks.append(checkpoint_callback)
+        auto_resume = resume.AutoResume(
+            resume_from_directory=checkpoint_path,
+            resume_if_exists=resume_if_exists,  # Looks for the -last checkpoint to continue training.
+            resume_ignore_no_checkpoint=True,  # When false this will throw an error with no existing checkpoint.
+            resume_past_end=False,
+        )
+    else:
+        auto_resume = None
+
+    if create_tflops_callback:
+        # Add callback that logs the tera-FLOPS per second per GPU during training.
+        data_module.global_batch_size = global_batch_size
+        flop_meas_callback = FLOPsMeasurementCallback(
+            config,
+            data_module,
+            "bert",
+        )
+        callbacks.append(flop_meas_callback)

     trainer = nl.Trainer(
         devices=num_gpus,
-        max_steps=num_steps,
+        max_steps=num_steps if early_stop_on_step is None else early_stop_on_step,
         max_epochs=max_epochs,
         accelerator="gpu",
         strategy=strategy,
@@ -445,21 +493,17 @@ def train_model(
             grad_reduce_in_fp32=grad_reduce_in_fp32,
             autocast_enabled=False,
         ),
-        enable_checkpointing=True,
+        enable_checkpointing=create_checkpoint_callback,
     )
     llm.train(
         model=module,
         data=data_module,
         trainer=trainer,
         log=nemo_logger,
-        resume=resume.AutoResume(
-            resume_from_directory=checkpoint_path,
-            resume_if_exists=resume_if_exists,  # Looks for the -last checkpoint to continue training.
-            resume_ignore_no_checkpoint=True,  # When false this will throw an error with no existing checkpoint.
-        ),
+        resume=auto_resume,
     )

-    ckpt_path = Path(checkpoint_callback.last_model_path.replace(".ckpt", ""))
+    ckpt_path = Path(checkpoint_callback.last_model_path.replace(".ckpt", "")) if create_checkpoint_callback else None
     return ckpt_path, metric_tracker, trainer
```