diff --git a/bionemo-recipes/README.md b/bionemo-recipes/README.md
index cbfe065d2d..c88ed5050b 100644
--- a/bionemo-recipes/README.md
+++ b/bionemo-recipes/README.md
@@ -1,15 +1,15 @@
# BioNeMo Recipes
-BioNeMo Recipes provides an easy path for the biological foundation model training community to scale up transformer-based models efficiently. Rather than offering a batteries-included training framework, we provide **model checkpoints** with TransformerEngine (TE) layers and **training recipes** that demonstrate how to achieve maximum throughput with popular open-source frameworks and fully sharded data parallel (FSDP) scale-out.
+BioNeMo Recipes provides an easy path for the biological foundation model training community to scale up transformer-based models efficiently. Rather than offering a batteries-included training framework, BioNeMo Recipes provides **model checkpoints** with TransformerEngine (TE) layers and **training recipes** that demonstrate how to achieve maximum throughput with popular open-source frameworks and fully sharded data parallel (FSDP) scale-out.
## Overview
-The biological AI community is actively prototyping model architectures and needs tooling that prioritizes extensibility, interoperability, and ease-of-use alongside performance. BioNeMo Recipes addresses this by offering:
+The biological AI community actively prototypes model architectures and needs tooling that prioritizes extensibility, interoperability, and ease-of-use, alongside performance. BioNeMo Recipes addresses this by offering:
-- **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
+- **Flexible scaling**: Scales from single-GPU prototyping to multi-node training without complex parallelism configurations
- **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
- **Performance optimization**: Leverages TransformerEngine and megatron-FSDP for state-of-the-art training efficiency
-- **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments
+- **Research-friendly**: Contains hackable and readable code that researchers can easily adapt for their experiments
### Performance Benchmarks
@@ -21,6 +21,8 @@ The biological AI community is actively prototyping model architectures and need
### Use Cases
+The use cases of BioNeMo Recipes include:
+
- **Foundation Model Developers**: AI researchers and ML engineers developing novel biological foundation models who need to scale up prototypes efficiently
- **Foundation Model Customizers**: Domain scientists looking to fine-tune existing models with proprietary data for drug discovery and biological research
@@ -48,9 +50,9 @@ Abbreviations:
- BF16: [brain-float 16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), a common 16 bit float format for deep learning.
- FP8[1]: [8-bit floating point](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html), a compact format for weights allowing for faster training and inference.
- MXFP8[2]: [Multi Scale 8-bit floating point](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html), as compact as FP8 but with better numerical precision.
-- NVFP4[2]: [NVIDIA 4-bit floating point](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Beyond-FP8---training-with-NVFP4), faster than FP8, retaining accuracy via multi-scale.
-- THD: **T**otal **H**eads **D**imension, also known as ["sequence packing"](https://docs.nvidia.com/nemo-framework/user-guide/24.07/nemotoolkit/features/optimizations/sequence_packing.html#sequence-packing-for-sft-peft). A way to construct a batch with sequences of different length so there are no pads, therefore no compute is wasted on computing attention for padding tokens. This is in contrast to **B**atch **S**equence **H**ead **D**imension (BSHD) format, which uses pads to create a rectangular batch.
-- CP: Context parallel, also known as sequence parallel. A way to distribute the memory required to process long sequences across multiple GPUs. For more information please see [context parallel](./recipes/context_parallel.md)
+- NVFP4[2]: [NVIDIA 4-bit floating point](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Beyond-FP8---training-with-NVFP4), faster than FP8, retaining accuracy using multi-scale.
+- THD: **T**otal **H**eads **D**imension, also known as ["sequence packing"](https://docs.nvidia.com/nemo-framework/user-guide/24.07/nemotoolkit/features/optimizations/sequence_packing.html#sequence-packing-for-sft-peft). A way to construct a batch with sequences of different lengths so that there are no pads and no compute is wasted on attention over padding tokens (an illustrative packing sketch follows the footnotes below). This is in contrast to the **B**atch **S**equence **H**ead **D**imension (BSHD) format, which uses pads to create a rectangular batch.
+- CP: Context parallel, also known as sequence parallel. A way to distribute the memory required to process long sequences across multiple GPUs. For more information, refer to [context parallel](./recipes/context_parallel.md).
\[1\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 9.0 and above (Hopper+)
\[2\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 10.0 and 10.3 (Blackwell), 12.0 support pending
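+
+For illustration, a packed (THD) batch can be thought of as concatenated sequences plus cumulative length offsets (a minimal sketch, not tied to any specific recipe's collator):
+
+```python
+import torch
+
+# Two sequences of different lengths packed into one batch with no padding tokens.
+seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9, 10, 11, 12])]
+input_ids = torch.cat(seqs)                               # shape: (total_tokens,) = (8,)
+cu_seqlens = torch.tensor([0, 3, 8], dtype=torch.int32)   # cumulative sequence boundaries
+```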
@@ -63,7 +65,7 @@ This repository contains two types of components:
Huggingface-compatible `PreTrainedModel` classes that use TransformerEngine layers internally. These are designed to be:
-- **Distributed via Hugging Face Hub**: Pre-converted checkpoints available at [huggingface.co/nvidia](https://huggingface.co/nvidia)
+- **Distributed through Hugging Face Hub**: Pre-converted checkpoints available at [huggingface.co/nvidia](https://huggingface.co/nvidia)
- **Drop-in replacements**: Compatible with `AutoModel.from_pretrained()` without additional dependencies
- **Performance optimized**: Leverage TransformerEngine features like FP8 training and context parallelism
@@ -82,7 +84,11 @@ Recipes are **not pip-installable packages** but serve as reference implementati
## Quick Start
-### Using Models
+This section describes how you can get started with BioNeMo Recipes.
+
+### Loading Models
+
+Run the following to load a BioNeMo model from the Hugging Face Hub.
```python
from transformers import AutoModel, AutoTokenizer
@@ -94,6 +100,8 @@ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
### Running Recipes
+Build and run recipes with the following commands.
+
```bash
# Navigate to a recipe
cd recipes/esm2_native_te_mfsdp
@@ -103,13 +111,9 @@ docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```
-______________________________________________________________________
+## Setting Up the Development Environment
-## Developer Guide
-
-### Setting Up Development Environment
-
-1. **Install pre-commit hooks:**
+1. Install pre-commit hooks:
```bash
pre-commit install
@@ -130,9 +134,9 @@ ______________________________________________________________________
docker run --rm -it --gpus all my_tag pytest -v .
```
-### Coding Guidelines
+## Coding Guidelines
-We prioritize **readability and simplicity** over comprehensive feature coverage:
+BioNeMo Recipes prioritizes **readability and simplicity** over comprehensive feature coverage:
- **KISS (Keep It Simple) over DRY (Don't Repeat Yourself)**: It's better to have clear, duplicated code than complex
abstractions
@@ -141,7 +145,7 @@ We prioritize **readability and simplicity** over comprehensive feature coverage
### Testing Strategy
-We use a three-tier testing approach:
+BioNeMo Recipes uses a three-tier testing approach:
#### L0 Tests (Pre-merge)
@@ -166,9 +170,11 @@ We use a three-tier testing approach:
### Adding New Components
+With BioNeMo Recipes, you can add new components, including models and recipes.
+
#### Adding a New Model
-Models should be pip-installable packages that can export checkpoints to Hugging Face. See the
+Models should be pip-installable packages that can export checkpoints to Hugging Face. Refer to the
[models README](models/README.md) for detailed guidelines on:
- Package structure and conventions
@@ -178,7 +184,7 @@ Models should be pip-installable packages that can export checkpoints to Hugging
#### Adding a New Recipe
-Recipes should be self-contained Docker environments demonstrating specific training patterns. See
+Recipes should be self-contained Docker environments demonstrating specific training patterns. Refer to
the [recipes README](recipes/README.md) for guidance on:
- Directory structure and naming
@@ -209,14 +215,14 @@ We aim to provide the fastest available training implementations for biological
## Contributing
-We welcome contributions that advance the state of biological foundation model training. Please ensure your contributions:
+We welcome contributions that advance the state of biological foundation model training. Ensure your contributions:
-1. Follow our coding guidelines emphasizing clarity
-2. Include appropriate tests (L0 minimum, L1/L2 as applicable)
-3. Provide clear documentation and examples
-4. Maintain compatibility with our supported frameworks
+- Follow our coding guidelines emphasizing clarity
+- Include appropriate tests (L0 minimum, L1/L2 as applicable)
+- Provide clear documentation and examples
+- Maintain compatibility with our supported frameworks
-For detailed contribution guidelines, see our individual component READMEs:
+For detailed contribution guidelines, refer to our individual component READMEs:
- [Models Development Guide](models/README.md)
- [Recipes Development Guide](recipes/README.md)
diff --git a/bionemo-recipes/models/README.md b/bionemo-recipes/models/README.md
index 50aad66870..ee20a10782 100644
--- a/bionemo-recipes/models/README.md
+++ b/bionemo-recipes/models/README.md
@@ -1,14 +1,14 @@
# Models Directory
-This directory contains HuggingFace-compatible model implementations that use TransformerEngine layers internally. These models are designed to be distributed via the Hugging Face Hub and serve as drop-in replacements for standard transformer models with enhanced performance.
+This directory contains HuggingFace-compatible model implementations that use TransformerEngine layers internally. These models are designed to be distributed through the Hugging Face Hub and serve as drop-in replacements for standard transformer models with enhanced performance.
## Overview
Models in this directory are **not intended to be pip-installed directly**. Instead, they serve as:
-1. **Reference implementations** of biological foundation models using TransformerEngine
-2. **Conversion utilities** for transforming existing model checkpoints to TE-compatible format
-3. **Export tools** for preparing model releases on the Hugging Face Hub
+- **Reference implementations** of biological foundation models using TransformerEngine
+- **Conversion utilities** for transforming existing model checkpoints to TE-compatible format
+- **Export tools** for preparing model releases on the Hugging Face Hub
Users will typically interact with these models by loading pre-converted checkpoints directly from the Hugging Face Hub using standard transformers APIs.
@@ -33,7 +33,7 @@ To add a new model to this directory, you must provide:
#### 3. Checkpoint Export Script
- **`export.py`**: Script that packages all necessary files for Hugging Face Hub upload
-- **Complete asset bundling**: Must include all required files (see [Export Requirements](#export-requirements))
+- **Complete asset bundling**: Must include all required files (refer to [Export Requirements](#export-requirements))
- **Automated process**: Should be runnable with minimal manual intervention
#### 4. Open Source License
diff --git a/bionemo-recipes/models/amplify/README.md b/bionemo-recipes/models/amplify/README.md
index 0d7f74260b..a1f7e8fe70 100644
--- a/bionemo-recipes/models/amplify/README.md
+++ b/bionemo-recipes/models/amplify/README.md
@@ -1,9 +1,8 @@
# AMPLIFY Optimized with NVIDIA TransformerEngine
This folder contains source code and tests for an AMPLIFY model that inherits from the transformers `PreTrainedModel`
-class and uses TransformerEngine layers. Users don't need to install this package directly, but can load the
-model directly from HuggingFace Hub using the standard transformers API (see [Inference Examples](#inference-examples)
-below).
+class and uses TransformerEngine layers. Users do not need to install this package directly, but can load the
+model directly from HuggingFace Hub using the standard transformers API. For more information, refer to [Inference Examples](#inference-examples).
## Feature support
@@ -18,7 +17,7 @@ The AMPLIFY implementation natively supports the following TransformerEngine-pro
| **Import from HuggingFace checkpoints** | ✅ Supported |
| **Export to HuggingFace checkpoints** | 🚧 Under development |
-See [BioNeMo Recipes](../../recipes/README.md) for more details on how to use these features to accelerate model
+Refer to [BioNeMo Recipes](../../recipes/README.md) for more details on how to use these features to accelerate model
training and inference.
## Links to HF checkpoints
@@ -34,7 +33,7 @@ Pre-trained AMPLIFY models are available on HuggingFace as part of the NVIDIA
## Runtime Requirements
We recommend using the latest [NVIDIA PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
-for optimal performance and compatibility. See the provided Dockerfile for details.
+for optimal performance and compatibility. Refer to the provided Dockerfile for details.
## Inference Examples
@@ -61,7 +60,7 @@ output = model(**inputs)
## Recipe Links
Training recipes are available in the `bionemo-recipes/recipes/` directory. AMPLIFY can be trained using the same
-recipes as ESM-2, simply by switching the model_tag to reference the AMPLIFY model, e.g. `nvidia/AMPLIFY_120M`, and
+recipes as ESM-2, simply by switching the model_tag to reference the AMPLIFY model, such as `nvidia/AMPLIFY_120M`, and
changing the dataset as appropriate.
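+
+For example, a minimal sketch using the `model_tag` override from the esm2_native_te recipe (dataset settings may also need to be adjusted for your data):
+
+```bash
+# Run from within the esm2_native_te recipe directory/container.
+python train_fsdp2.py --config-name L0_sanity model_tag=nvidia/AMPLIFY_120M
+```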
- **[esm2_native_te](../../recipes/esm2_native_te/)** - Demonstrates training with a simple native PyTorch training
@@ -118,3 +117,5 @@ Or, upload all models at once with:
```bash
for dir in *; do huggingface-cli upload nvidia/$(basename "$dir") "$dir/"; done
```
diff --git a/bionemo-recipes/models/esm2/README.md b/bionemo-recipes/models/esm2/README.md
index f0f176d173..be2daae3a4 100644
--- a/bionemo-recipes/models/esm2/README.md
+++ b/bionemo-recipes/models/esm2/README.md
@@ -2,8 +2,7 @@
This folder contains source code and tests for an ESM-2 model that inherits from the transformers `PreTrainedModel`
class and uses TransformerEngine layers. Users don't need to install this package directly, but can load the
-model directly from HuggingFace Hub using the standard transformers API (see [Inference Examples](#inference-examples)
-below).
+model directly from HuggingFace Hub using the standard transformers API. For more information, refer to [Inference Examples](#inference-examples).
## Feature support
@@ -18,7 +17,7 @@ The ESM-2 implementation natively supports the following TransformerEngine-provi
| **Import from HuggingFace checkpoints** | ✅ Supported |
| **Export to HuggingFace checkpoints** | ✅ Supported |
-See [BioNemo Recipes](../../recipes/README.md) for more details on how to use these features to accelerate model
+Refer to [BioNemo Recipes](../../recipes/README.md) for more details on how to use these features to accelerate model
training and inference.
## Links to HF checkpoints
@@ -38,7 +37,7 @@ Pre-trained ESM-2 models converted from the original Facebook weights are availa
## Runtime Requirements
We recommend using the latest [NVIDIA PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
-for optimal performance and compatibility. See the provided Dockerfile for details.
+for optimal performance and compatibility. Refer to the provided Dockerfile for details.
## Inference Examples
@@ -101,7 +100,7 @@ hf_model = convert_esm_te_to_hf(te_model)
hf_model.save_pretrained("/path/to/hf_checkpoint")
```
-Load and Test the Exported Model
+### Loading and Testing the Exported Model
Load the exported model and perform validation:
@@ -114,8 +113,8 @@ tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
### Validating Converted Models
-See the commands in [Inference Examples](#inference-examples) above to load and test both the original and converted
-models to ensure loss and logit values are similar. See also the golden value tests in
+To validate a converted model, use the commands in [Inference Examples](#inference-examples) above to load both the original and converted
+models and confirm that the loss and logit values are similar. Additionally, refer to the golden value tests in
[test_modeling_esm_te.py](tests/test_modeling_esm_te.py) and [test_convert.py](tests/test_convert.py).
## Developer Guide
@@ -153,7 +152,7 @@ Now deploy the converted checkpoints to the HuggingFace Hub by running the follo
huggingface-cli upload nvidia/${MODEL_NAME} $PWD/checkpoint_export/${MODEL_NAME}
```
-Or, upload all models at once with:
+You can also upload all models at once with:
```bash
cd checkpoint_export
diff --git a/bionemo-recipes/recipes/README.md b/bionemo-recipes/recipes/README.md
index 459ab0418d..a1b94238ba 100644
--- a/bionemo-recipes/recipes/README.md
+++ b/bionemo-recipes/recipes/README.md
@@ -80,7 +80,7 @@ recipes/{recipe_name}/
## Implementation Requirements
-### 1. Self-Contained Docker Environment
+### Self-Contained Docker Environment
Your `Dockerfile` should create a complete, reproducible training environment:
@@ -107,7 +107,7 @@ CMD ["/bin/bash"]
- Include everything needed to run training without external dependencies
- Optimize for Docker layer caching
-### 2. Readable Training Scripts
+### Readable Training Scripts
Your `train.py` should be educational and self-explanatory:
@@ -166,7 +166,7 @@ if __name__ == "__main__":
- **Error handling** for common failure modes
- **Progress logging** so users understand training status
-### 3. Hydra Configuration Management
+### Hydra Configuration Management
Use Hydra for clean, hierarchical configuration management:
@@ -252,7 +252,9 @@ model = MyModel(**config.model_kwargs)
optimizer = AdamW(**config.optimizer_kwargs)
```
-### 4. Comprehensive Testing
+### Comprehensive Testing
+
+Ensure the following tests are included when implementing a recipe.
#### L0 Tests - Fast CI/CD Validation
@@ -317,11 +319,11 @@ def test_accelerate_launch(accelerate_config, tmp_path):
#### L1 Benchmark Tests
-L1 tests should be specified via a `L1_{model_name}_{test_type}.yaml` config file. These should be
-workflows that will be launched via a common SLURM or Lepton batch script, complete in under 4
+L1 tests should be specified using a `L1_{model_name}_{test_type}.yaml` config file. These should be
+workflows that will be launched using a common SLURM or Lepton batch script, complete in under 4
hours, and have a clear set of performance metrics to validate.
-### 5. Cluster-Agnostic SLURM Script
+### Cluster-Agnostic SLURM Script
Provide a reference SLURM script that works across different cluster configurations:
@@ -368,15 +370,15 @@ srun \
Each recipe must include a detailed README.md covering:
-1. **What it demonstrates**: Clear statement of the training techniques shown
-2. **Hardware requirements**: Minimum and recommended GPU configurations
-3. **Performance expectations**: Benchmark results on reference hardware
-4. **Configuration options**: How to modify the recipe for different use cases
-5. **Troubleshooting**: Common issues and solutions
+- **What it demonstrates**: Clear statement of the training techniques shown
+- **Hardware requirements**: Minimum and recommended GPU configurations
+- **Performance expectations**: Benchmark results on reference hardware
+- **Configuration options**: How to modify the recipe for different use cases
+- **Troubleshooting**: Common issues and solutions
### Performance Benchmarking
-Document performance metrics for your recipe, e.g.
+Document performance metrics for your recipe, for example:
```markdown
## Performance Benchmarks
@@ -417,10 +419,10 @@ For reference implementations, examine existing recipes:
### What Makes a Great Recipe
-1. **Educational value**: Users learn something new about scaling biological models
-2. **Production relevance**: Techniques are applicable to real research workflows
-3. **Performance validation**: Benchmarked results demonstrate clear benefits
-4. **Adaptation friendly**: Users can easily modify for their specific needs
+- **Educational value**: Users learn something new about scaling biological models
+- **Production relevance**: Techniques are applicable to real research workflows
+- **Performance validation**: Benchmarked results demonstrate clear benefits
+- **Adaptation friendly**: Users can easily modify for their specific needs
### Common Pitfalls to Avoid
diff --git a/bionemo-recipes/recipes/codonfm_ptl_te/README.md b/bionemo-recipes/recipes/codonfm_ptl_te/README.md
index f76b4e31d1..5ef707829f 100644
--- a/bionemo-recipes/recipes/codonfm_ptl_te/README.md
+++ b/bionemo-recipes/recipes/codonfm_ptl_te/README.md
@@ -1,4 +1,4 @@
-> Disclaimer: This is an isolated model recipe based on PyTorch Lightning which requires its own dockerized environment -- in the local folder - to be run successfully.
+> Disclaimer: This is an isolated model recipe based on PyTorch Lightning, which requires its own dockerized environment (provided in the local folder) to run successfully.
# Codon FM: Foundation Models for Codon Sequences
@@ -9,33 +9,9 @@ We release the entire codebase, pre-training/finetuning scripts, evaluation jupy
## Origin
This recipe offers [NVIDIA Transformer Engine (TE)](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html) accelerated code for training and inference in addition to the original PyTorch workflow. Hence, the folder structure and most of the code is copied from the original PyTorch based research repository
-https://github.com/NVIDIA-Digital-Bio/CodonFM based on the paper [https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf](https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf). We also provide a checkpoint conversion script between PyTorch and TransformerEngine architecture.
-
-## Table of Contents
-
-- [Codon FM: Foundation Models for Codon Sequences](#codon-fm-foundation-models-for-codon-sequences)
- - [Origin](#origin)
- - [Table of Contents](#table-of-contents)
- - [Pre-trained Models](#pre-trained-models)
- - [Repository Structure](#repository-structure)
- - [NVIDIA TransformerEngine Optimization Benchmarks](#nvidia-transformerengine-optimization-benchmarks)
- - [Quickstart](#quickstart)
- - [1. Clone the repository](#1-clone-the-repository)
- - [2. Docker Setup](#2-docker-setup)
- - [Evaluation Notebooks](#evaluation-notebooks)
- - [Data Preparation](#data-preparation)
- - [Pre-training Dataset](#pre-training-dataset)
- - [Evaluation Datasets](#evaluation-datasets)
- - [Running Training/Finetuning/Evaluation](#running-trainingfinetuningevaluation)
- - [Pre-training](#pre-training)
- - [Fine-tuning](#fine-tuning)
- - [Evaluation](#evaluation)
- - [Checkpoint conversion between PyTorch and TE](#checkpoint-conversion-between-pytorch-and-te)
- - [Using Weights and Biases with CodonFM](#using-weights-and-biases-with-codonfm)
- - [Experiment launch scripts](#experiment-launch-scripts)
- - [License](#license)
-
-## Pre-trained Models
+https://github.com/NVIDIA-Digital-Bio/CodonFM, based on the paper [https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf](https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf). We also provide a checkpoint conversion script between PyTorch and TransformerEngine architecture.
+
+## Pre-Trained Models
The table below summarizes the set of open source pre-trained weights currently made available. All of the training scripts are contained in the directory `experiment_scripts/pretraining/encodon_filtered/`.
@@ -84,7 +60,11 @@ Several Encodon model versions are benchmarked: The first is the original [resea
The SPDA and TransformerEngine implementations are available in this codebase:
1. The default is the PyTorch native transformer based model with SDPA attention implementation.
-2. Transformer Engine (TE) acceleration which is enabled with `--use_transformer_engine` in `runner.py`. This can also be seen below in our sample commands. Moreover, if you would like to increase training performance please enable THD sequence packing, use `--attn_input_format=thd`, and `--collate_fn=thd`. For more information on sequence packing see here [link](https://huggingface.co/blog/sirluk/llm-sequence-packing). The custom TE-based model definition is located here `src/models/components/encodon_te_layer.py` and encapsulated within the `TETransformerLayer`. There are two "flavors" of TE Encodon models available: (1) "Exact", which is an exact reproduction of the original research code architecture, and a (2) "Non-Exact" variant, which uses a different implementation of a transformer that is native to the TE library (differing in LayerNorms), ang gives similar scientific accuracy but with a simpler and fewer lines-of-code implementation of the model. The default and recommended version is the "exact" version, which is the default and can be toggled using the environment variable `CODON_FM_TE_IMPL=exact`.
+2. Transformer Engine (TE) acceleration, which is enabled with `--use_transformer_engine` in `runner.py`. This can also be seen below in our sample commands. Moreover, if you would like to increase training performance, enable THD sequence packing by using `--attn_input_format=thd` and `--collate_fn=thd`. For more information on sequence packing, refer to this [link](https://huggingface.co/blog/sirluk/llm-sequence-packing). The custom TE-based model definition is located in `src/models/components/encodon_te_layer.py` and encapsulated within the `TETransformerLayer`. There are two "flavors" of TE Encodon models available:
+
+- **Exact**: An exact reproduction of the original research code architecture
+- **Non-Exact**: A variant that uses a transformer implementation native to the TE library (differing in LayerNorms), which gives similar scientific accuracy with a simpler implementation that requires fewer lines of code.
+  The "exact" version is the default and recommended version, and can be selected using the environment variable `CODON_FM_TE_IMPL=exact`.
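+
+A minimal command sketch of enabling TE acceleration with THD sequence packing (other required pre-training arguments are omitted; refer to the sample commands in [Pre-training](#pre-training)):
+
+```bash
+# Select the "exact" TE model variant (already the default) and enable TE with sequence packing.
+export CODON_FM_TE_IMPL=exact
+python -m src.runner pretrain \
+    --use_transformer_engine \
+    --attn_input_format=thd \
+    --collate_fn=thd
+```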
Advanced: "Non-exact" TE Implementation (Optional)
@@ -99,7 +79,7 @@ The training step speedups for the 80M Encodon model when both Transformer Engin

-For inference we can also demonstrate acceleration when using each models TE counterpart. Thus, a 1.4X speedup in this chart shows how much faster the TE version of the model is over the original baseline PyTorch SDPA model.
+For inference, we can also demonstrate acceleration when using each model's TE counterpart. Thus, a 1.4X speedup in this chart shows how much faster the TE version of the model is than the original baseline PyTorch SDPA model.

## Quickstart
@@ -138,11 +118,11 @@ bash run_dev.sh --data-dir /path/to/your/data --checkpoints-dir /path/to/your/ch
You will be dropped into a `bash` shell inside the container as a non-root user.
-You can also use VSCode `./.devcontainer`. Make sure to mount your data and checkpoints by editing `./devcontainer/devcontainer.json`.
+You can also use the VSCode devcontainer in `./.devcontainer`. Ensure you mount your data and checkpoints by editing `./devcontainer/devcontainer.json`.
#### Evaluation Notebooks
-A series of notebooks are provided in the [notebooks](notebooks) directory show casing multiple use cases such as zero-shot variant prediction and finetuning on downstream tasks. See a brief overview below:
+A series of notebooks is provided in the [notebooks](notebooks) directory showcasing multiple use cases such as zero-shot variant prediction and finetuning on downstream tasks. The following is a brief overview:
| Notebook | Description |
| ------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -159,7 +139,7 @@ A series of notebooks are provided in the [notebooks](notebooks) directory show
#### Pre-training Dataset
-In order to create the data required for pretraining, please follow the guidance outlined in [data_scripts/data_curation/README](./data_scripts/data_curation/README)
+In order to create the data required for pretraining, follow the guidance outlined in [data_scripts/data_curation/README](./data_scripts/data_curation/README).
#### Evaluation Datasets
@@ -169,7 +149,7 @@ In order to create the data required for pretraining, please follow the guidance
- Open and run the notebook `notebooks/4-EnCodon-Downstream-Task-riboNN.ipynb`. It will download/prepare the downstream dataset and guide you through finetuning on this downstream task.
- Synonymous, DDD/ASD, and Cancer Hotspot variant datasets:
- Follow `notebooks/00-Mutation-Datasets-Preprocessing.ipynb`. This notebook includes a cell that lists the required input files (with expected names/locations) and outlines how to process them into harmonized formats.
- - After preprocessing, use the task-specific notebooks in `notebooks/` (e.g., `0-...CancerHotspot.ipynb` and `1-...DDD-ASD.ipynb`) which consume the harmonized outputs produced by the preprocessing notebook.
+  - After preprocessing, use the task-specific notebooks in `notebooks/` (for example, `0-...CancerHotspot.ipynb` and `1-...DDD-ASD.ipynb`), which consume the harmonized outputs produced by the preprocessing notebook.
### Running Training/Finetuning/Evaluation
@@ -177,9 +157,12 @@ The main entry point is `src/runner.py` which supports three modes:
#### Pre-training
-The explicit scripts used to train the released checkpoints are referenced in [Pre-trained Models](#pre-trained-models). Note: if `--use_transformer_engine` is added TransformerEngine will be used, otherwise it will default to PyTorchs Scaled Dot Product Attention (SDPA).
+The explicit scripts used to train the released checkpoints are referenced in [Pre-trained Models](#pre-trained-models).
-Note. For some hardware devices, there may be issues with Transformer Engine's fused attention kernel and sequence packing (THD). To disable this kernel, please use `export NVTE_FUSED_ATTN=0`.
+```{note}
+- If `--use_transformer_engine` is added, TransformerEngine will be used; otherwise, training defaults to PyTorch's Scaled Dot Product Attention (SDPA).
+- For some hardware devices, there may be issues with Transformer Engine's fused attention kernel and sequence packing (THD). To disable this kernel, use `export NVTE_FUSED_ATTN=0`.
+```
```bash
python -m src.runner pretrain \
@@ -205,7 +188,7 @@ Optional path overrides:
--pretrained_ckpt_path
```
-For multi-node execution please consider using `torchrun`.
+For multi-node execution consider using `torchrun`.
```bash
export NUM_GPUS=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l)
@@ -239,9 +222,9 @@ torchrun \
**Available `--dataset_name` options:**
-- `CodonMemmapDataset`: dataset to support memory-mapped pre-training dataset used for pre-training
-- `MutationDataset`: dataset for mutation prediction
-- `CodonBertDataset`: dataset to ingest codon sequences.
+- `CodonMemmapDataset`: Memory-mapped dataset used for pre-training
+- `MutationDataset`: Dataset for mutation prediction
+- `CodonBertDataset`: Dataset to ingest codon sequences.
#### Fine-tuning
@@ -249,7 +232,7 @@ The publicly available checkpoints can be finetuned using the finetuning options
**Available finetuning options:**
-See example script at `experiment_scripts/pretraining/encodon_filtered/finetuning/`
+Refer to the example script in `experiment_scripts/pretraining/encodon_filtered/finetuning/`.
- `lora`: Fine-tunes low-rank adapters within a pretrained model added to each transformer layer to reduce training cost and memory usage.
- `head_only_random`: Trains a randomly initialized output head while the remainder of the model is kept frozen.
@@ -278,7 +261,7 @@ The publicly available checkpoints can be used to launch scientific evaluation a
**Available tasks**
-- `mutation_prediction`: Scores a specified mutation via ref-vs-alt codon log-likelihood ratio.
+- `mutation_prediction`: Scores a specified mutation using the ref-vs-alt codon log-likelihood ratio.
- `masked_language_modeling`: Predicts masked codon tokens from surrounding sequence context.
- `fitness_prediction`: Estimates sequence fitness as the mean log-likelihood of the sequence as predicted by the model.
- `embedding_prediction`: Extracts encoder CLS embeddings for each input.
@@ -300,20 +283,20 @@ python -m src.runner eval \
### Checkpoint conversion between PyTorch and TE
-[codonfm_ckpt_te_conversion.py](codonfm_ckpt_te_conversion.py) will convert PyTorch-native Encodon checkpoint TE and back, see [Pre-trained Models](#pre-trained-models).
+[codonfm_ckpt_te_conversion.py](codonfm_ckpt_te_conversion.py) converts a PyTorch-native Encodon checkpoint to TE format and back; refer to [Pre-trained Models](#pre-trained-models).
-## Using Weights and Biases with CodonFM
+## Using Weights and Biases with CodonFM
CodonFM can log all training and validation metrics to [Weights & Biases (WandB)](https://wandb.ai/), which requires an account. To use alternative solutions other than WandB, you can change the logging destination in [encodon_pl.py::training_step](src/models/encodon_pl.py) and [encodon_te_pl.py::training_step](src/models/encodon_te_pl.py).
-To use WandB with CodonFM, set your Weights & Biases API key for logging inside the running container
+To use WandB with CodonFM, set your Weights & Biases API key for logging inside the running container.
```bash
# WANDB key (optional; only needed if enabling --enable_wandb)
export WANDB_API_KEY=your_wandb_api_key
```
-or add your login info to `~/.netrc`.
+Alternatively, add your login info to `~/.netrc`.
When launching runs, enable WandB logging by passing `--enable_wandb` and providing `--project_name` and `--entity`. If these are omitted, WandB logging will be skipped.
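+
+For example (a sketch only; the project and entity names are placeholders, and the remaining training arguments are omitted):
+
+```bash
+python -m src.runner pretrain \
+    --enable_wandb \
+    --project_name my_codonfm_project \
+    --entity my_wandb_team
+```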
@@ -326,4 +309,4 @@ Experiment launch scripts for reproducing pretraining and fine-tuning are under
## License
-See [LICENSE](LICENSE)
+Refer to [LICENSE](LICENSE).
diff --git a/bionemo-recipes/recipes/esm2_accelerate_te/README.md b/bionemo-recipes/recipes/esm2_accelerate_te/README.md
index 5ca59b4f0c..c3fc80a292 100644
--- a/bionemo-recipes/recipes/esm2_accelerate_te/README.md
+++ b/bionemo-recipes/recipes/esm2_accelerate_te/README.md
@@ -78,7 +78,7 @@ accelerate launch --config_file accelerate_config/fsdp2_te.yaml \
train.py --config-name=L0_sanity
```
-See [`slurm.sh`](slurm.sh) for an example SLURM script.
+Refer to [`slurm.sh`](slurm.sh) for an example SLURM script.
### FP8 Training
@@ -165,7 +165,7 @@ output = model(**inputs)
🚧 Under development
-## See Also
+## References
- [ESM-2 Training with Native PyTorch](../esm2_native_te/README.md)
- [Hugging Face Trainer Documentation](https://huggingface.co/docs/transformers/en/trainer)
@@ -173,7 +173,7 @@ output = model(**inputs)
## Developer Guide
-### Running tests
+### Running Tests
To run tests locally, run `recipes_local_test.py` from the repository root with the recipe directory as an argument.
@@ -181,9 +181,9 @@ To run tests locally, run `recipes_local_test.py` from the repository root with
./ci/scripts/recipes_local_test.py bionemo-recipes/recipes/esm2_accelerate_te/
```
-Tests should be kept relatively fast, using the smallest model and number of training steps required to validate the feature. Hardware requirements beyond those used in CI (e.g., a single L4) should be annotated with pytest.mark.requires, e.g. `requires_fp8` and `requires_multi_gpu`.
+Tests should be kept relatively fast, using the smallest model and number of training steps required to validate the feature. Hardware requirements beyond those used in CI (for example, a single L4) should be annotated with `pytest.mark.requires`, such as `requires_fp8` and `requires_multi_gpu`.
-### Development container
+### Development Container
To use the provided devcontainer, use "Dev Containers: Reopen in Container" from the VSCode menu, and choose the "BioNeMo Recipes Dev Container" option. To run the tests inside the container, run `pytest -v .` in the recipe directory.
@@ -191,7 +191,7 @@ To use the provided devcontainer, use "Dev Containers: Reopen in Container" from
[Hydra](https://hydra.cc/) is a powerful configuration management library for Python. This recipe uses Hydra to manage training configurations, allowing for easy modification of training hyper-parameters and model settings.
-Configuration parameters can be overridden from the command line, e.g.:
+Configuration parameters can be overridden from the command line, for example:
```bash
accelerate launch train.py --config-name L0_sanity fp8_config.enabled=true trainer.learning_rate=2e-5
diff --git a/bionemo-recipes/recipes/esm2_native_te/README.md b/bionemo-recipes/recipes/esm2_native_te/README.md
index e3d74c31ac..5c4b5c3cec 100644
--- a/bionemo-recipes/recipes/esm2_native_te/README.md
+++ b/bionemo-recipes/recipes/esm2_native_te/README.md
@@ -42,8 +42,8 @@ To run the container, run:
docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bionemo esm2_native_te /bin/bash
```
-Alternatively, the dependencies can be installed manually in an environment with CUDA support. See
-[Dockerfile.cuda](Dockerfile.cuda) for the process of installing dependencies in a fresh python environment (for e.g.,
+Alternatively, the dependencies can be installed manually in an environment with CUDA support. Refer to
+[Dockerfile.cuda](Dockerfile.cuda) for the process of installing dependencies in a fresh python environment (for example,
CUDA 13.0):
```bash
@@ -89,18 +89,18 @@ To run single-process training on one GPU, run:
python train_ddp.py # or train_fsdp2.py / train_mfsdp.py
```
-To run multi-process training locally on 2+ GPUs, run (e.g. 2 GPUs):
+To run multi-process training locally on 2+ GPUs, run:
```bash
torchrun --nproc_per_node=2 train_fsdp2.py # or train_mfsdp.py / train_ddp.py
```
-Multi-Node training is supported with all three strategies, see [`slurm.sh`](slurm.sh) for an example SLURM script.
+Multi-node training is supported with all three strategies; refer to [`slurm.sh`](slurm.sh) for an example SLURM script.
### FP8 Training
To run training with FP8, enable it by overriding the `fp8_config.enabled=true` configuration parameter. Additional FP8
-configuration parameters, including switching to `MXFP8BlockScaling`, can be set via the hydra configuration.
+configuration parameters, including switching to `MXFP8BlockScaling`, can be set using the hydra configuration.
```bash
python train_fsdp2.py --config-name L0_sanity fp8_config.enabled=true
@@ -108,8 +108,8 @@ python train_fsdp2.py --config-name L0_sanity fp8_config.enabled=true
### Sequence Packing (THD input format)
-Sequence packing is handled via a padding-free collator (in `collator.py`) that provides input arguments (e.g.
-`cu_seq_lens_q`) needed for padding-free attention. To enable sequence packing, set `use_sequence_packing=true`
+Sequence packing is handled using a padding-free collator (in `collator.py`) that provides the input arguments (such as
+`cu_seq_lens_q`) needed for padding-free attention. To enable sequence packing, set `use_sequence_packing=true`
in the hydra configuration.
```bash
@@ -131,11 +131,11 @@ python train_fsdp2.py --config-name L0_sanity \
We provide a training script [train_ddp_cp](./esm2_native_te/train_ddp_cp.py) and a sample config [L0_sanity_cp](./hydra_config/L0_sanity_cp.yaml) that uses context parallelism.
-In the config the argument `--cp_size` allows the user to set the size of the context parallel distributed group. When paired with Distributed Data Parallelism (DDP), the number of context parallel groups will be determined by `world_size//cp_size`.
+In the config, the argument `--cp_size` allows the user to set the size of the context parallel distributed group. When paired with Distributed Data Parallelism (DDP), the number of context parallel groups will be determined by `world_size//cp_size`.
-Thus, for example, if a user has 8 processes and sets `cp_size=2` they will have `2` CP groups and `4` DDP groups. During dataloading we make no assumptions about the data pipeline being deterministic or not. DDP groups will provide unique data while CP groups will contain replicates of that data.
+Thus, if a user has 8 processes and sets `cp_size=2`, they will have `2` CP groups and `4` DDP groups. During dataloading, we make no assumptions about whether the data pipeline is deterministic. DDP groups will provide unique data while CP groups will contain replicates of that data.
-For example, let's say that we have 2 DDP groups and 2 CP groups. Each DDP group will have a unique dataloader DP0 for DDP group 0
+For example, suppose we have 2 DDP groups and 2 CP groups. Each DDP group will have a unique dataloader: DP0 for DDP group 0
and DP1 for DDP group 1. CP works by running something called ring attention, which expects tokens to live on each device in a particular layout. For this CP implementation we use something called [Dual Chunk Swapping](https://github.com/NVIDIA/TransformerEngine/blob/1df4a69f761672f633d40ea3605327087d1ea737/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py#L3714-L3770). If DP0 outputs sequence `1 2 3 4 5 6 7 8` and DP1 outputs `9 10 11 12 13 14 15 16` then when we run through the `CPAwareDataloader` defined in [datasets](./dataset.py), the dataloader will create CP shards from that DP group as follows:
```
@@ -144,9 +144,9 @@ and DP1 for DDP group 1. CP works by running something called ring attention, wh
CP1 | 3,4,5,6 | 11, 12, 13, 14|
```
-You may notice these shards and wonder why they are the way they are. We did. The reason is that CP groups are sharded using slices. The full input sequence (such as `1 2 3 4 5 6 7`) is sliced into `2 * cp_size` groups. Then CP0 takes the first and last slice, while CP1 takes the middle slices, of each sequence.
+You may notice these shards and wonder why they are the way they are. The reason is that CP groups are sharded using slices. The full input sequence (such as `1 2 3 4 5 6 7`) is sliced into `2 * cp_size` groups. Then CP0 takes the first and last slice, while CP1 takes the middle slices, of each sequence.
-In this example we only show one sequence but its important to note that slicing takes place on every sequence, so if a second sequence is also available, that will be sliced in the same manner. CP0 will take the first and last slice of every sequence, while CP1 will take the middle slices of each sequence.
+In this example, we only show one sequence, but it's important to note that slicing takes place on every sequence, so if a second sequence is also available, it will be sliced in the same manner. CP0 will take the first and last slice of every sequence, while CP1 will take the middle slices of each sequence.
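+
+The slicing pattern can be illustrated with a small standalone sketch (illustrative only; this is not the actual implementation in `dataset.py`):
+
+```python
+def dual_chunk_shard(sequence, cp_size):
+    """Split a sequence into 2 * cp_size chunks and give each CP rank one chunk from
+    the front and the mirrored chunk from the back (dual chunk swapping)."""
+    num_chunks = 2 * cp_size
+    chunk_len = len(sequence) // num_chunks
+    chunks = [sequence[i * chunk_len : (i + 1) * chunk_len] for i in range(num_chunks)]
+    # CP rank r receives chunk r and chunk (num_chunks - 1 - r).
+    return {r: chunks[r] + chunks[num_chunks - 1 - r] for r in range(cp_size)}
+
+
+print(dual_chunk_shard([1, 2, 3, 4, 5, 6, 7, 8], cp_size=2))
+# {0: [1, 2, 7, 8], 1: [3, 4, 5, 6]}
+```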
### Comparing Against the HF Transformers Reference Implementation
@@ -161,7 +161,7 @@ python train_fsdp2.py --config-name L0_sanity model_tag=facebook/esm2_t6_8M_UR50
An example pre-training dataset for ESM-2 is available in the
[`nvidia/esm2_uniref_pretraining_data`](https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data) Hugging
-Face dataset. This dataset can be [streamed](https://huggingface.co/docs/datasets/en/stream) from the Hugging Face Hub via
+Face dataset. This dataset can be [streamed](https://huggingface.co/docs/datasets/en/stream) from the Hugging Face Hub using the following code:
```python
>>> from datasets import load_dataset
@@ -172,7 +172,7 @@ Face dataset. This dataset can be [streamed](https://huggingface.co/docs/dataset
'ur90_id': 'UniRef90_UPI002FBE17D9'}
```
-For large-scale training, the dataset should be downloaded locally via the [huggingface
+For large-scale training, the dataset should be downloaded locally with the [huggingface
CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli), with appropriate values set for
`HF_HOME` and `HF_TOKEN` environment variables. Use `uv tool install huggingface_hub` to install the CLI if not already
installed.
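+
+A minimal sketch of the download commands (paths and token are placeholders):
+
+```bash
+export HF_HOME=/path/to/cache      # choose a location with enough disk space
+export HF_TOKEN=<your_hf_token>    # needed for authenticated downloads
+huggingface-cli download nvidia/esm2_uniref_pretraining_data --repo-type dataset
+```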
@@ -261,13 +261,13 @@ output = model(**inputs)
🚧 Under development
-## See Also
+## References
- [ESM-2 Training with Accelerate](../esm2_accelerate_te/README.md)
## Developer Guide
-### Running tests
+### Running Tests
To run tests locally, run `recipes_local_test.py` from the repository root with the recipe directory as an argument.
@@ -279,7 +279,7 @@ Tests should be kept relatively fast, using the smallest model and number of tra
feature. Hardware requirements beyond those used in CI (e.g., a single L4) should be annotated with
pytest.mark.requires, e.g. `requires_fp8` and `requires_multi_gpu`.
-### Development container
+### Development Container
To use the provided devcontainer, use "Dev Containers: Reopen in Container" from the VSCode menu, and choose the
"BioNeMo Recipes Dev Container" option. To run the tests inside the container, run `pytest -v .` in the recipe
@@ -290,7 +290,7 @@ directory.
[Hydra](https://hydra.cc/) is a powerful configuration management library for Python. This recipe uses Hydra to manage
training configurations, allowing for easy modification of training hyper-parameters and model settings.
-Configuration parameters can be overridden from the command line, e.g.
+Configuration parameters can be overridden from the command line. For example,
`python train_fsdp2.py --config-name L0_sanity fp8_config.enabled=true`.
For verbose logging, use the hydra command line override `hydra.verbose=true`, see
diff --git a/bionemo-recipes/recipes/esm2_peft_te/README.md b/bionemo-recipes/recipes/esm2_peft_te/README.md
index 98bc76396b..60435aff4a 100644
--- a/bionemo-recipes/recipes/esm2_peft_te/README.md
+++ b/bionemo-recipes/recipes/esm2_peft_te/README.md
@@ -3,5 +3,5 @@
This folder demonstrates how to fine-tune a TransformerEngine-accelerated ESM-2 model using PEFT.
Note: This recipe is a work in progress, and currently only demonstrates basic support for LoRA fine-tuning and
-TransformerEngine layers. See `bionemo-recipes/models/esm2/tests/test_peft.py` for additional information and known
+TransformerEngine layers. Refer to `bionemo-recipes/models/esm2/tests/test_peft.py` for additional information and known
limitations.
diff --git a/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/README.md b/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/README.md
index 5aa1ce8f44..09f035c84c 100644
--- a/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/README.md
+++ b/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/README.md
@@ -7,9 +7,9 @@ This file contains comprehensive documentation specifically designed for AI agen
# Geneformer Pretraining with mfsdp and a custom pytorch training loop.
-The code runs inside of a container. To construct this container please look at [container build](#container-build) and [container run](#container-run). In this folder we supply a pretraining script capable of training several variants of [Geneformer](https://huggingface.co/ctheodoris/Geneformer). Those variants are located in our [hydra_config](hydra_config/). This code was forked from the original geneformer repository, and enhanced to increase its performance.
+The code runs inside a container. To build this container, refer to [container build](#container-build) and [container run](#container-run). In this folder, we supply a pretraining script capable of training several variants of [Geneformer](https://huggingface.co/ctheodoris/Geneformer). Those variants are located in our [hydra_config](hydra_config/). This code was forked from the original Geneformer repository and enhanced to improve its performance.
-[Geneformer](https://www.nature.com/articles/s41586-023-06139-9) is a BERT-based transformer pretrained on single-cell transcriptomes. For more information, please see the nature paper [here](https://www.nature.com/articles/s41586-023-06139-9).
+[Geneformer](https://www.nature.com/articles/s41586-023-06139-9) is a BERT-based transformer pretrained on single-cell transcriptomes. For more information, refer to the Nature paper [here](https://www.nature.com/articles/s41586-023-06139-9).
## Training Commands
@@ -26,7 +26,7 @@ torchrun --nproc_per_node= train.py --config-name
torchrun --nproc_per_node=1 train.py
```
-> **Note:** The config name is the filename without `.yaml` extension (e.g., `4b` for `4b.yaml`).
+> **Note:** The config name is the filename without `.yaml` extension (for example, `4b` for `4b.yaml`).
### Advanced Configuration
@@ -107,7 +107,7 @@ data:
path: "/workspace/data/Genecorpus-30M/genecorpus_1M_samples.parquet" # Path to the training dataset file
```
-For detailed model-specific configuration files, see the [hydra_config/model](./hydra_config/model) directory. Some example configs have already been provided such as
+For detailed model-specific configuration files, refer to the [hydra_config/model](./hydra_config/model) directory, where some example configs are already provided.
You can find the full configuration for the 4B parameter model in [`hydra_config/model/4b.yaml`](./hydra_config/model/4b.yaml).
## Checkpoint Management
@@ -220,13 +220,13 @@ docker run -it --gpus all --network host --ipc=host \
### WandB
-We support full integration with weights and biases. To use this please supply the environment variable:
+We support full integration with Weights & Biases. To use this, set the following environment variable:
```
export WANDB_API_KEY=
```
-and supply the hydra config section `wandb_init_args` with your experiment name and project.
+Then, provide your experiment name and project in the `wandb_init_args` section of the hydra config.
### Dataset
diff --git a/bionemo-recipes/recipes/vit/README.md b/bionemo-recipes/recipes/vit/README.md
index 16211ad88c..523b2faaa6 100644
--- a/bionemo-recipes/recipes/vit/README.md
+++ b/bionemo-recipes/recipes/vit/README.md
@@ -38,7 +38,7 @@ To train a ViT using FSDP, execute the following command in your Docker containe
torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
```
-which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+This will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
[`train.py`](train.py) is the transparent entrypoint to this script that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully-shard your model across all devices.
@@ -88,7 +88,7 @@ def load_dcp_checkpoint(checkpoint_path, model=None, optimizer=None):
optimizer.load_state_dict(state_dict["optimizer"])
```
-which can be loaded directly into the `MegatronFSDP` model:
+This can be loaded directly into the `MegatronFSDP` model:
```python
# Create a MegatronFSDP model and optimizer.