BIO-48: Support for packed bshd seq dataloader and more advanced fp8 integration for llama3 #1418
base: main
Conversation
@dataclass
class SequencePackingIterableDataset(torch.utils.data.IterableDataset):
Could we just wrap the existing TokenPackingDataset to do this? Set max_tokens_per_batch to the desired sequence length and split_samples=True; then, before returning the sample, concatenate along the sequence dimension.
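A minimal sketch of this suggestion, assuming the repo's existing TokenPackingDataset accepts max_tokens_per_batch and split_samples and yields a list of variable-length samples per iteration (the wrapper name and the "input_ids" key are illustrative):

```python
import torch
from torch.utils.data import IterableDataset


class PackedBSHDWrapper(IterableDataset):  # hypothetical wrapper name
    """Wraps the existing TokenPackingDataset to emit fixed-length sequences."""

    def __init__(self, base_dataset, seq_length: int):
        # Reuse the existing packer: cap each "batch" at seq_length tokens and
        # allow samples to be split across batch boundaries.
        self.packed = TokenPackingDataset(  # assumed to exist in this repo
            base_dataset,
            max_tokens_per_batch=seq_length,
            split_samples=True,
        )

    def __iter__(self):
        for samples in self.packed:
            # Concatenate the packed samples along the sequence dimension so
            # each yielded item is a single fixed-length sequence.
            yield {"input_ids": torch.cat([s["input_ids"] for s in samples], dim=-1)}
```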
return ContextParallelDataLoaderWrapper(train_dataloader, cp_mesh), tokenized_dataset

def create_bshd_packed_dataloader(
Can this be an option in the existing create_bshd_dataset function? That might be simpler than repeating all of this.
I'm wondering, though, whether there's a better way to structure this dataset file so that it can be more modular.
| f"+wandb.dir={tmp_path}", | ||
| f"checkpoint.ckpt_dir={tmp_path}", | ||
| "fp8_config.enabled=true", | ||
| "+dataset.pad_to_multiple_of=16", |
well, if you have the fully-packed BSHD dataset, could we just use that here?
Quick question to check my understanding -- "Packed BSHD dataloader" is a feature that is only relevant for llama3/autoregressive models, is that right? I.e., such a dataloader for BERT would not be useful for training an embedding model?
It mirrors what Megatron-LM does when pre-training large models, which is just concatenating across sequence boundaries without really worrying about document starts/ends. Not sure whether BERT would be fine if you trained it like that, but I think the auto-regressive loss makes it slightly less problematic.
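For illustration only (not code from this PR), the Megatron-style behavior described above amounts to concatenating the token stream and slicing it into fixed-length chunks, letting attention flow across document boundaries:

```python
import torch


def pack_token_stream(token_ids: list[int], seq_length: int) -> torch.Tensor:
    """Concatenate tokens from many documents and reshape into (num_chunks, seq_length)."""
    n_full = len(token_ids) // seq_length
    flat = torch.tensor(token_ids[: n_full * seq_length], dtype=torch.long)
    # No cu_seqlens are tracked, so attention can cross document boundaries.
    return flat.view(n_full, seq_length)
```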
/ok to test f5bc6c3 |
- Add SequencePackingIterableDataset to collator.py for fixed-length BSHD samples
- Add create_bshd_packed_dataloader function to dataset.py
- Add pad_to_multiple_of parameter to create_bshd_dataloader for FP8 compatibility

Adds a config parameter fp8_first_last_bf16 that, when enabled, keeps the first and last transformer layers in bf16 while using FP8 for the middle layers. This can improve numerical stability during FP8 training.

- Add test_train_fsdp2_fp8_first_last_bf16 to test the fp8_first_last_bf16 config
- Add pad_sequences_to_be_divisible_by=16 to THD FP8 test for proper FP8 padding

- Update train_fsdp2.py and train_ddp.py to toggle between dataloaders:
  - use_sequence_packing=true + attn_input_format=bshd -> BSHD packed
  - use_sequence_packing=true + attn_input_format=thd -> THD packed
  - use_sequence_packing=false -> BSHD unpacked
- Add test_train_fsdp2_fp8_bshd_packed test for FP8 with BSHD packing

Signed-off-by: Savitha Srinivasan <[email protected]>
f5bc6c3 to 7e75003
Description
This PR adds FP8 support enhancements for llama3 training, building off #1416, including:

- **Packed BSHD dataloader** - A new `SequencePackingIterableDataset` and `create_bshd_packed_dataloader` that pack sequences by concatenating across boundaries in BSHD format. Unlike THD packing (which tracks boundaries with cu_seqlens), this yields fixed-length samples with no padding, allowing attention to flow across packed sequences. This has been used as a baseline against which to compare THD for FP8 experimentation.
- **BSHD packing toggle in training scripts** - Updated `train_fsdp2.py` and `train_ddp.py` to automatically select the dataloader based on config (a sketch of this selection logic follows the list below):
  - `use_sequence_packing=true` + `attn_input_format=bshd` → BSHD packed (cross-boundary attention, no cu_seqlens)
  - `use_sequence_packing=true` + `attn_input_format=thd` → THD packed (respects boundaries via cu_seqlens)
  - `use_sequence_packing=false` → BSHD unpacked (standard windowing)
- **LM head bf16 for FP8** - Wraps the lm_head forward pass with `fp8_autocast(enabled=False)` to keep it in bf16 for numerical stability during FP8 training. (This is currently also present in Add fp8 tests for llama3 #1416.)
- **Configurable first/last layer bf16** - Adds an `fp8_first_last_bf16` config option that keeps the first and last transformer layers in bf16 while using FP8 for the middle layers. This can improve numerical stability during FP8 training.
- **FP8 tests** - Adds training tests for FP8 in BSHD, THD, and BSHD packed modes, plus a test for the first/last bf16 feature.
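A minimal sketch of the selection logic described in the second bullet above; `create_bshd_packed_dataloader` and `create_bshd_dataloader` are named in this PR, while `create_thd_dataloader` is an assumed name for the existing THD path:

```python
def select_dataloader_factory(config):
    """Illustrative only: pick a dataloader factory from the packing/format config."""
    if config.use_sequence_packing:
        if config.config_kwargs.attn_input_format == "thd":
            # THD packed: document boundaries are preserved via cu_seqlens.
            return create_thd_dataloader  # assumed name for the existing THD path
        # BSHD packed: fixed-length samples, attention flows across boundaries.
        return create_bshd_packed_dataloader
    # Default: standard (unpacked) BSHD dataloader with windowing.
    return create_bshd_dataloader
```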
Usage
Enable FP8 training with first/last layers in bf16
from hydra import compose, initialize_config_dir

with initialize_config_dir(config_dir="hydra_config", version_base="1.2"):
    config = compose(
        config_name="L0_sanity",
        overrides=[
            "fp8_config.enabled=true",
            "+dataset.pad_to_multiple_of=16",  # Required for FP8
            "+config_kwargs.fp8_first_last_bf16=true",  # Keep first/last layers in bf16
        ],
    )
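As a rough sketch of what `fp8_first_last_bf16` implies (not necessarily this PR's actual implementation), the first and last transformer layers can run with FP8 disabled while the middle layers stay inside an FP8 autocast region; the function and argument names below are illustrative:

```python
import transformer_engine.pytorch as te


def run_decoder_layers(layers, hidden_states, fp8_enabled, fp8_first_last_bf16, fp8_recipe=None):
    """Illustrative layer loop: disable FP8 for the first and last layers when requested."""
    last_idx = len(layers) - 1
    for idx, layer in enumerate(layers):
        layer_uses_fp8 = fp8_enabled and not (fp8_first_last_bf16 and idx in (0, last_idx))
        with te.fp8_autocast(enabled=layer_uses_fp8, fp8_recipe=fp8_recipe):
            hidden_states = layer(hidden_states)
    return hidden_states
```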
Enable FP8 with BSHD packed dataloader via config
with initialize_config_dir(config_dir="hydra_config", version_base="1.2"):
    config = compose(
        config_name="L0_sanity",
        overrides=[
            "fp8_config.enabled=true",
            "use_sequence_packing=true",
            "config_kwargs.attn_input_format=bshd",  # Triggers BSHD packed dataloader
            "+dataset.pad_to_multiple_of=16",
        ],
    )
Or use packed BSHD dataloader directly
from dataset import create_bshd_packed_dataloader
dataloader, dataset = create_bshd_packed_dataloader(
    distributed_config=dist_config,
    tokenizer_name_or_path="nvidia/Llama-3.1-8B-Instruct-FP8",
    load_dataset_kwargs={"path": "parquet", "data_files": "data.parquet", "streaming": True},
    micro_batch_size=4,
    max_seq_length=8192,
    pad_to_multiple_of=16,  # For FP8 compatibility
)

Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. The pull request's changes will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). An `/ok to test` comment on the pull request is required to trigger CI; this will need to be done for each new commit.

Pre-submit Checklist