Commit f5c618c

[README] Add explanations for varlen training
1 parent 0a1454e commit f5c618c

File tree

1 file changed: +8 −0 lines changed


README.md

Lines changed: 8 additions & 0 deletions
@@ -415,6 +415,14 @@ options:
 ```
 </details>

+### Training with variable-length inputs
+When you set the `--training.varlen` flag, you enable a more efficient training mode that packs multiple documents into a single long sequence, eliminating the need for padding.
+This is particularly useful when your dataset contains documents of varying lengths.
+Let's break down how `--training.seq_len` and `--training.context_len` work in this mode.
+
+* `--training.seq_len` (packed sequence length): the total length of the final sequence fed to the model on one device. Instead of processing one document at a time, the dataloader takes multiple documents (each split into chunks no longer than `context_len`), concatenates them end-to-end, and produces a single long sequence of length `seq_len`.
+* `--training.context_len` (sample length): the maximum number of tokens for a single document or sample. If a document from the dataset is longer than `context_len`, it is split. For example, with `--training.context_len` set to 4,096, a document of 5,000 tokens is cut down to its first 4,096 tokens, with the remaining tokens forming another independent sequence, while a document of 3,000 tokens remains unchanged.
+
 ### Training with `torch.compile`

 Starting from `torch 2.0`, `torch.compile` has been introduced as a new feature to seamlessly accelerate training processes.
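The packing behavior the added section describes can be sketched in a few lines. `pack_documents` below is a hypothetical helper, not the repo's actual dataloader: it only illustrates how `context_len` splits individual documents and how `seq_len` sizes the packed sequences fed to the model.

```python
def pack_documents(docs, seq_len, context_len):
    """Sketch of varlen packing: split docs at context_len, pack into seq_len.

    `docs` is a list of tokenized documents (lists of token ids).
    Returns packed sequences and the start offset of each chunk in the
    concatenated stream (which varlen attention can use to keep documents
    from attending to one another).
    """
    # Split each document into independent chunks of at most context_len tokens.
    chunks = []
    for doc in docs:
        for i in range(0, len(doc), context_len):
            chunks.append(doc[i:i + context_len])

    # Concatenate all chunks end-to-end into one token stream, remembering
    # where each chunk starts.
    stream, starts = [], []
    for chunk in chunks:
        starts.append(len(stream))
        stream.extend(chunk)

    # Slice the stream into packed sequences of exactly seq_len tokens;
    # a trailing partial pack is simply dropped in this sketch.
    packs = [stream[i:i + seq_len]
             for i in range(0, len(stream) - seq_len + 1, seq_len)]
    return packs, starts
```

With `context_len=4096`, a 5,000-token document yields chunks of 4,096 and 904 tokens, and a 3,000-token document stays whole; the resulting stream is then cut into `seq_len`-sized packs regardless of chunk boundaries.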

0 commit comments
