Commit f5c618c

[README] Add explanations for varlen training
1 parent 0a1454e commit f5c618c

File tree

1 file changed: +8 −0 lines changed


README.md

Lines changed: 8 additions & 0 deletions
@@ -415,6 +415,14 @@ options:
 ```
 </details>

+### Training with variable-length inputs
+When you set the `--training.varlen` flag, you enable a more efficient training mode that packs multiple documents into a single long sequence, eliminating the need for padding.
+This is particularly useful when your dataset contains documents of varying lengths.
+Let's break down how `--training.seq_len` and `--training.context_len` work in this mode.
+
+* `--training.seq_len` (packed sequence length): the total length of the final sequence fed to the model on one device. Instead of processing one document at a time, the dataloader takes multiple documents (each split into chunks no longer than `context_len`), concatenates them end-to-end, and produces a single long sequence of length `seq_len`.
+* `--training.context_len` (sample length): the maximum number of tokens for a single document or sample. If a document from the dataset is longer than `context_len`, it is split. For example, with `--training.context_len` set to 4,096, a document of 5,000 tokens is cut down to its first 4,096 tokens, with the remaining tokens forming another independent sequence, while a document of 3,000 tokens remains unchanged.
+
 ### Training with `torch.compile`

 Starting from `torch 2.0`, `torch.compile` has been introduced as a new feature to seamlessly accelerate training processes.
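The packing behavior the added section describes can be sketched in a few lines. `pack_documents` below is a hypothetical helper, not the repo's actual dataloader: it only illustrates how `context_len` splits individual documents and how `seq_len` sizes the packed sequences fed to the model.

```python
def pack_documents(docs, seq_len, context_len):
    """Sketch of varlen packing: split docs at context_len, pack into seq_len.

    `docs` is a list of tokenized documents (lists of token ids).
    Returns packed sequences and the start offset of each chunk in the
    concatenated stream (which varlen attention can use to keep documents
    from attending to one another).
    """
    # Split each document into independent chunks of at most context_len tokens.
    chunks = []
    for doc in docs:
        for i in range(0, len(doc), context_len):
            chunks.append(doc[i:i + context_len])

    # Concatenate all chunks end-to-end into one token stream, remembering
    # where each chunk starts.
    stream, starts = [], []
    for chunk in chunks:
        starts.append(len(stream))
        stream.extend(chunk)

    # Slice the stream into packed sequences of exactly seq_len tokens;
    # a trailing partial pack is simply dropped in this sketch.
    packs = [stream[i:i + seq_len]
             for i in range(0, len(stream) - seq_len + 1, seq_len)]
    return packs, starts
```

With `context_len=4096`, a 5,000-token document yields chunks of 4,096 and 904 tokens, and a 3,000-token document stays whole; the resulting stream is then cut into `seq_len`-sized packs regardless of chunk boundaries.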

0 commit comments
