
Misc packing improvements#5189

Open
mariosasko wants to merge 11 commits into huggingface:main from mariosasko:vectorized-bfd-chunking

Conversation

@mariosasko
Contributor

@mariosasko mariosasko commented Feb 26, 2026

What does this PR do?

This PR improves the packing logic to make it faster, less error-prone, and easier to read.

The main changes are:

  • Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from "Preserve truncated tokens in BFD packing" (#4632)
    with a vectorized implementation that is significantly shorter and 30% faster.

  • Updating pack_examples to restore the input dataset’s format and perform proper input validation.

  • Applying a minor optimization to the wrapped packing implementation by reusing the offsets across all packed columns.

  • Aligning the naming with the literature (e.g., `requeue` → `split`) while preserving backward compatibility.
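To illustrate the vectorized split idea described above, here is a minimal NumPy sketch (illustrative only, not TRL's actual implementation): all fragment boundaries are computed at once, then shifted left wherever a sequence's final fragment falls short of `seq_length`.

```python
import numpy as np

# Hypothetical sketch of vectorized split-on-overflow: `lengths` holds the
# token count of each input sequence; the result is the Arrow-style offsets
# array delimiting the packed fragments, with no per-sequence Python loop.
def split_offsets(lengths, seq_length):
    lengths = np.asarray(lengths)
    num_fragments = -(-lengths // seq_length)  # ceil division per sequence
    offsets = np.arange(num_fragments.sum() + 1) * seq_length
    # "Left-shift" later offsets wherever the last fragment of a sequence is
    # shorter than seq_length; np.add.at accumulates safely even when indices
    # repeat (which happens for zero-length sequences).
    diff = np.zeros_like(offsets)
    np.add.at(diff, np.cumsum(num_fragments), -lengths % seq_length)
    return offsets - np.cumsum(diff)

print(split_offsets([5, 3], 4))  # [0 4 5 8] -> fragments of length 4, 1, 3
```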

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a useful addition to the docs. However, the report is not yet available on arXiv, so it cannot be cited as an HF paper.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core data preprocessing/packing used during training, changing overflow behavior naming (`bfd-requeue` → `bfd_split`) and format handling; issues could surface as subtle tokenization/packing or dataset formatting regressions.

Overview
Improves dataset packing/truncation utilities by introducing PackingStrategy (with alias/backward-compat parsing) and updating pack_dataset to normalize strategies, validate packable list columns, and restore the caller’s original dataset format after Arrow-based mapping.

Replaces the previous BFD “requeue” overflow handling with a shorter, vectorized split-on-overflow implementation (bfd_split), and slightly optimizes the wrapped strategy by reusing computed offsets across columns. truncate_dataset is similarly unified to use the fast Arrow path for both Dataset and iterable/dict variants while preserving formatting.
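The offset-reuse optimization for the wrapped strategy can be sketched as follows (a simplified NumPy illustration with hypothetical names, not TRL's actual code): since every packed column has the same flattened length, the chunk boundaries are computed once and reused to slice each column.

```python
import numpy as np

# Hypothetical sketch of the "wrapped" optimization: all columns share the
# same flattened token count, so the split points are computed a single time
# and reused for every column instead of being recomputed per column.
seq_length = 4
columns = {"input_ids": np.arange(1, 10), "attention_mask": np.ones(9, dtype=int)}
n = len(columns["input_ids"])
offsets = np.arange(seq_length, n, seq_length)  # shared split points: [4, 8]
packed = {name: np.split(col, offsets) for name, col in columns.items()}

print([c.tolist() for c in packed["input_ids"]])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```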

Updates SFTConfig/SFTTrainer to use the enum (including __post_init__ coercion) and to treat both bfd and bfd_split as requiring padding-free mode; docs and tests are updated accordingly.

Written by Cursor Bugbot for commit 5caa917. This will update automatically on new commits. Configure here.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


offsets = np.arange(np.sum(num_fragments) + 1, dtype=columns[0].offsets.type.to_pandas_dtype()) * seq_length
# "Left-shift" the offsets to account for the last fragment of each original sequence possibly being shorter than `seq_length`
diff = np.zeros_like(offsets)
diff[np.cumsum(num_fragments)] = -lengths % seq_length

Vectorized split loses corrections for zero-length sequences

Medium Severity

When the dataset contains any zero-length sequences, np.cumsum(num_fragments) produces duplicate indices (since num_fragments is 0 for empty rows). The fancy indexing assignment diff[np.cumsum(num_fragments)] = -lengths % seq_length then silently overwrites earlier correction values with the zero-length row's correction (which is 0). This causes the offset adjustments for preceding non-empty sequences to be lost, producing offsets that exceed the values buffer length and triggering a PyArrow out-of-bounds error. For example, with lengths=[5, 0] and seq_length=4, the correction of 3 for the length-5 sequence gets overwritten, yielding offsets [0, 4, 8] instead of [0, 4, 5]. Using np.add.at instead of direct fancy indexing would accumulate rather than overwrite.
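The suggested fix can be demonstrated in isolation, using the `lengths=[5, 0]`, `seq_length=4` example from the report: with duplicate indices, fancy indexing keeps only the last write, while `np.add.at` accumulates every write.

```python
import numpy as np

# Minimal repro of the reported edge case: a zero-length sequence produces a
# duplicate index in np.cumsum(num_fragments).
lengths = np.array([5, 0])
seq_length = 4
num_fragments = -(-lengths // seq_length)  # [2, 0]
idx = np.cumsum(num_fragments)             # [2, 2] -- duplicate index
corrections = -lengths % seq_length        # [3, 0]

buggy = np.zeros(num_fragments.sum() + 1, dtype=int)
buggy[idx] = corrections                   # second write silently overwrites the first
fixed = np.zeros_like(buggy)
np.add.at(fixed, idx, corrections)         # writes accumulate instead

print(buggy)  # [0 0 0] -- the correction of 3 was lost
print(fixed)  # [0 0 3]
```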


Member

@qgallouedec qgallouedec left a comment

thanks, just

Comment on lines +18 to +29

## truncate_dataset

[[autodoc]] truncate_dataset

## pack_dataset

[[autodoc]] pack_dataset

## PackingStrategy

[[autodoc]] PackingStrategy
Member

Suggested change
## truncate_dataset
[[autodoc]] truncate_dataset
## pack_dataset
[[autodoc]] pack_dataset
## PackingStrategy
[[autodoc]] PackingStrategy

see #5090

Comment on lines +33 to +58
class PackingStrategy(str, Enum):
    """Possible values for the packing strategy."""

    BFD = "bfd"
    BFD_SPLIT = "bfd_split"
    WRAPPED = "wrapped"

    @classmethod
    def _missing_(cls, value):
        if isinstance(value, str):
            normalized = value.lower().replace("-", "_")
            if normalized in {member.value for member in cls}:
                return cls(normalized)

            aliases = {
                "bfd_truncate": "bfd",
                "bfd_requeue": "bfd_split",
            }
            if normalized in aliases:
                return cls(aliases[normalized])

        # Copied from https://github.com/huggingface/transformers/blob/1a50a3b13b6d17c2637fe19e94a8c459bd4208a5/src/transformers/utils/generic.py#L485-L487
        raise ValueError(
            f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
        )

Member

ideally we want to keep things simple, i.e., not use an enum when possible
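An enum-free alternative along the lines the reviewer suggests could look like this (a hypothetical sketch; `normalize_packing_strategy` and `_ALIASES` are illustrative names, not TRL's API): a plain function normalizes the string and maps legacy aliases with a dict.

```python
# Hypothetical enum-free strategy validation: lowercase, replace hyphens,
# resolve legacy aliases, and reject anything unknown.
_VALID = {"bfd", "bfd_split", "wrapped"}
_ALIASES = {"bfd_truncate": "bfd", "bfd_requeue": "bfd_split"}

def normalize_packing_strategy(value: str) -> str:
    normalized = value.lower().replace("-", "_")
    normalized = _ALIASES.get(normalized, normalized)
    if normalized not in _VALID:
        raise ValueError(
            f"{value!r} is not a valid packing strategy, please select one of {sorted(_VALID)}"
        )
    return normalized

print(normalize_packing_strategy("bfd-requeue"))  # bfd_split
```

This keeps the alias handling and error message in one place without adding an `Enum` subclass to the public API.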

metadata={
"help": "Strategy for packing sequences. Can be `'bfd'` (best-fit decreasing, truncates overflow), "
"`'bfd-requeue'` (best-fit decreasing, re-queues overflow tokens), or `'wrapped'` (aggressive, cuts "
"`'bfd_split'` (best-fit decreasing, splits overflow sequences), or `'wrapped'` (aggressive, cuts "
Member

Aligning the naming with the literature (e.g., requeue → split) while preserving backward compatibility.

nice! I quickly checked the Qwen3-Coder tech report; what they call "concat-then-split" is closer to the "wrapped" strategy, no?
