Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
offsets = np.arange(np.sum(num_fragments) + 1, dtype=columns[0].offsets.type.to_pandas_dtype()) * seq_length
# "Left-shift" the offsets to account for the last fragment of each original sequence possibly being shorter than `seq_length`
diff = np.zeros_like(offsets)
diff[np.cumsum(num_fragments)] = -lengths % seq_length
```
Vectorized split loses corrections for zero-length sequences
Medium Severity
When the dataset contains any zero-length sequences, `np.cumsum(num_fragments)` produces duplicate indices (since `num_fragments` is 0 for empty rows). The fancy-indexing assignment `diff[np.cumsum(num_fragments)] = -lengths % seq_length` then silently overwrites earlier correction values with the zero-length row's correction (which is 0). This causes the offset adjustments for preceding non-empty sequences to be lost, producing offsets that exceed the values buffer length and triggering a PyArrow out-of-bounds error. For example, with `lengths=[5, 0]` and `seq_length=4`, the correction of 3 for the length-5 sequence gets overwritten, yielding offsets `[0, 4, 8]` instead of `[0, 4, 5]`. Using `np.add.at` instead of direct fancy indexing would accumulate rather than overwrite.
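A minimal standalone sketch (variable names borrowed from the diff; the final `offsets - np.cumsum(diff)` adjustment is an assumption about how the corrections are applied) reproducing the overwrite and the suggested `np.add.at` fix:

```python
import numpy as np

def fragment_offsets(lengths, seq_length, accumulate=False):
    """Hypothetical standalone version of the diff's offset computation."""
    lengths = np.asarray(lengths)
    num_fragments = -(-lengths // seq_length)  # ceil(length / seq_length), 0 for empty rows
    offsets = np.arange(np.sum(num_fragments) + 1, dtype=np.int64) * seq_length
    diff = np.zeros_like(offsets)
    corrections = -lengths % seq_length  # shortfall of each sequence's last fragment
    if accumulate:
        # np.add.at accumulates on duplicate indices, so zero-length rows are harmless
        np.add.at(diff, np.cumsum(num_fragments), corrections)
    else:
        # fancy indexing silently keeps only the LAST write per duplicate index
        diff[np.cumsum(num_fragments)] = corrections
    return (offsets - np.cumsum(diff)).tolist()

print(fragment_offsets([5, 0], 4))                   # [0, 4, 8]  <- correction of 3 lost
print(fragment_offsets([5, 0], 4, accumulate=True))  # [0, 4, 5]  <- correct
```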
```md
## truncate_dataset

[[autodoc]] truncate_dataset

## pack_dataset

[[autodoc]] pack_dataset

## PackingStrategy

[[autodoc]] PackingStrategy
```
Suggested change:

```md
## truncate_dataset
[[autodoc]] truncate_dataset
## pack_dataset
[[autodoc]] pack_dataset
## PackingStrategy
[[autodoc]] PackingStrategy
```
see #5090
```python
class PackingStrategy(str, Enum):
    """Possible values for the packing strategy."""

    BFD = "bfd"
    BFD_SPLIT = "bfd_split"
    WRAPPED = "wrapped"

    @classmethod
    def _missing_(cls, value):
        if isinstance(value, str):
            normalized = value.lower().replace("-", "_")
            if normalized in {member.value for member in cls}:
                return cls(normalized)

            aliases = {
                "bfd_truncate": "bfd",
                "bfd_requeue": "bfd_split",
            }
            if normalized in aliases:
                return cls(aliases[normalized])

        # Copied from https://github.com/huggingface/transformers/blob/1a50a3b13b6d17c2637fe19e94a8c459bd4208a5/src/transformers/utils/generic.py#L485-L487
        raise ValueError(
            f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
        )
```
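For illustration, a self-contained sketch (re-declaring the enum with a trimmed error message) of how `_missing_` normalizes case and hyphens and resolves the legacy alias spellings:

```python
from enum import Enum

class PackingStrategy(str, Enum):
    BFD = "bfd"
    BFD_SPLIT = "bfd_split"
    WRAPPED = "wrapped"

    @classmethod
    def _missing_(cls, value):
        if isinstance(value, str):
            # lowercase and swap hyphens for underscores before lookup
            normalized = value.lower().replace("-", "_")
            if normalized in {member.value for member in cls}:
                return cls(normalized)
            # legacy spellings map onto the renamed members
            aliases = {"bfd_truncate": "bfd", "bfd_requeue": "bfd_split"}
            if normalized in aliases:
                return cls(aliases[normalized])
        raise ValueError(f"{value!r} is not a valid {cls.__name__}")

# Old and differently-cased spellings resolve to the renamed members:
assert PackingStrategy("BFD-Requeue") is PackingStrategy.BFD_SPLIT
assert PackingStrategy("bfd_truncate") is PackingStrategy.BFD
assert PackingStrategy("Wrapped") is PackingStrategy.WRAPPED
```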
Ideally we want to keep things simple, i.e. not use an enum when possible.
```diff
 metadata={
     "help": "Strategy for packing sequences. Can be `'bfd'` (best-fit decreasing, truncates overflow), "
-    "`'bfd-requeue'` (best-fit decreasing, re-queues overflow tokens), or `'wrapped'` (aggressive, cuts "
+    "`'bfd_split'` (best-fit decreasing, splits overflow sequences), or `'wrapped'` (aggressive, cuts "
```
> Aligning the naming with the literature (e.g., requeue → split) while preserving backward compatibility.

Nice! I quickly checked the Qwen3-Coder tech report; what they call "concat-then-split" is closer to the "wrapped" strategy, no?
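To make that comparison concrete, here is a hypothetical few-line sketch of the "wrapped" (concat-then-split) strategy: all sequences are concatenated, then cut every `seq_length` tokens, ignoring sequence boundaries (only the final chunk may be shorter):

```python
import numpy as np

def wrapped_pack(sequences, seq_length):
    """Illustrative 'wrapped' packing: concatenate everything, then slice into
    fixed-size chunks regardless of where the original sequences end."""
    flat = np.concatenate([np.asarray(s) for s in sequences])
    return [flat[i:i + seq_length].tolist() for i in range(0, len(flat), seq_length)]

print(wrapped_pack([[1, 2, 3, 4, 5], [6, 7]], seq_length=4))  # [[1, 2, 3, 4], [5, 6, 7]]
```

Note how the length-5 sequence is cut mid-sequence and its tail is glued to the next sequence, which is exactly what "bfd" avoids.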
What does this PR do?
This PR improves the packing logic to make it faster, less error-prone, and easier to read. The main changes are:
- Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from "Preserve truncated tokens in BFD packing" #4632 with a vectorized implementation that is significantly shorter and 30% faster.
- Updating `pack_examples` to restore the input dataset's format and perform proper input validation.
- Applying a minor optimization to the `wrapped` packing implementation by reusing the `offsets` across all packed columns.
- Aligning the naming with the literature (e.g., `requeue` → `split`) while preserving backward compatibility.

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a nice addition to the docs. However, the report is not yet available on arXiv, so it cannot be cited as an HF paper.
Before submitting
- Did you read the contributor guidelines, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk: Touches core data preprocessing/packing used during training, changing overflow behavior naming (`bfd-requeue` → `bfd_split`) and format handling; issues could surface as subtle tokenization/packing or dataset formatting regressions.

Overview
- Improves dataset packing/truncation utilities by introducing `PackingStrategy` (with alias/backward-compat parsing) and updating `pack_dataset` to normalize strategies, validate packable list columns, and restore the caller's original dataset format after Arrow-based mapping.
- Replaces the previous BFD "requeue" overflow handling with a shorter, vectorized split-on-overflow implementation (`bfd_split`), and slightly optimizes the `wrapped` strategy by reusing computed offsets across columns. `truncate_dataset` is similarly unified to use the fast Arrow path for both `Dataset` and iterable/dict variants while preserving formatting.
- Updates `SFTConfig`/`SFTTrainer` to use the enum (including `__post_init__` coercion) and to treat both `bfd` and `bfd_split` as requiring padding-free mode; docs and tests are updated accordingly.

Written by Cursor Bugbot for commit 5caa917. This will update automatically on new commits. Configure here.