
[data] feat: add DynamicBatchingSizeDataset for stateful multi-worker dynamic batching#488

Open
LiuzcEECS wants to merge 3 commits into main from zhichao/dynamic-batching-dataset

Conversation

@LiuzcEECS
Collaborator

What does this PR do?

Add DynamicBatchingSizeDataset, a new dataset-level dynamic batching approach that constructs micro batches inside DataLoader worker processes (as opposed to the existing DynamicBatchSizeDataLoader which does it in the main process). This enables proper checkpoint/resume support via StatefulDataLoader's state_dict()/load_state_dict() for dynamic batching, and is controlled by the new dyn_bsz_in_worker_loop flag.

Test

All 9 tests pass (pytest tests/data/test_dynamic_batching_dataset.py):

  • 5 unit tests: basic dynamic batching with shuffle on and off, forcing a long sequence, draining the last batch, and running without get_item
  • 4 distributed tests via torchrun --nproc_per_node=2, covering the 2×2 combinations of shuffle × save_by_idx; each runs 2 epochs of training with a checkpoint saved at step 2 of epoch 1 and verified after resume

API and Usage Example

New arguments:

# Use dataset-level dynamic batching (DynamicBatchingSizeDataset) instead of main-process batching (DynamicBatchSizeDataLoader)
--train.dyn_bsz_in_worker_loop=false

# Control whether to save buffer by index (way smaller checkpoint) or by full sample
--train.dyn_bsz_dataset_save_by_idx=true

When dyn_bsz_in_worker_loop=false, build_native_dataloader wraps the dataset with DynamicBatchingSizeDataset, which yields pre-batched micro batches from each worker.

When dyn_bsz_in_worker_loop=true (default, existing behavior), the original DynamicBatchSizeDataLoader path is used — no behavior change.
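For orientation, here is a minimal sketch of how the resulting loader could be checkpointed and resumed through StatefulDataLoader's state_dict()/load_state_dict(); the train_step, checkpoint_step, and surrounding wiring are illustrative placeholders, not the exact VeOmni API.

import torch

# Assume `dataloader` was produced by build_native_dataloader(...) with
# dyn_bsz_in_worker_loop=false, i.e. it is a torchdata StatefulDataLoader
# wrapping a DynamicBatchingSizeDataset.
def train_one_epoch(dataloader, train_step, checkpoint_step, ckpt_path="dataloader_state.pt"):
    for step, micro_batch in enumerate(dataloader):
        train_step(micro_batch)
        if step == checkpoint_step:
            # Saves loader progress, including the per-worker packing buffers.
            torch.save(dataloader.state_dict(), ckpt_path)

def resume(dataloader, ckpt_path="dataloader_state.pt"):
    # Restore before iterating; iteration then continues from the saved position.
    dataloader.load_state_dict(torch.load(ckpt_path))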

Design & Code Changes

Core: DynamicBatchingSizeDataset (veomni/data/dynamic_batching.py)

  • New IterableDataset subclass that buffers samples in each worker and yields micro batches when the buffer has ≥ ready_for_micro_batch_threshold samples and ≥ micro_batch_seq_length total tokens.
  • Greedy bin-packing in _get_micro_batch(): iterates the buffer and selects samples that fit within the remaining token budget (a minimal sketch follows this list).
  • Supports state_dict() / load_state_dict() for StatefulDataLoader checkpoint/resume:
    • save_by_idx=True: saves only sample indices (smaller checkpoint), requires dataset to support get_item() and output_refetch_idx.
    • save_by_idx=False: saves full buffer contents via deepcopy.
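To make the worker-side flow concrete, here is a minimal sketch of the buffering and packing idea. It is not the VeOmni implementation; everything beyond the names quoted above (for example the input_ids/idx sample fields and the _ready helper) is an assumption.

from copy import deepcopy
from torch.utils.data import IterableDataset

class DynamicBatchingSketch(IterableDataset):
    """Sketch of worker-side dynamic batching; not the actual DynamicBatchingSizeDataset."""

    def __init__(self, dataset, micro_batch_seq_length, ready_for_micro_batch_threshold, save_by_idx=True):
        self.dataset = dataset
        self.micro_batch_seq_length = micro_batch_seq_length
        self.threshold = ready_for_micro_batch_threshold
        self.save_by_idx = save_by_idx
        self._buffer = []        # samples waiting to be packed
        self._data_iter = None   # bound lazily inside each worker copy

    def _ready(self):
        total_tokens = sum(len(s["input_ids"]) for s in self._buffer)
        return len(self._buffer) >= self.threshold and total_tokens >= self.micro_batch_seq_length

    def _get_micro_batch(self):
        # Greedy bin-packing: keep buffered samples that still fit the token budget.
        batch, kept, budget = [], [], self.micro_batch_seq_length
        for sample in self._buffer:
            n = len(sample["input_ids"])
            if n <= budget:
                batch.append(sample)
                budget -= n
            else:
                kept.append(sample)
        if not batch:            # an over-long sample is emitted alone rather than stalling
            batch, kept = kept[:1], kept[1:]
        self._buffer = kept
        return batch

    def __iter__(self):
        if self._data_iter is None:
            self._data_iter = iter(self.dataset)
        for sample in self._data_iter:
            self._buffer.append(sample)
            if self._ready():
                yield self._get_micro_batch()
        while self._buffer:      # drain the remainder once the dataset is exhausted
            yield self._get_micro_batch()

    def state_dict(self):
        if self.save_by_idx:
            # Assumes samples carry their source index (output_refetch_idx in the PR).
            return {"buffer_idx": [s["idx"] for s in self._buffer]}
        return {"buffer": deepcopy(self._buffer)}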

Integration (veomni/data/data_loader.py)

  • build_native_dataloader now branches on dyn_bsz_in_worker_loop:
    • True: existing TextBatchingStrategy + DynamicBatchSizeDataLoader path.
    • False: wraps the dataset in DynamicBatchingSizeDataset, uses NoopDataCollator, and returns StatefulDataLoader directly, with no DynamicBatchSizeDataLoader wrapper (see the sketch below).
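A rough sketch of the new branch; apart from the class names quoted in this PR (DynamicBatchingSizeDataset, NoopDataCollator, StatefulDataLoader, DynamicBatchSizeDataLoader), the argument names and helper below are assumptions.

from torchdata.stateful_dataloader import StatefulDataLoader
from veomni.data.dynamic_batching import DynamicBatchingSizeDataset  # module path per this PR
# The NoopDataCollator import path is assumed and omitted here.

def build_native_dataloader(dataset, args):
    if args.dyn_bsz_in_worker_loop:
        # Existing path: TextBatchingStrategy + DynamicBatchSizeDataLoader in the main process.
        return build_main_process_dyn_bsz_loader(dataset, args)   # hypothetical helper
    # New path: workers yield pre-packed micro batches, so no batching wrapper is needed.
    packed = DynamicBatchingSizeDataset(
        dataset,
        micro_batch_seq_length=args.max_seq_len,                  # assumed argument name
        save_by_idx=args.dyn_bsz_dataset_save_by_idx,
    )
    return StatefulDataLoader(
        packed,
        batch_size=None,                # micro batches are already formed in the workers
        num_workers=args.num_workers,   # assumed argument name
        collate_fn=NoopDataCollator(),  # pass packed batches through unchanged
    )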

Arguments (veomni/arguments/arguments_types.py)

  • dyn_bsz_in_worker_loop: bool = True — controls which dynamic batching path to use.
  • dyn_bsz_dataset_save_by_idx: bool = True — controls the checkpoint buffer serialization strategy (illustrative field definitions follow this list).
  • Updated dataloader_batch_size calculation for dyn_bsz_in_worker_loop=False case.
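For reference, a hypothetical shape of the new fields; only the field names and defaults come from this PR, while the enclosing dataclass and help strings are illustrative.

from dataclasses import dataclass, field

@dataclass
class TrainArgumentsSketch:   # stand-in for the real class in arguments_types.py
    dyn_bsz_in_worker_loop: bool = field(
        default=True,
        metadata={"help": "True: existing DynamicBatchSizeDataLoader path; False: DynamicBatchingSizeDataset inside workers."},
    )
    dyn_bsz_dataset_save_by_idx: bool = field(
        default=True,
        metadata={"help": "Checkpoint the batching buffer by sample index instead of by full sample contents."},
    )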

Minor refactor (veomni/data/batching_strategy.py)

  • Renamed BaseBatchingStrategy.is_full_filled() → is_ready_for_micro_batch() for clarity.

Tests (tests/data/test_dynamic_batching_dataset.py, tests/data/utils.py)

  • DummyMappingDataset / DummyIterableDataset: test fixtures with configurable sequence lengths, rank/worker sharding (similar to ByteDance's StreamingDataset), and state_dict() support (sketched after this list).
  • Unit tests covering basic batching, long sequence handling, dataset exhaustion.
  • Distributed tests running multi-epoch training with checkpoint save/resume verification across shuffle × save_by_idx combinations.
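A minimal sketch of what such a fixture might look like; the field names, sharding scheme, and _consumed cursor are assumptions, not the PR's actual DummyIterableDataset.

import torch
from torch.utils.data import IterableDataset, get_worker_info

class DummyShardedDataset(IterableDataset):
    def __init__(self, seq_lengths, rank=0, world_size=1):
        self.seq_lengths = seq_lengths
        self.rank = rank
        self.world_size = world_size
        self._consumed = 0   # resumable cursor saved by state_dict()

    def _shard_indices(self):
        # Interleave samples across (rank, worker) pairs so no sample is seen twice.
        info = get_worker_info()
        workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        stride = self.world_size * workers
        start = self.rank * workers + worker_id
        return range(start, len(self.seq_lengths), stride)

    def __iter__(self):
        for idx in list(self._shard_indices())[self._consumed:]:
            self._consumed += 1
            yield {"idx": idx, "input_ids": torch.zeros(self.seq_lengths[idx], dtype=torch.long)}

    def state_dict(self):
        return {"consumed": self._consumed}

    def load_state_dict(self, state):
        self._consumed = state["consumed"]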

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces DynamicBatchingSizeDataset, a new approach for dynamic batching that operates within DataLoader worker processes. This is a significant feature that enables stateful checkpointing and resumption for dynamic batching with multiple workers. The changes are extensive, including the new dataset implementation, new arguments to control it, integration into the data loader build process, and a comprehensive new test suite with both unit and distributed tests.

Overall, the implementation is solid and the tests are thorough. However, I've identified two critical issues. One is a potential breaking change in argument handling that could affect existing user configurations for dynamic batching. The other is a bug in the new DynamicBatchingSizeDataset's iterator implementation that would prevent correct multi-epoch training. Addressing these issues is crucial for the stability and correctness of this new feature.

Comment on lines +117 to +118
if not self._data_iter:
self._data_iter = iter(self.dataset)

critical

The __iter__ method does not correctly reset the underlying data iterator (self._data_iter) for new epochs. It's initialized only if it's None. After the first epoch, self._data_iter will be an exhausted iterator. When a new epoch starts and __iter__ is called again, it will reuse the exhausted iterator, causing subsequent epochs to yield no data or incomplete data. This breaks multi-epoch training.

The iterator from the upstream dataset should be re-initialized at the beginning of every __iter__ call to ensure each epoch processes the data correctly from the start.

        self._data_iter = iter(self.dataset)

Collaborator Author


I think this is incorrect. In the PyTorch DataLoader, when iteration starts, each worker receives a copy of the dataset as it was initialized in the main process; we only construct it there, so _data_iter is always None in the main process. The __iter__ function then assigns it at the beginning of the loop, so we should be fine.
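For context, a minimal repro of the behavior being debated, assuming the multi-worker setup this dataset targets (the class below is a toy, not the PR's code): each epoch's iter(DataLoader) gives the workers a fresh copy of the dataset as it exists in the main process, where _data_iter is still None, so the lazy initialization re-runs every epoch. With num_workers=0 the dataset is not copied, and the exhausted-iterator concern would indeed apply.

from torch.utils.data import DataLoader, IterableDataset

class LazyIterDataset(IterableDataset):
    def __init__(self):
        self._data_iter = None   # never assigned in the main process

    def __iter__(self):
        if self._data_iter is None:
            self._data_iter = iter(range(4))
        yield from self._data_iter

if __name__ == "__main__":
    loader = DataLoader(LazyIterDataset(), num_workers=2, batch_size=None)
    for epoch in range(2):
        # Both epochs yield data because each epoch's workers start from a fresh copy.
        # (Each worker yields the full range here since the toy omits worker sharding.)
        print(epoch, list(loader))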

@Luosuu
Collaborator

Luosuu commented Feb 18, 2026

Should we simply replace the original one in this case? Why would we need two if their only difference is where the batching is performed?

@LiuzcEECS self-assigned this Feb 18, 2026
Collaborator

@Luosuu left a comment


Let's add a feature switch to hide this from users for now, and let's test it first.

@LiuzcEECS force-pushed the zhichao/dynamic-batching-dataset branch from 7cb0d4b to 469a99a on February 19, 2026 01:59
@LiuzcEECS force-pushed the zhichao/dynamic-batching-dataset branch from 469a99a to 4dc4bbb on February 19, 2026 04:29