Conversation

@YanhuiDua (Collaborator) commented Dec 24, 2025

Motivation

The current xtuner data distribution mechanism has a pack allocation issue that leads to an unstable number of training steps and hurts training effectiveness.

The data distribution pipeline consists of three stages:

  1. Packing Stage: Split input data_batch by token count, creating one pack per 32K tokens, resulting in N packs
  2. Distribution Stage: Evenly distribute N packs across M workers, each worker receives N/M packs
  3. Step Division: Divide packs per worker into steps based on optimizer_step parameter

When N/M is not divisible by optimizer_step, the actual training steps fail to match the expected value.
For example:

N/M = 44                              # packs per worker
optimizer_step = 16                   # expected training steps
packs_per_step = ceil(44 / 16) = 3    # packs allocated per step

# Actual result:
actual_steps = floor(44 / 3) = 14     # complete steps of 3 packs
# Total: 15 steps (14 full + 1 partial) with inconsistent batch sizes
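A minimal, runnable sketch of this arithmetic (the helper name count_actual_steps is illustrative, not xtuner code):

```python
import math

def count_actual_steps(packs_per_worker: int, optimizer_step: int) -> int:
    # Packs allocated per step, rounded up so every pack is assigned somewhere.
    packs_per_step = math.ceil(packs_per_worker / optimizer_step)
    # Full steps of packs_per_step packs, plus one smaller trailing step if packs remain.
    full_steps, remainder = divmod(packs_per_worker, packs_per_step)
    return full_steps + (1 if remainder else 0)

print(count_actual_steps(44, 16))  # 15, not the expected 16
```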

Key Changes

1. Token-aware Pre-allocation

In RawTrainingController.fit() (controller.py), samples are first evenly distributed across the M workers and then further split into optimizer_step buckets per worker, both based on token count. This keeps the token load balanced across all workers and steps (a standalone sketch of the splitting follows the list below):

  • batches_per_dp_group = self._balance_split_batch(data_batches, dp_size)
  • mini_batch_for_steps = self._balance_split_batch(dp_worker_data_batches, optimizer_steps)
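The PR's _balance_split_batch reportedly uses the Karmarkar-Karp algorithm (see the review comments below). As a simpler illustration of token-balanced splitting, a greedy longest-processing-time sketch with hypothetical sample fields might look like this:

```python
import heapq

def balance_split_batch(samples: list[dict], num_buckets: int) -> list[list[dict]]:
    """Greedy LPT split: give the longest remaining sample to the lightest bucket."""
    heap = [(0, idx) for idx in range(num_buckets)]  # (token_count, bucket_idx)
    heapq.heapify(heap)
    buckets: list[list[dict]] = [[] for _ in range(num_buckets)]
    for sample in sorted(samples, key=lambda s: s["num_tokens"], reverse=True):
        tokens, idx = heapq.heappop(heap)
        buckets[idx].append(sample)
        heapq.heappush(heap, (tokens + sample["num_tokens"], idx))
    return buckets
```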

2. Pack & Pad per Bucket

Within each pre-allocated bucket, data is packed so that no pack exceeds pack_max_length, padding is applied where necessary, and the number of packs per step is aligned across all workers (see the sketch after this list):

  • batch4pack_list = self._rearrange_batch_for_pack(step_mini_batch, pack_max_length)
  • step_pack = self._pad_and_pack_batches(batch4pack, pack_max_length)
  • self._pad_to_max_packs_across_workes(packed_data_batches, step_idx, max_packs, pack_max_length)
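A minimal sketch of the pack-and-pad idea for a single bucket, assuming every sample fits within pack_max_length and using hypothetical field names rather than the PR's actual helpers:

```python
def pack_and_pad_bucket(bucket: list[dict], pack_max_length: int, pad_id: int = 0) -> list[list[dict]]:
    """First-fit packing into packs of at most pack_max_length tokens, then pad each pack up to the limit."""
    packs: list[list[dict]] = []
    current: list[dict] = []
    used = 0
    for sample in bucket:
        if current and used + sample["num_tokens"] > pack_max_length:
            packs.append(current)
            current, used = [], 0
        current.append(sample)
        used += sample["num_tokens"]
    if current:
        packs.append(current)
    # Pad every pack to exactly pack_max_length tokens with a dummy sample.
    for pack in packs:
        deficit = pack_max_length - sum(s["num_tokens"] for s in pack)
        if deficit > 0:
            pack.append({"input_ids": [pad_id] * deficit, "num_tokens": deficit, "is_padding": True})
    return packs
```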

3. Worker-side Training

In TrainingWorker.fit() (worker.py), each worker processes its assigned data: sequence context resolution, logprob computation, importance-sampling correction, and the actual training step (a structural sketch follows the list below):

  • seq_ctx = self._resolve_ray_data(data["seq_ctx"], language_cfg)
  • self.compute_actor_logprobs()
  • self._apply_rollout_is_correction()
  • train_step()
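A stubbed outline of that per-step loop, purely to show the call order; the real TrainingWorker.fit() signature and helper return types may differ:

```python
class TrainingWorkerSketch:
    """Illustrative skeleton, not the real TrainingWorker."""

    def fit(self, data_batches: list[list[dict]], language_cfg=None) -> None:
        for step_batches in data_batches:  # one inner list per optimizer step
            seq_ctx_list = [self._resolve_ray_data(b["seq_ctx"], language_cfg) for b in step_batches]
            loss_ctx_list = self.compute_actor_logprobs(seq_ctx_list)
            loss_ctx_list, metrics = self._apply_rollout_is_correction(loss_ctx_list)
            self.train_step(seq_ctx_list, loss_ctx_list)

    # Stubs standing in for the real helpers listed above.
    def _resolve_ray_data(self, seq_ctx, language_cfg): return seq_ctx
    def compute_actor_logprobs(self, seq_ctx_list): return [{} for _ in seq_ctx_list]
    def _apply_rollout_is_correction(self, loss_ctx_list): return loss_ctx_list, {}
    def train_step(self, seq_ctx_list, loss_ctx_list): pass
```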

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the packing logic in the RL training controller and worker components to improve token balancing and code organization. The key changes introduce a Karmarkar-Karp algorithm for balanced partitioning, extract helper methods for better code maintainability, and restructure how data batches are distributed across workers.

Key Changes

  • Introduces sequence-length balanced partitioning using the Karmarkar-Karp differencing algorithm to better distribute workload across devices (a standalone sketch follows this list)
  • Refactors worker's fit method to accept nested list structure list[list[WorkerInputItem]] instead of flat list, aligning with the new per-step packing approach
  • Extracts reusable helper methods (_resolve_ray_data, _apply_rollout_is_correction, _create_padding_sample, _pack, _balance_split_batch) to reduce code duplication and improve maintainability
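The partitioning utility is adapted from verl's seqlen_balancing.py (see the snippet further down). Independent of that implementation, a self-contained sketch of the generalized largest differencing method, under the illustrative name kk_partition, could look like:

```python
import heapq

def kk_partition(seqlens: list[int], k: int) -> list[list[int]]:
    """Split item indices into k groups with near-equal token sums (largest differencing method)."""
    # Each heap entry: (negative spread, tiebreak, partitions), where partitions is a
    # descending-sorted list of (token_sum, item_indices) pairs.
    states = []
    for tie, (i, n) in enumerate(enumerate(seqlens)):
        parts = [(n, [i])] + [(0, []) for _ in range(k - 1)]
        states.append((-n, tie, parts))
    heapq.heapify(states)
    tie = len(states)
    while len(states) > 1:
        _, _, a = heapq.heappop(states)  # state with the largest spread
        _, _, b = heapq.heappop(states)
        # Pair a's heaviest partition with b's lightest, and so on, to cancel imbalance.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda p: -p[0])
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(states, (-spread, tie, merged))
        tie += 1
    return [indices for _, indices in states[0][2]]

# Example: split 6 sequences into 3 partitions with roughly equal token counts.
print(kk_partition([700, 300, 200, 600, 100, 500], k=3))
```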

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.

File | Description
xtuner/v1/rl/utils.py | Adds Karmarkar-Karp algorithm implementation with get_seqlen_balanced_partitions function for balanced workload distribution across partitions
xtuner/v1/rl/base/worker.py | Refactors fit method to handle nested batch structure, extracts ray data resolution and importance sampling logic into separate methods, adds get_worker_cfg accessor method
xtuner/v1/rl/base/controller.py | Major refactoring of packing logic with new balanced splitting, padding creation, and improved data distribution across workers with per-step gradient accumulation support


# Adapted from https://github.com/volcengine/verl/blob/main/verl/utils/seqlen_balancing.py
def karmarkar_karp(seqlen_list: list[int], k_partitions: int, equal_size: bool):
# see: https://en.wikipedia.org/wiki/Largest_differencing_method
class Set:
Copilot AI commented Dec 24, 2025
This class implements `__lt__`, but does not implement `__le__` or `__ge__`.

return len(self.items) < len(other.items)
return self.items < other.items

class State:
Copilot AI commented Dec 24, 2025

This class implements `__lt__`, but does not implement `__le__` or `__ge__`.

get_logger().info(f"default split into {dp_size} partitions with tokens: {tokens_in_partition}")

packed_data_batches: list[list[list[dict]]] = [[[] for _ in range(optimizer_steps)] for _ in range(dp_size)]
max_packs_per_card = [0] * optimizer_steps
Collaborator:
rename to max_packed_batch_num_per_step

@YanhuiDua (Collaborator, Author) commented Dec 29, 2025

max_packs_per_step is more accurate: the maximum number of packs per step.


# old logprobs are inplaced updated in compute_actor_logprobs
loss_ctx_input_list = self.compute_actor_logprobs(seq_ctx_list, loss_ctx_input_list)
loss_ctx_input_list, metrics = self._apply_rollout_is_correction(
Collaborator:
Great! The previously very long fit function is now well structured and much easier to read.

@jayhenry (Collaborator) commented Dec 26, 2025

Motivation

The current xtuner data distribution mechanism has a pack allocation issue that leads to an unstable number of training steps and hurts training effectiveness.

The data distribution pipeline consists of three stages:

  1. Packing Stage: Split input data_batch by token count, creating one pack per 32K tokens, resulting in N packs
  2. Distribution Stage: Evenly distribute N packs across M workers, each worker receives N/M packs
  3. Step Division: Divide packs per worker into steps based on optimizer_step parameter

When N/M is not divisible by optimizer_step, the actual training steps fail to match the expected value. For example:

N/M = 44                              # packs per worker
optimizer_step = 16                   # expected training steps
packs_per_step = ceil(44 / 16) = 3    # packs allocated per step

# Actual result:
actual_steps = floor(44 / 3) = 14     # complete steps of 3 packs
# Total: 15 steps (14 full + 1 partial) with inconsistent batch sizes

Key Changes

This PR refactors the pipeline to Allocate → Pack & Pad and wraps several methods of TrainController and TrainWorker.

  1. Token-aware pre-allocation: evenly distribute samples into M workers (optional) × optimizer_step buckets based on token count
  2. Pack & pad per bucket: apply packing and padding within each pre-allocated bucket

Great PR description!
Maybe you can add the calling chain of the core packing functions, corresponding to the workflow in Key Changes, for example:

controller.py:
RawTrainingController.fit()
# 1. Token-aware pre-allocation: evenly distribute samples into M workers (optional) × optimizer_step buckets based on token count
-> batches_per_dp_group = self._balance_split_batch(data_batches, dp_size)
-> mini_batch_for_steps = self._balance_split_batch(dp_worker_data_batches, optimizer_steps)
# 2. Pack & pad per bucket: apply packing and padding within each pre-allocated bucket
-> batch4pack_list = self._rearrange_batch_for_pack(step_mini_batch, pack_max_length)   # the old version: pack_mini_batch = self._pack(step_mini_batch, pack_max_length)
-> self._pack_batches()  # pieces of packing code that would be better wrapped in a new function `_pack_batches()`
worker.py:
-> self._create_padding_sample()  # pieces of padding code that would be better wrapped in a new function `_pad_batches()`
# 3. use the packed and padded data batches
-> TrainingWorker.fit()

Then I can review easily in the same order as above :)

Additionally, when you write out the core function calling chain corresponding to your original design in the "Key Changes", you will find that some high-level functions are missing from the implementation, such as _pad_batches(). If you add these high-level functions, you can explain the design more clearly and others can read the code more easily, because readers can reason at the high level and ignore the messy details.

Unit tests can play the same role. For example, if you want to unit-test the core padding logic, you first need to abstract the related code pieces into a function like _pad_batches() and then test it, as sketched below.
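For illustration only (the PR does not define pad_batches; the name and data layout here are assumptions), such a test could pin down the padding invariant directly:

```python
def pad_batches(packs: list[list[dict]], pack_max_length: int, pad_id: int = 0) -> list[list[dict]]:
    """Illustrative stand-in for the suggested _pad_batches(): top up each pack to pack_max_length tokens."""
    for pack in packs:
        deficit = pack_max_length - sum(s["num_tokens"] for s in pack)
        if deficit > 0:
            pack.append({"input_ids": [pad_id] * deficit, "num_tokens": deficit, "is_padding": True})
    return packs


def test_pad_batches_fills_every_pack():
    packs = [
        [{"input_ids": [1] * 10, "num_tokens": 10}],
        [{"input_ids": [1] * 32, "num_tokens": 32}],
    ]
    padded = pad_batches(packs, pack_max_length=32)
    assert all(sum(s["num_tokens"] for s in pack) == 32 for pack in padded)
```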

del data_batches

# old logprobs are inplaced updated in compute_actor_logprobs
loss_ctx_input_list = self.compute_actor_logprobs(seq_ctx_list, loss_ctx_input_list)
Collaborator:
There is an optimization opportunity here. self._resolve_ray_data is a relatively expensive operation; it could be overlapped with the self.compute_actor_logprobs computation to hide the cross-node data-loading cost.

Writing this out may be a bit involved; if you don't want to change it for now, adding a TODO is fine.

Collaborator (Author):
Done; added a TODO for now.

assert world_size % self.data_replicate_size == 0, "world_size must be divisible by data_replicate_size"
optimizer_steps = self.worker_cfg.optimizer_steps

batches_per_dp_group: list[list[WorkerInputItem]]
Collaborator:
Some corner cases may not be covered. For example, if optimizer_steps=16 but there are fewer than 16 data samples, will the code raise an error? It would be good to cover cases like this with rigorous unit tests.

handles.append(
worker.fit.remote( # type: ignore[attr-defined]
data_batches=packed_data_batches[(worker_idx // data_replicate_size) :: dp_size],
data_batches=packed_data_batches[worker_idx // self.data_replicate_size],
Collaborator:
The ::dp_size logic must not be removed.

@hhaAndroid (Collaborator) commented:
The random.shuffle(data_batches) in rl_trainer needs to be removed.
