[data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller by LLLLxmmm · Pull Request #43 · TransferQueue/verl

LLLLxmmm · 2025-11-07T07:30:04Z

…s for Train/Val/Test in controller

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

Summary by CodeRabbit

Refactor
- Consolidated data system handling and simplified transfer queue client initialization architecture.
- Unified storage and controller management into a single data system flow.
- Transitioned from global_step-based to partition_id-based data organization across training and validation workflows.
Chores
- Updated TransferQueue dependency to latest version.

…s for Train/Val/Test in controller

0oshowero0 · 2025-11-07T08:10:12Z

recipe/transfer_queue/ray_trainer.py

+        self.config = OmegaConf.merge(tq_config, self.config)

        # 4. create client
        # each client should be allocated to exactly one controller


Suggested change

# each client should be allocated to exactly one controller

Copilot

Pull Request Overview

This PR refactors the TransferQueue data system to use a unified controller/client architecture instead of separate train and validation systems. The key changes include:

Consolidates dual data systems (train and validation) into a single unified TransferQueue system
Replaces global_step parameter with partition_id for better data partitioning
Updates TransferQueue dependency to a newer commit version
Removes the concept of separate validation clients

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
verl/utils/transferqueue_utils.py	Removed separate validation client; simplified client creation to use single controller with storage manager initialization
verl/single_controller/base/worker.py	Updated client creation signature to pass unified controller info and config instead of separate train/val parameters
requirements_transferqueue.txt	Updated TransferQueue dependency to newer commit hash (862b74a)
recipe/transfer_queue/ray_trainer.py	Consolidated train/val data systems into unified system; replaced `global_step` with `partition_id` prefixed by train/val context
recipe/transfer_queue/agent_loop.py	Updated to use unified controller info and config parameters

Comments suppressed due to low confidence (1)

recipe/transfer_queue/ray_trainer.py:708

The 'validate' flag is still being set in extra_info (lines 708, 738) but the TransferQueue system no longer uses this flag to route data to separate validation clients after the refactor. This flag appears to only be used by rollout workers to determine sampling behavior. Consider documenting this usage or removing it if it's truly unused after the consolidation of train/val data systems.

                "validate": True,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

0oshowero0

LGTM

coderabbitai · 2025-11-07T08:18:57Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Unified transfer queue data system architecture by consolidating per-role (train/val) initialization into a single controller-config-based approach. Refactored method signatures across agent loop, worker, and utility modules to use unified controller_info and config parameters. Updated RPC patterns from global_step-based to partition_id-based namespacing. Updated dependency commit.

Changes

Cohort / File(s)	Change Summary
Transfer Queue Client API Consolidation `recipe/transfer_queue/agent_loop.py`, `verl/single_controller/base/worker.py`, `verl/utils/transferqueue_utils.py`	Updated `create_transferqueue_client()` method signatures across all three files from `(controller_infos, storage_infos, role)` to `(controller_info, config)`. Simplified client initialization to use single controller info and config payload. In `transferqueue_utils.py`, removed `_VAL_TRANSFER_QUEUE_CLIENT` global variable and eliminated `get_val_transferqueue_client()` public method, consolidating to single `_TRANSFER_QUEUE_CLIENT` for both train/val data flows.
Ray Trainer Data System Unification `recipe/transfer_queue/ray_trainer.py`	Introduced unified `_initialize_data_system()` method replacing separate `_initialize_train_data_system()` and `_initialize_val_data_system()` flows. Changed storage creation from `TransferQueueStorageSimpleUnit` to `SimpleStorageUnit`. Updated data access patterns from train/val-specific clients to single `self.data_system_client`. Refactored RPC calls from global_step-based to partition_id-based namespacing (e.g., `train_{step}`, `val_{step}`). Updated worker initialization to use new `(controller_info, config)` signature. Restructured config propagation via `OmegaConf.merge()` with `allow_objects=True` for `ZMQServerInfo` preservation.
Dependency Update `requirements_transferqueue.txt`	Updated TransferQueue git commit pinning from `68c04e7` to `862b74a`.

Sequence Diagram(s)

sequenceDiagram
    participant RayTrainer as RayPPOTrainer
    participant DataSys as _initialize_data_system()
    participant Controller as TransferQueueController
    participant Storage as SimpleStorageUnit
    participant Client as TransferQueueClient

    Note over RayTrainer,Client: New Unified Data System Flow

    RayTrainer->>DataSys: Initialize single data system
    DataSys->>DataSys: Compute train/val data sizes
    DataSys->>Storage: Create SimpleStorageUnit
    DataSys->>Controller: Create TransferQueueController
    DataSys->>Controller: Register storage_unit_infos
    DataSys->>RayTrainer: Return controller_info, storage_infos

    rect rgb(220, 240, 220)
    Note over RayTrainer,Client: Worker Setup (Old: multiple clients per role)
    RayTrainer->>Client: create_transferqueue_client(controller_info, config)
    Client->>Client: initialize_storage_manager(AsyncSimpleStorageManager)
    end

    rect rgb(240, 220, 220)
    Note over RayTrainer,Client: RPC Pattern Change (Old: global_step-based)
    RayTrainer->>Storage: async_put(key="train_{step}", data)
    RayTrainer->>Storage: async_put(key="val_{step}", data)
    Storage-->>RayTrainer: Partition_id-based namespacing
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25–40 minutes

recipe/transfer_queue/ray_trainer.py: Requires careful attention to the unified data system initialization logic, RPC pattern changes from global_step to partition_id-based keys, and OmegaConf config merging strategy.
verl/utils/transferqueue_utils.py: Review consolidation of train/val client branching logic into single client flow; verify initialize_storage_manager() call replaces previous multi-storage setup correctly.
Cross-file signature consistency: Verify all callers of create_transferqueue_client() across agent_loop.py, ray_trainer.py, and worker.py correctly pass the new (controller_info, config) parameters.
Dependency bump: Confirm TransferQueue commit 862b74a is compatible with API changes and no breaking changes are introduced.

Poem

🐰 One queue, unified flow,
Train and val now dance together,
No more roles split in two,
Partition keys namespace the way,
Simpler paths make carrots grow! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The PR title mentions managing multiple data partitions for Train/Val/Test, but the actual changes consolidate separate train/val data systems into a unified single data system with partition-based namespacing, which is contrary to the title's implication of multiple independent partitions.	Consider revising the title to reflect the actual refactoring, such as 'Refactor TransferQueue to unify train/val data systems with partition-based routing' or similar, which better captures the consolidation of multiple data systems into a single controller-based system.
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

0oshowero0 · 2025-11-07T08:19:54Z

@coderabbitai review

coderabbitai · 2025-11-07T08:20:16Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

recipe/transfer_queue/ray_trainer.py (1)
447-451: Document why allow_objects=True is necessary for config merge.

The code creates a temporary config with allow_objects=True to preserve ZMQServerInfo instances during the merge. While this is correct, consider adding a comment explaining why this flag is required, as it's not immediately obvious and could be accidentally removed during refactoring.

Apply this diff to add clarifying documentation:
         tq_config = OmegaConf.create({}, flags={"allow_objects": True})  # Note: Need to generate a new DictConfig
-        # with allow_objects=True to maintain ZMQServerInfo instance. Otherwise it will be flattened to dict
+        # with allow_objects=True to preserve ZMQServerInfo instances (which contain socket connection details).
+        # Without this flag, OmegaConf would flatten these objects to dicts, breaking the transfer queue client initialization.
         tq_config.controller_info = self.data_system_controller_info

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20d0f98 and 0dc3ddd.

📒 Files selected for processing (5)

recipe/transfer_queue/agent_loop.py (1 hunks)
recipe/transfer_queue/ray_trainer.py (15 hunks)
requirements_transferqueue.txt (1 hunks)
verl/single_controller/base/worker.py (1 hunks)
verl/utils/transferqueue_utils.py (3 hunks)

🔇 Additional comments (8)

recipe/transfer_queue/agent_loop.py (1)

70-72: LGTM! Signature simplified to unified controller/config pattern.

The method signature has been updated to use a single controller_info and config instead of separate controller_infos, storage_infos, and role parameters. This aligns with the unified data system architecture described in the PR.

verl/single_controller/base/worker.py (1)

134-141: LGTM! Worker-side signature updated to match unified API.

The signature change removes the role parameter and consolidates controller_infos/storage_infos into controller_info/config. The client ID now uses a simpler worker_{rank} format instead of incorporating the role.

verl/utils/transferqueue_utils.py (2)

45-52: LGTM! Unified transfer queue client initialization.

The function now creates a single AsyncTransferQueueClient and initializes its storage manager in one place, removing the complexity of managing separate train/val clients. This aligns with the unified data system architecture.

104-104: All visible callsites have been successfully updated to use the unified client.

The verification found:

No references to old get_val_transferqueue_client or get_train_transferqueue_client functions

Only _TRANSFER_QUEUE_CLIENT used in transferqueue_utils.py (line 104)

Partition-based routing is explicit in ray_trainer.py via partition_id parameters to async_get_meta(), async_put(), and async_clear()

All calls in the decorator pattern (lines 172, 174) correctly convert BatchMeta to DataProto through the unified client

However, the partition routing mechanism at line 104 relies on AsyncTransferQueueClient (imported from external transfer_queue module), which cannot be inspected in this codebase. The routing assumes the client maintains partition context from async_get_meta() through to async_get_data(), but this contract cannot be verified without access to the external dependency.

recipe/transfer_queue/ray_trainer.py (3)

414-460: LGTM! Unified data system initialization consolidates train/val partitions.

The new _initialize_data_system method creates a single TransferQueueController and shared storage units sized for combined train+val data. The use of OmegaConf.create with allow_objects=True correctly preserves ZMQServerInfo instances when merging into config.

671-678: Partition_id naming is consistent across all RPC calls.

Verification confirms all async_put, async_get_meta, async_get_data, and async_clear operations for validation data consistently use f"val_{self.global_steps - 1}", and training data consistently use f"train_{self.global_steps - 1}". The self.global_steps - 1 offset is correctly applied uniformly throughout.

1268-1270: Concern verified and resolved.

The code correctly calls async_put at line 1269 before any subsequent async_get_meta or async_get_data operations on the same partition. Specifically:

Line 1269: async_put writes data to partition_id=f"train_{self.global_steps - 1}"

Lines 1273+: async_get_meta and async_get_data read from the same partition_id

All read operations occur after the write, preventing consumer failures.

requirements_transferqueue.txt (1)

2-2: Avoid using a same-day commit as a dependency; consider waiting for a stable release.

The commit 862b74a exists in the upstream repository, but it was committed today (2025-11-07 03:01 UTC). This is far too recent to be considered stable. The commit is a bug fix ("Close zmq socket for storage manager"), but lacks the testing time required for production use. Wait for an official release or a commit with a proven track record.

[data] feat: TransferQueue - Support managing multiple data partition…

d6d12a0

…s for Train/Val/Test in controller

LLLLxmmm changed the title ~~[data] feat: TransferQueue - Support managing multiple data partition…~~ [data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller Nov 7, 2025

LLLLxmmm requested review from 0oshowero0, baymax591 and ji-huazhong and removed request for baymax591 and ji-huazhong November 7, 2025 07:47

fix codecheck

d81d3d1

0oshowero0 reviewed Nov 7, 2025

View reviewed changes

0oshowero0 requested a review from Copilot November 7, 2025 08:15

Copilot AI reviewed Nov 7, 2025

View reviewed changes

0oshowero0 approved these changes Nov 7, 2025

View reviewed changes

fix comments

0dc3ddd

coderabbitai bot reviewed Nov 7, 2025

View reviewed changes

fix comments

9713725

0oshowero0 merged commit 9f00d21 into TransferQueue:main_tq_submodule Nov 7, 2025
5 checks passed

coderabbitai bot mentioned this pull request Nov 18, 2025

Support managing multiple data partitions for Train/Val/Test in controller #48

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller#43

[data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller#43
0oshowero0 merged 4 commits intoTransferQueue:main_tq_submodulefrom
LLLLxmmm:data_partitions

LLLLxmmm commented Nov 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

0oshowero0 Nov 7, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

0oshowero0 left a comment

Uh oh!

coderabbitai bot commented Nov 7, 2025 •

edited

Loading

Review skipped

Uh oh!

0oshowero0 commented Nov 7, 2025

Uh oh!

coderabbitai bot commented Nov 7, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

LLLLxmmm commented Nov 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Summary by CodeRabbit

Uh oh!

0oshowero0 Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

0oshowero0 left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

0oshowero0 commented Nov 7, 2025

Uh oh!

coderabbitai bot commented Nov 7, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LLLLxmmm commented Nov 7, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 7, 2025 •

edited

Loading