
Conversation

@garrett4wade
Collaborator

Description

This PR introduces a significant refactoring to support single-controller training, which provides the following benefits:

  • Centralized staleness control: Resolves straggler issues when sequence-length distributions are imbalanced
  • Infrastructure agnostic: The training script can run anywhere with connectivity to the cluster scheduler
  • Optimized data communication: Reduces data communication overhead through under-the-hood optimizations
  • Enhanced scalability: Lays the groundwork for auto-scaling of inference workloads and intelligent request scheduling (planned)

Current Limitations (Proof of Concept)

This implementation is a proof of concept with the following known limitations:

  1. Only local scheduler support is implemented
  2. Dataset and generated rollouts are not yet distributed across processes
  3. Rollout statistics are not exportable
  4. Auto-scaling and request scheduling are not yet implemented

Comparison to Previous Approaches

This PR differs from #410, #415, and #489 primarily in its approach to RPC security and request scheduling during rollout. It provides a cleaner, more efficient implementation with minimal changes to existing APIs.

Roadmap

Over the next few weeks, we will break down these changes into smaller, focused PRs and merge them incrementally into the main branch.

Related Issue

Fixes #260

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Additional Context

Discussion about implementing RolloutController: #469

@garrett4wade garrett4wade marked this pull request as draft November 4, 2025 05:36
@gemini-code-assist
Contributor

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring to support single-controller training, which provides centralized staleness control, infrastructure-agnostic execution, optimized data communication, and enhanced scalability. It includes new scheduling features, a RolloutController implementation, an abstract Scheduler API, and a LocalScheduler implementation.

Highlights

  • Single-Controller Training: Introduces a refactoring to support single-controller training, centralizing staleness control and optimizing data communication.
  • New Scheduling Features: Adds SchedulingStrategy and SchedulingSpec dataclasses to cli_args.py, enabling more flexible resource allocation and task scheduling.
  • RolloutController Implementation: Implements RolloutController to manage distributed InferenceEngine workers, including request scheduling, load balancing, and staleness control.
  • Abstract Scheduler API: Defines an abstract Scheduler API with methods for worker creation, deletion, and engine management.
  • Local Scheduler Implementation: Provides a LocalScheduler implementation for managing worker subprocesses on a single node. A hypothetical sketch of these interfaces follows this list.
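
To make the new surface area concrete, here is a minimal sketch of what the scheduling dataclasses and the abstract Scheduler API might look like. Every name, field, and signature below is an illustrative assumption, not the PR's actual code.

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SchedulingSpec:
    # Assumed per-worker resource requirements.
    cpus: int = 1
    gpus: int = 0
    mem_gb: int = 8


@dataclass
class SchedulingStrategy:
    # Assumed placement policy, e.g. "pack" or "spread".
    policy: str = "pack"


class Scheduler(ABC):
    # Abstract API covering worker creation, deletion, and engine management.

    @abstractmethod
    def create_workers(
        self, strategy: SchedulingStrategy, spec: SchedulingSpec, count: int
    ) -> list[str]:
        """Launch `count` workers and return their worker IDs."""

    @abstractmethod
    def delete_workers(self, worker_ids: list[str]) -> None:
        """Terminate the given workers."""
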
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a major and well-executed refactoring to support a single-controller architecture for training. The changes significantly improve modularity, API clarity, and testability by introducing dedicated controllers, a new scheduler API, and a more robust RPC mechanism. The new LocalScheduler and the controller implementations are comprehensive. My feedback focuses on a few areas to enhance robustness and maintainability, such as improving the safety of subprocess creation, making data merging logic less heuristic, and refining the use of magic strings for control flow.

Comment on lines +345 to +348
if "attention_mask" in first_result:
return DistributedBatchMemory.concat(
[DistributedBatchMemory.from_dict(r) for r in results]
)

Severity: medium

The logic to decide how to merge dictionaries by checking for the presence of "attention_mask" is a bit heuristic and could be brittle. If a batch-like dictionary that needs padding doesn't happen to have this key, or a non-batch dictionary does, the merging logic will be incorrect.

A more robust approach would be to have a more explicit way to identify batch data that needs special concatenation. For example, the remote method could wrap batch data in a specific container class, or return metadata indicating the type of the result.

This would make the merging logic more reliable and less dependent on convention.
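
For illustration, here is a minimal sketch of the explicit-container approach. The BatchResult wrapper and merge_results helper are hypothetical names, not code from this PR, and DistributedBatchMemory is assumed to be importable from the existing codebase.

from dataclasses import dataclass
from typing import Any

# DistributedBatchMemory is assumed importable from the project's codebase.


@dataclass
class BatchResult:
    # Hypothetical wrapper that explicitly marks batch data needing concatenation.
    data: dict[str, Any]


def merge_results(results: list[Any]):
    # Dispatch on the explicit wrapper type instead of sniffing for dict keys.
    if results and all(isinstance(r, BatchResult) for r in results):
        return DistributedBatchMemory.concat(
            [DistributedBatchMemory.from_dict(r.data) for r in results]
        )
    return results
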

Comment on lines 299 to 305
def wait_quiet(
    self, count: int, timeout: float | None = None
) -> dict[str, Any] | None:
    try:
        return self._engine.wait(count, timeout=timeout)
    except TimeoutError:
        return "NO_RESULT"

Severity: medium

The wait_quiet method's return type hint dict[str, Any] | None does not match its implementation, which can also return the string "NO_RESULT". This can lead to type checker errors and confusion.

Additionally, using a magic string like "NO_RESULT" to signal a timeout is not ideal. It makes the code harder to understand and maintain.

Consider one of these alternatives:

  1. Update the type hint to dict[str, Any] | None | Literal["NO_RESULT"] to be accurate, as suggested below.
  2. A better approach would be to avoid the magic string. Let this method re-raise the TimeoutError or a custom exception, and have the caller in RolloutController handle it. This would make the control flow clearer.
  3. If you want to avoid exceptions, return a sentinel object instead of a string, e.g., NO_RESULT = object(). This is safer than string comparisons; a brief sketch follows the suggested change below.
Suggested change

  def wait_quiet(
      self, count: int, timeout: float | None = None
- ) -> dict[str, Any] | None:
+ ) -> dict[str, Any] | None | str:
      try:
          return self._engine.wait(count, timeout=timeout)
      except TimeoutError:
          return "NO_RESULT"

Comment on lines +394 to +399
process = subprocess.Popen(
cmd,
shell=isinstance(cmd, str),
stdout=sys.stdout,
stderr=sys.stdout,
)

Severity: medium

The use of shell=True with subprocess.Popen (implicitly set because cmd is a string) can be a security risk if any part of the command string is derived from untrusted input. While the risk seems low in this context, it's best practice to avoid shell=True. The command string is constructed to use a pipe (| tee), which necessitates shell=True.

A safer approach would be to manage the output redirection in Python by reading from the subprocess's stdout/stderr and writing to both the log file and sys.stdout. This would allow you to pass the command as a list of arguments to Popen without shell=True.

For example, this could be done in a separate thread to avoid blocking:

import subprocess
import sys
import threading

# cmd_list should be a list of arguments, not a string; env and log_file are
# assumed to be defined by the surrounding code.

def _tee_output(process, log_path):
    # Mirror the child's output to both stdout and the log file.
    with open(log_path, "ab") as log_f:
        for line in iter(process.stdout.readline, b""):
            sys.stdout.buffer.write(line)
            sys.stdout.flush()
            log_f.write(line)

process = subprocess.Popen(
    cmd_list,
    env=env,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
threading.Thread(target=_tee_output, args=(process, log_file), daemon=True).start()

This is more complex but avoids the risks of shell injection and makes stream handling more explicit.
