Training termination for grpo/main with finite step limit #376

DNXie · 2025-10-10T18:45:13Z

Previously the GRPO loop ran indefinitely with both training and rollout tasks active.
This PR adds a cfg.training.steps limit so training stops after the specified number of steps, then cleanly terminates rollout tasks.

The logs now end with messages like these (tested with cfg.training.steps=2):

Reached training limit (2 steps). Exiting continuous_training loop.
Shutting down...
WandbBackend global_controller: Finished run
Health loop stopped gracefully.
Health loop stopped gracefully.
Health loop stopped gracefully.
... 
Shutting down provisioner..

felipemello1 · 2025-10-10T18:48:20Z

apps/grpo/main.py

                # Flush metrics every training step to WandB
                await mlogger.flush.call_one(training_step)

+        print(


Can we get the logger from src/forge/util/logging.py instead?

We don't really use the logger elsewhere in the main script, do we?

The entire main script uses print instead of a logger right now.

joecummings · 2025-10-10T18:47:58Z

apps/grpo/main.py

    async def continuous_training():
        training_step = 0
        restart_tracer = True  # Flag to control when to restart tracer
+        max_steps = cfg.trainer.training.get("steps", None)


Can you put this at the top of the main() loop? And also make it required please. The default can still be null / None, but this way it's very visible.
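One way to read the request above is "the key must be present, but its value may be null". The sketch below illustrates that with a plain mapping; `read_required_steps` is a hypothetical helper, not something from this PR, and the real config object in apps/grpo/main.py may behave differently from a dict.

```python
from typing import Any, Mapping, Optional


def read_required_steps(training_cfg: Mapping[str, Any]) -> Optional[int]:
    """Return the step limit. The key must exist, but its value may be None.

    Hypothetical helper for illustration; a plain mapping stands in for cfg.
    """
    if "steps" not in training_cfg:
        # A missing key is an error, so the setting is always visible in the config.
        raise KeyError("'steps' must be set explicitly (use null/None for no limit)")
    steps = training_cfg["steps"]
    if steps is not None and not isinstance(steps, int):
        raise TypeError(f"'steps' must be an int or None, got {type(steps).__name__}")
    return steps
```

Compared with `.get("steps", None)`, this fails loudly when the key is omitted instead of silently falling back to an unbounded run.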

felipemello1 · 2025-10-10T18:51:43Z

apps/grpo/main.py

    training_task = asyncio.create_task(continuous_training())

    try:
-        await asyncio.gather(*rollout_tasks, training_task)
+        await training_task
    except KeyboardInterrupt:
        print("Training interrupted by user")
+    finally:
+        print("Shutting down...")
+
        for rollout_task in rollout_tasks:
            rollout_task.cancel()
+        # graceful await all tasks, ignore cancellation noise
+        await asyncio.gather(*rollout_tasks, return_exceptions=True)
        training_task.cancel()


Can you explain this change a bit? Before, the gather was inside the try, so the tasks would be affected by KeyboardInterrupt. Now the gather happens after the KeyboardInterrupt is handled. Could the user end up having to send KeyboardInterrupt twice?

I think after the first KeyboardInterrupt, all the tasks are canceled and now we just gather on the canceled tasks (which should be fast to resolve) for graceful shutdown.

The previous gather(*rollout_tasks, training_task) call blocked on rollouts indefinitely, even after training_task completed (e.g., once max_steps was reached).
Since we now have a finite step limit that cleanly terminates training_task, we shouldn’t continue waiting on rollout tasks.

With this change, rollout termination is handled explicitly in the finally block.
In theory this doesn’t change how KeyboardInterrupt is handled, since an interrupt caught by the except clause still flows into finally.
However, we’ve already seen issues with KeyboardInterrupt handling not working properly (see #360 (2)); I’ll look into that separately in a follow-up PR.
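The pattern described in this thread (await only the training task, then cancel and gather the rollout tasks in finally) can be sketched in isolation. This is a minimal stand-alone illustration, not the code in apps/grpo/main.py; `rollout_loop`, `train`, and `run` are made-up names, and the sleeps stand in for real work.

```python
import asyncio


async def rollout_loop(counter: list) -> None:
    # Stand-in for a rollout task that never finishes on its own.
    while True:
        counter[0] += 1
        await asyncio.sleep(0.01)


async def train(max_steps: int) -> int:
    # Stand-in for a training loop with a finite step limit.
    step = 0
    while step < max_steps:
        await asyncio.sleep(0.01)
        step += 1
    return step


async def run(max_steps: int, num_rollouts: int = 3):
    counter = [0]
    rollout_tasks = [
        asyncio.create_task(rollout_loop(counter)) for _ in range(num_rollouts)
    ]
    training_task = asyncio.create_task(train(max_steps))
    try:
        # Wait only on training; gathering the rollouts here would block forever.
        steps_done = await training_task
    finally:
        for task in rollout_tasks:
            task.cancel()
        # return_exceptions=True turns each CancelledError into a result value
        # instead of re-raising it.
        results = await asyncio.gather(*rollout_tasks, return_exceptions=True)
    return steps_done, results
```

Running `asyncio.run(run(2))` returns once the two training steps finish, with one `CancelledError` entry per rollout task in the gathered results.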

casteryh · 2025-10-10T19:47:17Z

apps/grpo/main.py

        restart_tracer = True  # Flag to control when to restart tracer

-        while True:
+        while max_steps is None or training_step < max_steps:


I think we should make `cfg.training.steps` a required argument, with an explicit value of -1 meaning "run until interrupted".

Changed max_steps to

max_steps = cfg.trainer.training.steps or -1

and the loop condition to

while max_steps < 0 or training_step < max_steps:
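To make the semantics of these two lines concrete, here is a small sketch that isolates them into functions. `resolve_max_steps` and `should_continue` are hypothetical names introduced only for illustration. One consequence worth noting: because `or` treats 0 as falsy, a configured value of 0 resolves to -1 (unbounded) in the same way None does.

```python
from typing import Optional


def resolve_max_steps(steps: Optional[int]) -> int:
    # Mirrors `cfg.trainer.training.steps or -1`: None and 0 both become -1.
    return steps or -1


def should_continue(training_step: int, max_steps: int) -> bool:
    # Mirrors the loop condition: a negative limit means "run until interrupted".
    return max_steps < 0 or training_step < max_steps
```

With a limit of 2, `should_continue` is true for steps 0 and 1 and false at step 2; with a limit of -1 it is always true.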

allenwang28 · 2025-10-13T18:38:17Z

apps/grpo/main.py

+        # graceful await all tasks, ignore cancellation noise
+        await asyncio.gather(*rollout_tasks, return_exceptions=True)
+        # Give replicas time to drain and complete in-flight requests
+        await asyncio.sleep(1)


I think we can/should make this part more graceful. Proposal:

Continuous rollouts takes a shutdown event:

    async def continuous_rollouts(shutdown_event: asyncio.Event):
        ...
        while not shutdown_event.is_set():  # no more while True
            ...
        print("Rollout loop got shutdown event, shutting down...")

then in our finally we can do a 2-phased shutdown:

    finally:
        print("Shutting down...")
        shutdown_event.set()
        try:
            # give tasks a chance to exit gracefully
            await asyncio.wait_for(
                asyncio.gather(*rollout_tasks, return_exceptions=True),
                timeout=5,
            )
        except asyncio.TimeoutError:
            print("Forcing cancellation...")
            for t in rollout_tasks:
                t.cancel()
            await asyncio.gather(*rollout_tasks, return_exceptions=True)
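The two-phase proposal above can be exercised as a self-contained sketch. The names `shutdown` and `demo`, the `work_seconds` parameter, and the return strings are assumptions added for illustration; the sleep stands in for one rollout iteration.

```python
import asyncio


async def continuous_rollouts(shutdown_event: asyncio.Event, work_seconds: float) -> str:
    # Check the event between iterations instead of looping on `while True`.
    while not shutdown_event.is_set():
        await asyncio.sleep(work_seconds)  # stand-in for one rollout iteration
    return "exited gracefully"


async def shutdown(rollout_tasks, shutdown_event: asyncio.Event, timeout: float):
    # Phase 1: ask the loops to stop and give them `timeout` seconds to comply.
    shutdown_event.set()
    try:
        return await asyncio.wait_for(
            asyncio.gather(*rollout_tasks, return_exceptions=True),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # Phase 2: force cancellation of anything still running.
        for task in rollout_tasks:
            task.cancel()
        return await asyncio.gather(*rollout_tasks, return_exceptions=True)


async def demo(work_seconds: float, timeout: float):
    shutdown_event = asyncio.Event()
    tasks = [
        asyncio.create_task(continuous_rollouts(shutdown_event, work_seconds))
        for _ in range(2)
    ]
    await asyncio.sleep(0)  # let the tasks start their first iteration
    return await shutdown(tasks, shutdown_event, timeout)
```

When an iteration is short relative to the timeout, every task returns normally; when an iteration outlasts the timeout, the gathered results contain `CancelledError` instances instead. Note that `asyncio.wait_for` already cancels the awaited gather on timeout, so the explicit `cancel()` calls in the second phase mainly act as a safeguard.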

allenwang28

LGTM as long as it's still running correctly!

add stopping mechanism

ef013c3

DNXie requested review from allenwang28 and joecummings October 10, 2025 18:45

meta-cla bot added the CLA Signed label Oct 10, 2025

DNXie requested a review from Jack-Khuu October 10, 2025 18:45

felipemello1 reviewed Oct 10, 2025

joecummings reviewed Oct 10, 2025

felipemello1 reviewed Oct 10, 2025

solve shutdown error; move max_step to main

e966d5c

DNXie requested review from casteryh, felipemello1 and joecummings October 10, 2025 19:29

felipemello1 reviewed Oct 10, 2025

DNXie commented Oct 10, 2025

remove unnecessary check

bb711fe

casteryh reviewed Oct 10, 2025

DNXie added 2 commits October 10, 2025 12:55

<Replace this line with a title. Use 1 line only, 67 chars or less>

36cc2da

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

update replica

ea9d09c

Jack-Khuu reviewed Oct 10, 2025

fix nit

3b61452

allenwang28 reviewed Oct 13, 2025

add async event

283f305

DNXie requested a review from allenwang28 October 13, 2025 19:18

allenwang28 approved these changes Oct 13, 2025

Merge branch 'main' into grpo_step

aeef847

DNXie merged commit 06a0ae7 into meta-pytorch:main Oct 13, 2025
6 checks passed
