[Debugging] Create a stand alone trainer app #313

JenniferWang · 2025-10-04T02:45:44Z

Similar to the stand-alone vLLM app, this trainer app makes investigating trainer OOMs faster. It is especially useful for a single-node trainer, because you can run it locally, where system metrics are much easier to obtain.
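
For this kind of local investigation, here is a rough sketch of how peak GPU memory might be recorded around a single step. It is not code from this PR: `peak_gpu_memory` is a hypothetical helper, and the only PyTorch calls it assumes are `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`. It becomes a no-op when torch or CUDA is unavailable.

```python
import contextlib

try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None


@contextlib.contextmanager
def peak_gpu_memory(label, device=0):
    """Record the peak CUDA memory allocated inside the block.

    Hypothetical helper for illustration; yields a dict that is filled in
    on exit. Without torch or CUDA, "peak_gib" stays None.
    """
    has_cuda = torch is not None and torch.cuda.is_available()
    if has_cuda:
        # Reset the counter so the peak reflects only this block.
        torch.cuda.reset_peak_memory_stats(device)
    stats = {"label": label, "peak_gib": None}
    try:
        yield stats
    finally:
        if has_cuda:
            stats["peak_gib"] = torch.cuda.max_memory_allocated(device) / 2**30
            print(f"[{label}] peak allocated: {stats['peak_gib']:.2f} GiB")


# Usage: wrap whatever step is suspected of triggering the OOM.
with peak_gpu_memory("train_step") as stats:
    pass  # placeholder for the real forward/backward call
```

Because the `finally` branch runs even when the wrapped code raises, the peak is still reported if the step itself fails with an OOM.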

Test

Change the activation checkpointing config in apps/grpo/qwen3_32b.yaml to the following and reproduce the OOM:

  activation_checkpoint:
    mode: selective
    selective_ac_option: op
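
The config change above can also be pictured as a dotted-key override on the parsed YAML. The sketch below is only an illustration: `with_overrides` is a hypothetical helper, the `base` dict is a stand-in for the real file, the `"full"` baseline value is an assumption, and in the actual YAML `activation_checkpoint` may sit under a parent key.

```python
import copy


def with_overrides(config, overrides):
    """Return a deep copy of `config` with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Create intermediate sections if they are missing.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Hypothetical stand-in for the parsed config; the baseline mode is assumed.
base = {"activation_checkpoint": {"mode": "full"}}

repro = with_overrides(
    base,
    {
        "activation_checkpoint.mode": "selective",
        "activation_checkpoint.selective_ac_option": "op",
    },
)
print(repro)
```

Working on a copy keeps the baseline config untouched, so the OOM-triggering variant and the original can be compared side by side.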

The repro is blazingly fast :)

https://meta.wandb.io/jiyue/grpo-training/runs/8qe73q1b?nw=nwuserjiyue

allenwang28 · 2025-10-04T17:35:21Z

apps/trainer/main.py

+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+# Usage: python -m apps.trainer.main --config apps/grpo/qwen3_32b.yaml


nit: can we rename this to rl_trainer? We already have sft and sft_v2, so adding a separate trainer may confuse things even further lol

allenwang28

Just a file name change, but looks good to me!

In the near future, we should consider another location for this and everything else that isn't GRPO; these should probably form the basis of many of our integration tests.

JenniferWang added 2 commits October 3, 2025 21:55

initial commit

8d3afda

update

d7d89d5

JenniferWang requested a review from allenwang28 October 4, 2025 02:45

meta-cla bot added the CLA Signed label Oct 4, 2025

format

26d0cb3

JenniferWang marked this pull request as ready for review October 4, 2025 02:47

allenwang28 reviewed Oct 4, 2025

allenwang28 approved these changes Oct 4, 2025

rename

4f287ec

JenniferWang merged commit 61c9775 into main Oct 6, 2025
6 checks passed

2 participants
