spawn service based trainer #131

pradeepfn · 2025-09-05T16:48:10Z

1. The existing code uses the spawn_actor API, which is being deprecated. This diff ports the example to the new service API.
2. With the new service API, procs/host details are part of the Service config, so I removed them from the YAML config.
3. With the new service API, we don't have to call setup explicitly; it is called by the launch_service routines.
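
Point 3 can be pictured with a small sketch. Everything here is a hypothetical stand-in (`Trainer`, `spawn_with_setup`), not forge's implementation; it only illustrates a spawn routine that runs `setup` itself so the call site does not have to.

```python
import asyncio


class Trainer:
    """Hypothetical actor with an async setup step."""

    def __init__(self) -> None:
        self.ready = False

    async def setup(self) -> None:
        self.ready = True


async def spawn_with_setup(actor_cls):
    """Hypothetical spawn routine: builds the actor and runs its setup before returning."""
    actor = actor_cls()
    await actor.setup()
    return actor


async def main() -> Trainer:
    # No explicit trainer.setup() call is needed at the call site.
    return await spawn_with_setup(Trainer)


if __name__ == "__main__":
    trainer = asyncio.run(main())
    print(trainer.ready)
```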

Run output:

(forge) [[email protected] ~/forge_fork (ts_trainer)]$ python -m apps.rl.main --config apps/rl/llama3_8b.yaml
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format

Aggregated Logs (2025-09-08 07:40:32) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format

Aggregated Logs (2025-09-08 07:40:33) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

Services initialized....
shutting down...

pbontrager

Thanks for updating this

pbontrager · 2025-09-08T15:17:35Z

apps/rl/main.py

-    await asyncio.gather(
-        buffer.setup.call(),
-        trainer.setup.call(),
+    trainer = await spawn_service(


These should still be in an asyncio.gather
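
A minimal sketch of this suggestion, assuming the spawn calls are independent coroutines. `spawn_service` below is a hypothetical stand-in that only sleeps, not forge's real function; the point is that `asyncio.gather` schedules both spawns concurrently instead of awaiting them one after another.

```python
import asyncio


async def spawn_service(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a service spawn call; it only simulates startup latency."""
    await asyncio.sleep(delay)
    return f"{name}-service"


async def spawn_all() -> tuple[str, str]:
    # Both coroutines are scheduled together; results come back in argument order.
    buffer, trainer = await asyncio.gather(
        spawn_service("buffer"),
        spawn_service("trainer"),
    )
    return buffer, trainer


if __name__ == "__main__":
    buffer, trainer = asyncio.run(spawn_all())
    print(buffer, trainer)
```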

pbontrager · 2025-09-08T15:18:41Z

apps/rl/main.py

-        buffer.setup.call(),
-        trainer.setup.call(),
+    trainer = await spawn_service(
+        ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=4),


This still could have been sourced from the config with ServiceConfig(**cfg.trainer.pop("service")) but we don't have to change that for this PR

Will add this as part of the next integration test PR. It did not occur to me that the YAML object gets parsed into a regular Python dict. Thanks.
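
A rough sketch of what sourcing the service settings from the config could look like, assuming the parsed YAML behaves like a plain dict. The `ServiceConfig` dataclass below is a hypothetical stand-in whose field names are copied from the diff above (defaults are assumed), and the `learning_rate` key is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    """Hypothetical stand-in; field names come from the diff in this PR, defaults are assumed."""

    procs_per_replica: int
    num_replicas: int
    with_gpus: bool = False


def split_service_config(trainer_cfg: dict) -> tuple[ServiceConfig, dict]:
    """Pop the 'service' sub-dict and return (ServiceConfig, remaining trainer kwargs)."""
    remaining = dict(trainer_cfg)  # shallow copy so the caller's dict is not mutated
    service_cfg = ServiceConfig(**remaining.pop("service"))
    return service_cfg, remaining


# Stand-in for the parsed YAML trainer section; the non-service key is illustrative only.
trainer_cfg = {
    "service": {"procs_per_replica": 1, "num_replicas": 4, "with_gpus": True},
    "learning_rate": 1e-5,
}
service_cfg, trainer_kwargs = split_service_config(trainer_cfg)
print(service_cfg)
print(trainer_kwargs)
```

Popping the `service` key keeps the remaining entries usable as plain keyword arguments for the trainer itself.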

pradeepfn · 2025-09-08T15:57:02Z

PTAL. thanks!

spawn servic based trainer

b1b3adc

meta-cla bot added the CLA Signed label Sep 5, 2025

working RLtrainer example code, after porting to service API

f15d397

pradeepfn requested a review from pbontrager September 8, 2025 14:54

pbontrager reviewed Sep 8, 2025

use asyncio.gather

4e4b279

pbontrager approved these changes Sep 8, 2025

pradeepfn merged commit 69c6b1d into meta-pytorch:main Sep 8, 2025
2 of 5 checks passed

photomz pushed a commit to photomz/forge that referenced this pull request Oct 25, 2025

spawn service based trainer (meta-pytorch#131)

0e6d4d0
