Conversation

@DNXie (Member) commented Sep 16, 2025

As discussed in PR #153

Before

trainer:
  model_name: ${model}
  learning_rate: 1e-5
  service:
    procs_per_replica: 1
    num_replicas: 1
    with_gpus: true
Trainer.options(**cfg.trainer.service).as_service(**exclude_service(cfg.trainer))
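
For context, a minimal sketch of what a helper like exclude_service presumably does under the old layout (the helper name comes from the call above; the body below is an illustrative assumption, not the repo's actual implementation):

from typing import Any, Mapping

def exclude_service(cfg: Mapping[str, Any]) -> dict[str, Any]:
    # Assumed behavior: drop the nested `service` block so that only actor-level
    # kwargs (model_name, learning_rate, ...) are forwarded to as_service().
    return {k: v for k, v in cfg.items() if k != "service"}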

After

trainer:
  model_name: ${model}
  learning_rate: 1e-5


services:
  trainer:
    procs_per_replica: 1
    num_replicas: 1
    with_gpus: true
Trainer.options(**cfg.services.trainer).as_service(**cfg.trainer)
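
A self-contained sketch of the new access pattern, assuming an OmegaConf-style config (the ${model} interpolation above suggests OmegaConf; the placeholder model value at the root is only there so the interpolation resolves):

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "model": "llama3",  # placeholder root value so ${model} resolves (assumption)
        "trainer": {"model_name": "${model}", "learning_rate": 1e-5},
        "services": {
            "trainer": {"procs_per_replica": 1, "num_replicas": 1, "with_gpus": True},
        },
    }
)

# Placement options and actor kwargs now live in separate sub-trees, so no
# exclude_service()-style filtering is needed before unpacking either one.
service_opts = OmegaConf.to_container(cfg.services.trainer, resolve=True)
trainer_kwargs = OmegaConf.to_container(cfg.trainer, resolve=True)

With the two sections kept separate, Trainer.options(**cfg.services.trainer).as_service(**cfg.trainer) can unpack each one directly.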

Test

python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml
python -m apps.vllm.main --config apps/vllm/qwen2_5_32b.yaml
python -m apps.vllm.main --config apps/vllm/deepseek_r1
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
python -m apps.grpo.main --config apps/grpo/qwen3_multinode.yaml

@meta-cla bot added the CLA Signed label Sep 16, 2025
@DNXie (Member, Author) commented Sep 16, 2025

@allenwang28 When I was testing

python -m apps.grpo.main --config apps/grpo/qwen3_multinode.yaml

I got the following error (the config specifies hosts_per_replica: 1):

KeyError: "No named resource found for `gpu.small`. Registered named resources: ['NULL', 'MISSING']"

I had to change this line from

image="test", meshes=[f"{name}:{num_hosts}:gpu.small"]

to

image="test", meshes=[f"{name}:{num_hosts}:NULL"]

to make it work.

Does this indicate a problem with my local resource allocation?

@allenwang28 (Contributor) commented:
Re your question: I would expect the multi-host config not to run, given that you're not running on the SLURM cluster.
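
A hedged sketch of the kind of guard that would avoid hand-editing that line (the gpu.small and NULL resource names come from the thread; the SLURM check and the build_meshes name are illustrative assumptions, not the repo's code):

import os

def build_meshes(name: str, num_hosts: int) -> list[str]:
    # Hypothetical helper: `gpu.small` is registered on the SLURM cluster,
    # while a local run only has NULL/MISSING registered (per the KeyError
    # above), so fall back to NULL when no SLURM job is detected.
    resource = "gpu.small" if "SLURM_JOB_ID" in os.environ else "NULL"
    return [f"{name}:{num_hosts}:{resource}"]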

@DNXie DNXie merged commit 1300215 into meta-pytorch:main Sep 16, 2025
5 checks passed
@DNXie DNXie deleted the refactor_config branch September 16, 2025 21:28