Conversation

@DNXie (Member) commented Sep 16, 2025

As discussed in PR #153

Before

trainer:
  model_name: ${model}
  learning_rate: 1e-5
  service:
    procs_per_replica: 1
    num_replicas: 1
    with_gpus: true
Trainer.options(**cfg.trainer.service).as_service(**exclude_service(cfg.trainer))
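
For context, a minimal sketch of what a helper like exclude_service presumably does under the old layout (the helper name comes from the call above; the body below is an illustrative assumption, not the repo's actual implementation):

from typing import Any, Mapping

def exclude_service(cfg: Mapping[str, Any]) -> dict[str, Any]:
    # Assumed behavior: drop the nested `service` block so that only actor-level
    # kwargs (model_name, learning_rate, ...) are forwarded to as_service().
    return {k: v for k, v in cfg.items() if k != "service"}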

After

trainer:
  model_name: ${model}
  learning_rate: 1e-5


services:
  trainer:
    procs_per_replica: 1
    num_replicas: 1
    with_gpus: true
Trainer.options(**cfg.services.trainer).as_service(**cfg.trainer)
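
A self-contained sketch of the new access pattern, assuming an OmegaConf-style config (the ${model} interpolation above suggests OmegaConf; the placeholder model value at the root is only there so the interpolation resolves):

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "model": "llama3",  # placeholder root value so ${model} resolves (assumption)
        "trainer": {"model_name": "${model}", "learning_rate": 1e-5},
        "services": {
            "trainer": {"procs_per_replica": 1, "num_replicas": 1, "with_gpus": True},
        },
    }
)

# Placement options and actor kwargs now live in separate sub-trees, so no
# exclude_service()-style filtering is needed before unpacking either one.
service_opts = OmegaConf.to_container(cfg.services.trainer, resolve=True)
trainer_kwargs = OmegaConf.to_container(cfg.trainer, resolve=True)

With the two sections kept separate, Trainer.options(**cfg.services.trainer).as_service(**cfg.trainer) can unpack each one directly.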

Test

python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml
python -m apps.vllm.main --config apps/vllm/qwen2_5_32b.yaml
python -m apps.vllm.main --config apps/vllm/deepseek_r1
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
python -m apps.grpo.main --config apps/grpo/qwen3_multinode.yaml

@meta-cla bot added the CLA Signed label Sep 16, 2025
@DNXie (Member, Author) commented Sep 16, 2025

@allenwang28 When I was testing

python -m apps.grpo.main --config apps/grpo/qwen3_multinode.yaml

I got the following error (the config specifies hosts_per_replica: 1):

KeyError: "No named resource found for `gpu.small`. Registered named resources: ['NULL', 'MISSING']"

I had to change this line from

image="test", meshes=[f"{name}:{num_hosts}:gpu.small"]

to

image="test", meshes=[f"{name}:{num_hosts}:NULL"]

to make it work.

Does this indicate a problem with my local resource allocation?

@allenwang28 (Contributor) commented:
Re your question: I would expect the multi-host config not to run, given that you're not running on the SLURM cluster.
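
A hedged sketch of the kind of guard that would avoid hand-editing that line (the gpu.small and NULL resource names come from the thread; the SLURM check and the build_meshes name are illustrative assumptions, not the repo's code):

import os

def build_meshes(name: str, num_hosts: int) -> list[str]:
    # Hypothetical helper: `gpu.small` is registered on the SLURM cluster,
    # while a local run only has NULL/MISSING registered (per the KeyError
    # above), so fall back to NULL when no SLURM job is detected.
    resource = "gpu.small" if "SLURM_JOB_ID" in os.environ else "NULL"
    return [f"{name}:{num_hosts}:{resource}"]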

@DNXie DNXie merged commit 1300215 into meta-pytorch:main Sep 16, 2025
5 checks passed
@DNXie DNXie deleted the refactor_config branch September 16, 2025 21:28