Skip to content

Conversation

pradeepfn
Copy link
Contributor

@pradeepfn pradeepfn commented Sep 5, 2025

1\ The existing code uses spawn_actor API. this is being deprecated. This diff ports the example to new service API.
2\ With new service API, procs/host details are part of the Service config. Hence I removed from the yaml config.
3\ With new service API, we don't have to call the setup called explicity, it is being called by the launch_service routines.

Run output:

(forge) [[email protected] ~/forge_fork (ts_trainer)]$ python -m apps.rl.main --config apps/rl/llama3_8b.yaml
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format

Aggregated Logs (2025-09-08 07:40:32) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format

Aggregated Logs (2025-09-08 07:40:33) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

Aggregated Logs (2025-09-08 07:40:33) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

Aggregated Logs (2025-09-08 07:40:33) >>>
[1 similar log lines] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
<<< Aggregated Logs (2025-09-08 07:40:39) <<<

Services initialized....
shutting down...

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 5, 2025
@pradeepfn pradeepfn requested a review from pbontrager September 8, 2025 14:54
Copy link
Contributor

@pbontrager pbontrager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this

apps/rl/main.py Outdated
await asyncio.gather(
buffer.setup.call(),
trainer.setup.call(),
trainer = await spawn_service(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should still be in a asyncio.gather

apps/rl/main.py Outdated
buffer.setup.call(),
trainer.setup.call(),
trainer = await spawn_service(
ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=4),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still could have been sourced from the config with ServiceConfig(**cfg.trainer.pop("service")) but we don't have to change that for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add as part of next integration test PR. Did not occur to me that the yaml obj get parsed into a regular python dict. thanks.

@pradeepfn
Copy link
Contributor Author

PTAL. thanks!

@pradeepfn pradeepfn merged commit 69c6b1d into meta-pytorch:main Sep 8, 2025
2 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants