spawn service based trainer #131
Thanks for updating this
apps/rl/main.py
Outdated
await asyncio.gather(
buffer.setup.call(),
trainer.setup.call(),
trainer = await spawn_service(
These should still be in an asyncio.gather.
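A minimal sketch of what this suggestion could look like, assuming the spawn calls are independent of each other. The `ServiceConfig` dataclass and the `spawn_service` coroutine below are hypothetical stand-ins, not the library's actual API (field names are taken from the `ServiceConfig` call in this PR's diff), and the replay buffer's config values are made up for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    # Stand-in for the real ServiceConfig; only the field names come from the PR diff.
    procs_per_replica: int
    num_replicas: int
    with_gpus: bool = False


async def spawn_service(cfg: ServiceConfig, name: str) -> str:
    # Hypothetical stub: the real spawn_service has a different signature.
    await asyncio.sleep(0.1)  # simulate startup latency
    return f"{name} x{cfg.num_replicas}"


async def main() -> list[str]:
    # Both spawns are scheduled at once and awaited together, so startup takes
    # roughly as long as the slower service rather than the sum of both.
    trainer, buffer = await asyncio.gather(
        spawn_service(
            ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=4),
            "trainer",
        ),
        # Assumed values for the replay buffer service.
        spawn_service(
            ServiceConfig(procs_per_replica=1, num_replicas=1),
            "replay_buffer",
        ),
    )
    return [trainer, buffer]


handles = asyncio.run(main())
print(handles)
```

`asyncio.gather` returns results in the order the awaitables were passed, so unpacking into `trainer, buffer` is safe.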
apps/rl/main.py
Outdated
buffer.setup.call(),
trainer.setup.call(),
trainer = await spawn_service(
ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=4),
This still could have been sourced from the config with ServiceConfig(**cfg.trainer.pop("service"))
but we don't have to change that for this PR
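A rough sketch of that pattern, assuming the parsed YAML behaves like a plain nested dict. The `ServiceConfig` dataclass here is a stand-in with the field names from the diff, the `cfg` contents and the `some_option` key are hypothetical, and dict-style access is used instead of the attribute access (`cfg.trainer`) that the project's config object provides.

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    # Stand-in for the real ServiceConfig; only the field names come from the PR diff.
    procs_per_replica: int
    num_replicas: int
    with_gpus: bool = False


# Hypothetical parsed YAML: the trainer section carries a nested "service" block.
cfg = {
    "trainer": {
        "service": {"procs_per_replica": 1, "with_gpus": True, "num_replicas": 4},
        "some_option": "value",  # placeholder for the remaining trainer settings
    }
}

# Copy first so the original config is not mutated by pop().
trainer_cfg = dict(cfg["trainer"])

# pop() removes the "service" block and returns it; ** expands it into keyword arguments.
service_cfg = ServiceConfig(**trainer_cfg.pop("service"))

print(service_cfg)
print(trainer_cfg)  # what is left can be passed on to the trainer itself
```

Popping the `service` key keeps service-level settings out of the keyword arguments that are later forwarded to the trainer.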
Will add this as part of the next integration test PR. It did not occur to me that the YAML object gets parsed into a regular Python dict. Thanks.
PTAL. Thanks!
1. The existing code uses the spawn_actor API, which is being deprecated. This diff ports the example to the new service API.
2. With the new service API, the procs/host details are part of the Service config, so I removed them from the YAML config.
3. With the new service API, we don't have to call setup explicitly; it is called by the launch_service routines.
Run output:
(forge) [[email protected] ~/forge_fork (ts_trainer)]$ python -m apps.rl.main --config apps/rl/llama3_8b.yaml
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[0] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
Services initialized....
shutting down...