-
I think we absolutely need support for distributed schedulers. However, I would like to make it more generic than just tying it to torchrun. There are other use cases in the stack (even for agents) where a more general notion of a job scheduler would be useful. I especially want to make sure torchrun is never run "in process" with the stack server. I will respond with a more detailed "rough" proposal soon.
-
Sorry it took me a bit to get back to this.

Proposal: Scheduler + Jobs APIs

I think we can extend our pattern of having lightweight internal APIs (KVStore, SqlStore, etc.) with swappable backends to job scheduling. But we need two APIs here:

Internal Scheduler API

Components like post-training use this to submit jobs:

```python
class Scheduler(Protocol):
    async def submit_job(self, job_id: str, job_type: JobType, ...) -> Job: ...
```

External Jobs API

Users interact with this for job management:

```python
class Jobs(Protocol):
    async def get_job(self, job_id: str) -> Job: ...
    async def cancel_job(self, job_id: str) -> None: ...
    async def list_jobs(self, filter: JobFilter, limit: int) -> list[Job]: ...
```

Backend Configurations

Following our established discriminated union pattern:

```python
class DistributedSchedulerConfig(BaseModel):
    type: Literal["distributed"] = "distributed"
    auto_detect_gpus: bool = True
    max_gpus: int | None = None
    master_port: int = 29500


class CelerySchedulerConfig(BaseModel):
    type: Literal["celery"] = "celery"
    broker_url: str = "redis://localhost:6379/0"
    result_backend: str = "redis://localhost:6379/0"
```

Integration Example

Post-training becomes much cleaner:

```python
async def supervised_fine_tune(self, ...):
    job = await self.scheduler.submit_job(
        job_id=job_uuid,
        job_type=JobType.post_training,
        command=self._build_training_command(...),
        resources={"gpus": 4, "memory": "32GB"},
    )
    return PostTrainingJob(job_uuid=job.job_id)
```

We should never run jobs in process with the stack server -- except when a developer is iterating with a "local" distro. Jobs are always submitted to the scheduler backend, which handles the actual execution in separate processes or even remote workers. What do you think?
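For concreteness, here is a minimal sketch of what a backend satisfying the "separate process" requirement could look like. It assumes the `Scheduler`-style surface proposed above; the `DistributedScheduler` class, the simplified `Job` dataclass, and the status strings are hypothetical, not existing Llama Stack code:

```python
# Hypothetical sketch, not the actual Llama Stack implementation.
import asyncio
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    job_type: str
    status: str = "scheduled"


class DistributedScheduler:
    """Runs every job as a separate OS process, never in-process with the server."""

    def __init__(self, config=None):
        self.config = config
        self._jobs: dict[str, Job] = {}
        self._procs: dict[str, asyncio.subprocess.Process] = {}

    async def submit_job(self, job_id: str, job_type: str, command: list[str], resources: dict) -> Job:
        job = Job(job_id=job_id, job_type=job_type)
        self._jobs[job_id] = job
        # The command (e.g. a torchrun invocation) executes outside the server process.
        self._procs[job_id] = await asyncio.create_subprocess_exec(*command)
        job.status = "running"
        return job

    async def get_job(self, job_id: str) -> Job:
        job = self._jobs[job_id]
        proc = self._procs.get(job_id)
        if proc is not None and proc.returncode is not None:
            job.status = "completed" if proc.returncode == 0 else "failed"
        return job

    async def cancel_job(self, job_id: str) -> None:
        proc = self._procs.get(job_id)
        if proc is not None and proc.returncode is None:
            proc.terminate()
        self._jobs[job_id].status = "cancelled"
```

A real backend would persist the job table in a KVStore/SqlStore instead of a dict and would have to tear down the whole torchrun process group on cancel, but the important property is the process boundary.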
-
Thanks for putting this together, @ashwinb. I really like the idea of separating things into an internal Scheduler API and an external Jobs API with pluggable backends. That definitely broadens this beyond just torchrun and makes it more future-proof for agents and other distributed workloads. To start this off, these are the steps I'm thinking of:
Does that sequence make sense as a starting point?
-
I think before doing it all, let's hack up how the end-to-end distributed torchrun flow feels with these APIs. Pretend you have a "submit_job" API on a Scheduler and see how it feels. Only when you are convinced that things are running end to end should you start productionizing the rest.
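A concrete version of that hack could be as small as the script below: submit one torchrun command through the sketched scheduler and poll it until it finishes. It reuses the hypothetical `DistributedScheduler` from the sketch above, and `recipes/finetune.py` is a placeholder entry point, not a real file:

```python
# Hypothetical end-to-end smoke test for the proposed job APIs.
import asyncio


async def main():
    scheduler = DistributedScheduler()  # from the sketch above; config shape TBD
    job = await scheduler.submit_job(
        job_id="sft-demo",
        job_type="post_training",
        command=[
            "torchrun",
            "--nproc_per_node=4",   # one rank per local GPU
            "--master_port=29500",
            "recipes/finetune.py",  # placeholder training script
        ],
        resources={"gpus": 4},
    )
    # Poll until the launched process exits, mimicking the external Jobs API.
    while (job := await scheduler.get_job(job.job_id)).status == "running":
        await asyncio.sleep(10)
    print(job.job_id, job.status)


asyncio.run(main())
```

If that loop ends with a successfully fine-tuned checkpoint on a multi-GPU box, the API shape is probably right; only then is it worth layering on Celery, persistence, and the rest.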
-
Description:
I'd like to propose extending the in-tree Llama Stack scheduler (`llama_stack/providers/utils/scheduler.py`) with support for multi-GPU distributed training using torchrun. This would take the form of a Distributed Job Scheduler that enables in-tree jobs to utilize all available GPUs for a single training job.
Motivation
Right now, the scheduler assumes a single-device execution model, which limits efficiency on machines with multiple GPUs. Many post-training workloads (like DPO, SFT, etc.) can benefit from parallelism or explicit GPU targeting — especially when leveraging distributed strategies like FSDP or launching multi-GPU jobs via torchrun.
Proposal
Introduce a Distributed Job Scheduler backend that enables in-tree jobs (e.g. post-training) to use all available GPUs on the host by launching them via torchrun, as sketched below.
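As a rough illustration of the core mechanic, a torchrun-based backend mainly needs to size the launch command to the GPUs it can see. The helper below is hypothetical and not part of the current scheduler.py; it only assumes the `auto_detect_gpus`/`max_gpus`/`master_port` knobs from the config discussed above:

```python
# Hypothetical helper sketching how a distributed backend could build its launch command.
import torch


def build_torchrun_command(
    script: str,
    script_args: list[str],
    max_gpus: int | None = None,
    master_port: int = 29500,
) -> list[str]:
    available = torch.cuda.device_count()  # auto-detect local GPUs
    nproc = min(available, max_gpus) if max_gpus else available
    if nproc <= 1:
        # Single GPU (or CPU-only): no distributed launcher needed.
        return ["python", script, *script_args]
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]
```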
Inspiration
This work builds on and was inspired by an earlier closed PR authored by @cdoern. I've spoken with him directly and plan to continue and generalize his efforts into a usable in-tree Distributed Job Scheduler backend.