-
I think we absolutely need support for distributed schedulers. However, I would like to make it more generic than just tying it to torchrun. There are other use cases in the stack (even for agents) where a more general notion of a job scheduler would be useful. I especially want to make sure torchrun is never run "in process" with the stack server. I will respond with a more detailed "rough" proposal soon.
-
Sorry it took me a bit to get back to this.

Proposal: Scheduler + Jobs APIs

I think we can extend our pattern of having lightweight internal APIs (KVStore, SqlStore, etc.) with swappable backends to job scheduling. But we need two APIs here:

Internal Scheduler API

Components like post-training use this to submit jobs:

```python
class Scheduler(Protocol):
    async def submit_job(self, job_id: str, job_type: JobType, ...) -> Job: ...
```

External Jobs API

Users interact with this for job management:

```python
class Jobs(Protocol):
    async def get_job(self, job_id: str) -> Job: ...
    async def cancel_job(self, job_id: str) -> None: ...
    async def list_jobs(self, filter: JobFilter, limit: int) -> list[Job]: ...
```

Backend Configurations

Following our established discriminated union pattern:

```python
class DistributedSchedulerConfig(BaseModel):
    type: Literal["distributed"] = "distributed"
    auto_detect_gpus: bool = True
    max_gpus: int | None = None
    master_port: int = 29500


class CelerySchedulerConfig(BaseModel):
    type: Literal["celery"] = "celery"
    broker_url: str = "redis://localhost:6379/0"
    result_backend: str = "redis://localhost:6379/0"
```

Integration Example

Post-training becomes much cleaner:

```python
async def supervised_fine_tune(self, ...):
    job = await self.scheduler.submit_job(
        job_id=job_uuid,
        job_type=JobType.post_training,
        command=self._build_training_command(...),
        resources={"gpus": 4, "memory": "32GB"},
    )
    return PostTrainingJob(job_uuid=job.job_id)
```

We should never run jobs in process with the stack server -- except when a developer is iterating with a "local" distro. Jobs are always submitted to the scheduler backend, which handles the actual execution in separate processes or even remote workers. What do you think?
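For concreteness, here is a minimal sketch of what a backend satisfying the "separate process" requirement could look like. It assumes the `Scheduler`-style surface proposed above; the `DistributedScheduler` class, the simplified `Job` dataclass, and the status strings are hypothetical, not existing Llama Stack code:

```python
# Hypothetical sketch, not the actual Llama Stack implementation.
import asyncio
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    job_type: str
    status: str = "scheduled"


class DistributedScheduler:
    """Runs every job as a separate OS process, never in-process with the server."""

    def __init__(self, config=None):
        self.config = config
        self._jobs: dict[str, Job] = {}
        self._procs: dict[str, asyncio.subprocess.Process] = {}

    async def submit_job(self, job_id: str, job_type: str, command: list[str], resources: dict) -> Job:
        job = Job(job_id=job_id, job_type=job_type)
        self._jobs[job_id] = job
        # The command (e.g. a torchrun invocation) executes outside the server process.
        self._procs[job_id] = await asyncio.create_subprocess_exec(*command)
        job.status = "running"
        return job

    async def get_job(self, job_id: str) -> Job:
        job = self._jobs[job_id]
        proc = self._procs.get(job_id)
        if proc is not None and proc.returncode is not None:
            job.status = "completed" if proc.returncode == 0 else "failed"
        return job

    async def cancel_job(self, job_id: str) -> None:
        proc = self._procs.get(job_id)
        if proc is not None and proc.returncode is None:
            proc.terminate()
        self._jobs[job_id].status = "cancelled"
```

A real backend would persist the job table in a KVStore/SqlStore instead of a dict and would have to tear down the whole torchrun process group on cancel, but the important property is the process boundary.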
-
Thanks for putting this together, @ashwinb. I really like the idea of separating things into an internal Scheduler API and an external Jobs API with pluggable backends. That definitely broadens this beyond just torchrun and makes it more future-proof for agents and other distributed workloads. To start this off, these are the steps I'm thinking of:
Does that sequence make sense as a starting point?
-
I think before doing it all, let's hack up how the end-to-end distributed torchrun flow feels with these APIs. Pretend you have a "submit_job" API on a Scheduler and see how it feels. Only when you are convinced that things are running end to end should you start productionizing the rest.
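A concrete version of that hack could be as small as the script below: submit one torchrun command through the sketched scheduler and poll it until it finishes. It reuses the hypothetical `DistributedScheduler` from the sketch above, and `recipes/finetune.py` is a placeholder entry point, not a real file:

```python
# Hypothetical end-to-end smoke test for the proposed job APIs.
import asyncio


async def main():
    scheduler = DistributedScheduler()  # from the sketch above; config shape TBD
    job = await scheduler.submit_job(
        job_id="sft-demo",
        job_type="post_training",
        command=[
            "torchrun",
            "--nproc_per_node=4",   # one rank per local GPU
            "--master_port=29500",
            "recipes/finetune.py",  # placeholder training script
        ],
        resources={"gpus": 4},
    )
    # Poll until the launched process exits, mimicking the external Jobs API.
    while (job := await scheduler.get_job(job.job_id)).status == "running":
        await asyncio.sleep(10)
    print(job.job_id, job.status)


asyncio.run(main())
```

If that loop ends with a successfully fine-tuned checkpoint on a multi-GPU box, the API shape is probably right; only then is it worth layering on Celery, persistence, and the rest.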
-
Description:
I'd like to propose extending the in-tree Llama Stack scheduler (`llama_stack/providers/utils/scheduler.py`) with support for multi-GPU distributed training using torchrun. This would take the form of a Distributed Job Scheduler that enables in-tree jobs to utilize all available GPUs for a single training job.
Motivation
Right now, the scheduler assumes a single-device execution model, which limits efficiency on machines with multiple GPUs. Many post-training workloads (like DPO, SFT, etc.) can benefit from parallelism or explicit GPU targeting — especially when leveraging distributed strategies like FSDP or launching multi-GPU jobs via torchrun.
Proposal
Introduce a Distributed Job Scheduler backend that enables in-tree jobs (e.g. post-training) to use all available GPUs on the host by launching them via torchrun, as sketched below.
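As a rough illustration of the core mechanic, a torchrun-based backend mainly needs to size the launch command to the GPUs it can see. The helper below is hypothetical and not part of the current scheduler.py; it only assumes the `auto_detect_gpus`/`max_gpus`/`master_port` knobs from the config discussed above:

```python
# Hypothetical helper sketching how a distributed backend could build its launch command.
import torch


def build_torchrun_command(
    script: str,
    script_args: list[str],
    max_gpus: int | None = None,
    master_port: int = 29500,
) -> list[str]:
    available = torch.cuda.device_count()  # auto-detect local GPUs
    nproc = min(available, max_gpus) if max_gpus else available
    if nproc <= 1:
        # Single GPU (or CPU-only): no distributed launcher needed.
        return ["python", script, *script_args]
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]
```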
Inspiration
This work builds on and was inspired by an earlier closed PR authored by @cdoern. I've spoken with him directly and plan to continue and generalize his efforts into a usable in-tree Distributed Job Scheduler backend.