
Support Scaling Data Designer on SLURM #160

@kirit93

Description

When Data Designer is used on a SLURM-managed GPU cluster, it should be able to automatically manage model servers required to run generation and preview jobs.

What this feature should do

Automatically spin up and tear down model servers on SLURM

  • Launch model servers (e.g. via vLLM) as SLURM jobs when needed.
  • Shut them down when they are no longer in use.
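
A rough sketch of what this lifecycle piece could look like, assuming vLLM's `vllm serve` entrypoint and plain `sbatch`/`scancel`; the function names and SBATCH settings below are illustrative and not part of any existing Data Designer API:

```python
import subprocess


def submit_vllm_server(model: str, num_gpus: int, port: int = 8000) -> str:
    """Submit a vLLM OpenAI-compatible server as a SLURM job and return its job id."""
    batch_script = (
        "#!/bin/bash\n"
        "#SBATCH --job-name=dd-model-server\n"
        f"#SBATCH --gres=gpu:{num_gpus}\n"
        "#SBATCH --time=04:00:00\n"
        f"vllm serve {model} --tensor-parallel-size {num_gpus} --port {port}\n"
    )
    # sbatch reads the script from stdin; --parsable makes it print only the job id.
    result = subprocess.run(
        ["sbatch", "--parsable"],
        input=batch_script,
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout.strip()


def shutdown_server(job_id: str) -> None:
    """Tear the model server down once it is no longer in use."""
    subprocess.run(["scancel", job_id], check=True)
```

A real implementation would also need to discover which node the server landed on (e.g. via `squeue`/`scontrol`) and wait for it to become healthy before routing requests to it.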

Support interactive preview workflows

  • Allow users to interactively query models for Data Designer preview jobs.
  • Support streaming responses.
  • Keep model servers alive for the duration of an interactive session, then clean them up.
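
Because vLLM exposes an OpenAI-compatible API, an interactive preview session could stream tokens from a server launched as above. A minimal sketch; the host, port, and model name are placeholders:

```python
from openai import OpenAI

# In practice the node and port would come from the SLURM job info of the running server.
client = OpenAI(base_url="http://node0123:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Generate one example customer record."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

The session-scoped lifetime (keep the server alive while the user iterates, then cancel the SLURM job) maps naturally onto a context manager wrapped around this client.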

Support large-scale batch generation

  • Scale model servers up and down to efficiently execute Data Designer jobs.
  • Execute work within a user-defined GPU budget for the job.
  • Users explicitly specify how many GPUs they are making available to a Data Designer job.
  • Data Designer uses only those GPUs and does not require manual placement or provisioning.

Data Designer determines how to:

  • Split work across models.
  • Scale model replicas.
  • Assign GPUs to each model instance.
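
One way this planning step could work is a simple greedy packer over the user's GPU budget. This is a sketch under assumed semantics (each replica of a model needs a fixed number of GPUs), not an actual scheduler:

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    gpus_per_replica: int = 1  # per-model GPU needs are optional; default to one GPU


def plan_replicas(models: list[ModelSpec], gpu_budget: int) -> dict[str, int]:
    """Greedily add replicas round-robin until the GPU budget is exhausted."""
    replicas = {m.name: 0 for m in models}
    remaining = gpu_budget
    made_progress = True
    while made_progress:
        made_progress = False
        for m in models:
            if m.gpus_per_replica <= remaining:
                replicas[m.name] += 1
                remaining -= m.gpus_per_replica
                made_progress = True
    return replicas


# A 4-GPU-per-replica model and a 1-GPU model packed into a 12-GPU budget
# yields {"llama-70b": 2, "llama-8b": 4}, using all 12 GPUs.
print(plan_replicas([ModelSpec("llama-70b", 4), ModelSpec("llama-8b", 1)], 12))
```

Work for each model could then be sharded across its replicas, for example proportionally to replica count.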

Provide a simple user-facing configuration

Users specify:

  • Which models they want to use.
  • The total number of GPUs available to the job (and optionally per-model GPU needs).

Data Designer handles model lifecycle, scaling, and GPU utilization automatically.
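
One possible shape for that configuration, sketched as a plain dictionary; every field name here is hypothetical and not taken from the Data Designer codebase:

```python
# Hypothetical configuration shape; the field names are illustrative only and do
# not reflect the actual Data Designer API.
slurm_job_config = {
    "gpu_budget": 16,  # total GPUs this Data Designer job may use
    "models": [
        {"name": "meta-llama/Llama-3.1-70B-Instruct", "gpus_per_replica": 4},
        {"name": "meta-llama/Llama-3.1-8B-Instruct"},  # per-model GPUs are optional
    ],
}
```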

Outcome

From the user’s perspective, running Data Designer on SLURM should require no manual model orchestration. Users declare their model needs and GPU budget, and Data Designer automatically provisions, scales, and cleans up model servers within those constraints.
