KEP Automatic configuration of GPU requests for TrainJobs #3328

@VassilisVassiliadis

What you would like to be added?

I would like to add intelligent GPU assignment to Kubeflow Trainer, so that the controller can dynamically determine appropriate GPU resources (e.g., GPU count, memory, number of replicas) or even training options (e.g., batch size, tuning method, hyperparameters) based on a user-provided training configuration.

Examples of configuration options are:

  • model type
  • sequence length
  • parallelism hints
  • batch size
  • quality-of-service hints

The goal is to spare users from manually specifying options that are tricky to get right, such as resource requirements.

The TrainJob controller should allow the job to be mutated before scheduling/admission (e.g., via scheduling gates or suspension).
The mutated job should contain the final configuration (resources and other settings) so that the scheduler sees the true resource requirements at admission.
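
To make the suspend, mutate, release flow concrete, here is a minimal sketch in Python that operates on a plain dict shaped like a TrainJob. It is not the controller's implementation. The annotation prefix `hints.example.com/`, the helper names, and the token-based heuristic are all hypothetical, and the field layout (`spec.suspend`, `spec.trainer.resourcesPerNode.limits`) is assumed for illustration.

```python
import math
from copy import deepcopy

# Hypothetical annotation prefix used to carry user hints (not a Trainer API).
HINT_PREFIX = "hints.example.com/"
# Placeholder constant for the toy heuristic below (assumption).
TOKENS_PER_GPU = 16384


def read_hints(job):
    """Collect hint annotations into a dict keyed by the hint name."""
    annotations = job.get("metadata", {}).get("annotations", {})
    return {
        key[len(HINT_PREFIX):]: value
        for key, value in annotations.items()
        if key.startswith(HINT_PREFIX)
    }


def estimate_gpus(hints):
    """Toy heuristic: one GPU per TOKENS_PER_GPU tokens in a step, minimum 1."""
    tokens = int(hints.get("batch-size", 1)) * int(hints.get("sequence-length", 1))
    return max(1, math.ceil(tokens / TOKENS_PER_GPU))


def mutate_before_admission(job):
    """Return a copy of the job with GPU limits filled in and suspension lifted.

    Jobs that are not suspended are returned unchanged, so resources are
    only ever written before the scheduler sees the job.
    """
    job = deepcopy(job)
    spec = job.setdefault("spec", {})
    if not spec.get("suspend", False):
        return job
    gpus = estimate_gpus(read_hints(job))
    trainer = spec.setdefault("trainer", {})
    limits = trainer.setdefault("resourcesPerNode", {}).setdefault("limits", {})
    limits["nvidia.com/gpu"] = str(gpus)
    spec["suspend"] = False  # release the job with its final resources
    return job
```

The point of the sketch is the ordering: the resources are written while the job is still held back, and the hold is lifted in the same update, so whatever admits the job sees the final GPU request.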

Ideally, we'd like to support trying out different policies for mutating the jobs in a plugin-like fashion, because a one-size-fits-all solution that determines the "optimal" resources/configurations for every job is unlikely to exist.
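
One possible shape for such pluggable policies is a small registry of functions that map user hints to a resource decision. The sketch below is only an illustration under stated assumptions: the `Hints` and `Decision` types, the `register` decorator, the policy names, and the constants (16384 tokens per GPU, 8 GPUs per node) are all hypothetical and not part of Kubeflow Trainer.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Hints:
    """User-provided intent (hypothetical subset of the options listed above)."""
    model_type: str
    sequence_length: int
    batch_size: int


@dataclass(frozen=True)
class Decision:
    """What a policy decides for the job."""
    gpus_per_node: int
    num_nodes: int


Policy = Callable[[Hints], Decision]
_POLICIES: Dict[str, Policy] = {}


def register(name: str):
    """Decorator that adds a policy function to the registry under `name`."""
    def wrap(fn: Policy) -> Policy:
        _POLICIES[name] = fn
        return fn
    return wrap


@register("fixed")
def fixed_policy(hints: Hints) -> Decision:
    # Baseline: always one GPU on one node, regardless of hints.
    return Decision(gpus_per_node=1, num_nodes=1)


@register("token-budget")
def token_budget_policy(hints: Hints) -> Decision:
    # Toy heuristic (assumption): 16384 tokens per GPU, at most 8 GPUs per node.
    tokens = hints.batch_size * hints.sequence_length
    gpus = max(1, math.ceil(tokens / 16384))
    return Decision(gpus_per_node=min(gpus, 8), num_nodes=math.ceil(gpus / 8))


def decide(policy_name: str, hints: Hints) -> Decision:
    """Look up a policy by name and apply it to the hints."""
    try:
        policy = _POLICIES[policy_name]
    except KeyError:
        raise ValueError(f"unknown policy: {policy_name!r}") from None
    return policy(hints)
```

For example, `decide("token-budget", Hints("llm", 4096, 64))` and `decide("fixed", ...)` can be compared on the same job without touching the code that applies the decision, which is the property a plugin mechanism would need.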

Such a mechanism is consistent with the "separation of concerns" design principle of Kubeflow Trainer v2, which aims to simplify distributed training and improve cluster efficiency.
Dynamically determining resources from user intent aligns with Trainer's design for scalable, efficient training on Kubernetes.

Why is this needed?

This feature is needed because users currently must specify GPU resources manually, which can lead to suboptimal utilization and resource pressure, especially in multi-tenant environments. Additionally, it would enable systems that continuously monitor hardware capacity and utilization to decide the resources of jobs based on their configuration, priority, etc. This could increase hardware utilization and/or reduce the queue and execution time of jobs.

Love this feature?

Give it a 👍 We prioritize the features with most 👍
