Description
What you would like to be added?
I would like to add intelligent GPU assignment to Kubeflow Trainer, enabling the controller to dynamically determine the appropriate GPU resources (e.g. GPU count, memory, number of replicas) or even training options (e.g. batch size, tuning method, hyperparameters) based on user-provided training configuration.
Examples of configuration options are:
- model type
- sequence length
- parallelism hints
- batch size
- quality-of-service hints, etc.
The goal is to spare users from manually specifying options that are hard to determine correctly, such as resource requirements.
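A sketch of what such a TrainJob could look like. The `apiVersion`, `kind`, and `runtimeRef` fields follow the Trainer v2 API, but the `trainingHints` block and its field names are purely illustrative assumptions, not part of the current API:

```yaml
# Hypothetical sketch: `trainingHints` is an assumed field, not current API.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llm-finetune
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    # Intent-level hints the controller could translate into resources:
    trainingHints:
      modelType: llama-13b
      sequenceLength: 4096
      batchSize: 8
      parallelism: auto
      qos: best-effort
    # Note: no resourcesPerNode / numNodes here -- the controller fills
    # these in before the job is admitted.
```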
The TrainJob controller should facilitate the mutation of the job before scheduling/admission (e.g., via scheduling gates or suspension).
The mutated job should contain the final configuration options (resources, other settings, etc) so that the scheduler sees the true resource requirements at admission.
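One possible shape for the gated pod, using the upstream `schedulingGates` mechanism. The gate name and the injected GPU count below are assumptions for illustration:

```yaml
# The controller injects the final resources while the pod is gated,
# then removes the gate so the scheduler admits it with true requirements.
apiVersion: v1
kind: Pod
spec:
  schedulingGates:
    - name: trainer.kubeflow.org/resource-assignment  # hypothetical gate name
  containers:
    - name: trainer
      resources:
        limits:
          nvidia.com/gpu: 4  # value chosen by the policy before the gate is lifted
```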
Ideally we'd like a way to try out different policies for mutating the jobs in some kind of plugin fashion, because a one-size-fits-all policy is unlikely to determine the optimal resources/configuration for every job.
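To make the plugin idea concrete, here is a minimal Go sketch of what a policy interface could look like. The interface, struct names, and the memory-fit heuristic (≈18 GB of GPU memory per billion parameters, 80 GB GPUs, 8 GPUs per node) are all illustrative assumptions, not an existing Trainer API:

```go
package main

import "fmt"

// TrainingHints carries the user-provided intent (all fields are assumed
// examples of what a TrainJob could expose to policies).
type TrainingHints struct {
	ModelParamsB float64 // model size in billions of parameters
	SeqLen       int
	BatchSize    int
}

// ResourceDecision is what a policy injects into the gated TrainJob.
type ResourceDecision struct {
	GPUsPerReplica int
	Replicas       int
}

// ResourcePolicy is the hypothetical plugin point: different policies can
// be swapped in without changing the controller.
type ResourcePolicy interface {
	Decide(h TrainingHints) ResourceDecision
}

// MemoryFitPolicy is an illustrative default policy: assume ~18 GB of GPU
// memory per billion parameters (mixed precision + optimizer state) and
// pack onto 80 GB GPUs, at most 8 GPUs per replica.
type MemoryFitPolicy struct{}

func (MemoryFitPolicy) Decide(h TrainingHints) ResourceDecision {
	const gbPerBParams = 18.0
	const gpuMemGB = 80.0
	gpus := int(h.ModelParamsB*gbPerBParams/gpuMemGB) + 1
	replicas := (gpus + 7) / 8 // ceil(gpus / 8)
	perReplica := gpus
	if perReplica > 8 {
		perReplica = 8
	}
	return ResourceDecision{GPUsPerReplica: perReplica, Replicas: replicas}
}

func main() {
	var p ResourcePolicy = MemoryFitPolicy{}
	d := p.Decide(TrainingHints{ModelParamsB: 13, SeqLen: 4096, BatchSize: 8})
	fmt.Println(d.GPUsPerReplica, d.Replicas)
}
```

A controller would hold a `ResourcePolicy`, call `Decide` while the job is gated or suspended, patch the resulting resources into the pod template, and then lift the gate.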
Such a mechanism is consistent with the "separation of concerns" design principle of Kubeflow Trainer v2, which aims to simplify distributed training and improve cluster efficiency.
Dynamically determining resources from user intent aligns with Trainer's design for scalable, efficient training on Kubernetes.
Why is this needed?
This feature is needed because users currently must manually specify GPU resources, which can lead to suboptimal utilization and resource pressure, especially in multi-tenant environments. Additionally, this would enable systems that continuously monitor hardware capacity and utilization to intelligently decide the resources of jobs based on their configuration, priority, etc. This could increase hardware utilization and/or reduce queue and execution times for jobs.
Love this feature?
Give it a 👍 We prioritize the features with most 👍