Support Elastic PyTorch in TrainJob #2903

@andreyvelich

What you would like to be added?

As part of the Kubeflow Trainer API, we designed the ElasticPolicy API, which should allow users to run PyTorch in elastic mode: https://docs.pytorch.org/docs/stable/elastic/run.html

It would be nice if someone could drive a KEP for this and propose API changes for the next release, Trainer v2.2.

elasticPolicy:
  minNodes: 2
  maxNodes: 5
  metrics:
    - type: Resource
      resource:
        name: nvidia.com/gpu
        target:
          type: Utilization
          averageUtilization: 75
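
To make the intent of the snippet above concrete, here is a minimal Python sketch of how its fields might be interpreted. It is not the Trainer implementation: the helper names `torchrun_nnodes_flag` and `desired_nodes` are hypothetical, and how a controller would actually consume `metrics` is an open design question for the KEP. The sketch assumes only two known facts: `torchrun` accepts an elastic range as `--nnodes=MIN:MAX`, and the Kubernetes HorizontalPodAutoscaler computes `ceil(currentReplicas * currentMetric / targetMetric)` (its tolerance band is ignored here for brevity).

```python
import math

# Hypothetical in-memory form of the elasticPolicy snippet above.
elastic_policy = {
    "minNodes": 2,
    "maxNodes": 5,
    "metrics": [
        {
            "type": "Resource",
            "resource": {
                "name": "nvidia.com/gpu",
                "target": {"type": "Utilization", "averageUtilization": 75},
            },
        }
    ],
}


def torchrun_nnodes_flag(policy: dict) -> str:
    """Render the node range in the MIN:MAX form that torchrun accepts."""
    lo, hi = policy["minNodes"], policy["maxNodes"]
    if not 1 <= lo <= hi:
        raise ValueError("expected 1 <= minNodes <= maxNodes")
    return f"--nnodes={lo}:{hi}"


def desired_nodes(policy: dict, current_nodes: int, current_utilization: float) -> int:
    """HPA-style scaling decision, clamped to [minNodes, maxNodes].

    Assumption: the first metric is a Resource metric with a Utilization target.
    """
    target = policy["metrics"][0]["resource"]["target"]["averageUtilization"]
    raw = math.ceil(current_nodes * current_utilization / target)
    return max(policy["minNodes"], min(policy["maxNodes"], raw))


if __name__ == "__main__":
    print(torchrun_nnodes_flag(elastic_policy))
    # Three nodes running at 90% utilization against a 75% target.
    print(desired_nodes(elastic_policy, current_nodes=3, current_utilization=90))
```

The clamp matters in both directions: a low utilization reading should never shrink the job below `minNodes`, and a spike should never request more than `maxNodes`.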

Potentially, we should support elastic JobSet for that: kubernetes-sigs/jobset#463

cc @kubeflow/kubeflow-trainer-team

/area api
/area controller
/help

Why is this needed?

We should support elastic TrainJobs, as Training Operator V1 does with PyTorchJob.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍
