Commit 6af673a

feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent ec82bb1 commit 6af673a

39 files changed: 124 additions and 435 deletions.

api/openapi-spec/swagger.json

Lines changed: 3 additions & 14 deletions

api/python_api/kubeflow_trainer_api/models/__init__.py

Lines changed: 0 additions & 2 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy.py

Lines changed: 2 additions & 6 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy_source.py

Lines changed: 2 additions & 6 deletions

charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 0 additions & 15 deletions
@@ -594,21 +594,6 @@ spec:
             format: int32
             type: integer
         type: object
-      numProcPerNode:
-        anyOf:
-        - type: integer
-        - type: string
-        default: auto
-        description: |-
-          numProcPerNode is the number of processes per node.
-          This value is inserted into the `--nproc-per-node` argument of the `torchrun` CLI.
-          Supported values: `auto`, `cpu`, `gpu`, or int value.
-          Defaults to `auto`.
-        x-kubernetes-int-or-string: true
-        x-kubernetes-validations:
-        - message: NumProcPerNode must be equal to auto, cpu, gpu,
-            or int value
-          rule: self > 0 || self in ['auto', 'cpu', 'gpu']
     type: object
 type: object
 x-kubernetes-validations:

charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml

Lines changed: 0 additions & 15 deletions
@@ -594,21 +594,6 @@ spec:
             format: int32
             type: integer
         type: object
-      numProcPerNode:
-        anyOf:
-        - type: integer
-        - type: string
-        default: auto
-        description: |-
-          numProcPerNode is the number of processes per node.
-          This value is inserted into the `--nproc-per-node` argument of the `torchrun` CLI.
-          Supported values: `auto`, `cpu`, `gpu`, or int value.
-          Defaults to `auto`.
-        x-kubernetes-int-or-string: true
-        x-kubernetes-validations:
-        - message: NumProcPerNode must be equal to auto, cpu, gpu,
-            or int value
-          rule: self > 0 || self in ['auto', 'cpu', 'gpu']
     type: object
 type: object
 x-kubernetes-validations:
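
For reviewers who want to see which values the removed validation rule used to admit, the following is a minimal Python sketch that mirrors the rule `self > 0 || self in ['auto', 'cpu', 'gpu']` from the two CRD hunks above. The function name `old_rule_accepts` and the constant `ALLOWED_KEYWORDS` are hypothetical and are not part of the repository; the sketch approximates the rule's intent for an int-or-string field and is not the controller's implementation.

```python
# Hypothetical mirror of the removed rule:
#   self > 0 || self in ['auto', 'cpu', 'gpu']
# Names below are illustrative only and do not exist in the repository.
ALLOWED_KEYWORDS = {"auto", "cpu", "gpu"}


def old_rule_accepts(value):
    """Return True if the removed rule would have admitted `value`."""
    # Python treats bool as a subclass of int; an int-or-string field
    # would not carry a boolean, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0  # left-hand side of the rule
    if isinstance(value, str):
        return value in ALLOWED_KEYWORDS  # right-hand side of the rule
    return False


if __name__ == "__main__":
    for candidate in ("auto", "cpu", "gpu", 4, 0, "tpu"):
        print(repr(candidate), old_rule_accepts(candidate))
```

After this commit the field no longer exists under `torch` in either runtime CRD, so none of these values can be set there.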

charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml

Lines changed: 4 additions & 8 deletions
@@ -4843,14 +4843,10 @@ spec:
   format: int32
   type: integer
 numProcPerNode:
-  anyOf:
-  - type: integer
-  - type: string
-  description: |-
-    numProcPerNode is the number of processes/workers/slots on every training node.
-    For the Torch runtime: `auto`, `cpu`, `gpu`, or int value can be set.
-    For the MPI runtime only int value can be set.
-  x-kubernetes-int-or-string: true
+  description: numProcPerNode is the number of processes/workers/slots
+    on every training node.
+  format: int32
+  type: integer
 resourcesPerNode:
   description: resourcesPerNode defines the compute resources for
     each training node.
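
Because the TrainJob field changes from int-or-string to a plain `int32` integer, existing manifests that set a keyword such as `auto` would presumably be rejected by the new schema. The sketch below shows one way to flag such manifests ahead of an upgrade. It assumes the field lives at `spec.trainer.numProcPerNode`, which this hunk does not show, and the helper names `fits_int32_schema` and `check_trainjob` are hypothetical.

```python
# Illustrative pre-upgrade check; helper names are hypothetical.
# Assumption: the field path is spec.trainer.numProcPerNode.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_int32_schema(value):
    """True if `value` satisfies `type: integer` with `format: int32`."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return INT32_MIN <= value <= INT32_MAX


def check_trainjob(trainjob):
    """Return a problem description, or None if the manifest looks fine."""
    trainer = (trainjob.get("spec") or {}).get("trainer") or {}
    if "numProcPerNode" not in trainer:
        return None  # field is optional; nothing to check
    value = trainer["numProcPerNode"]
    if fits_int32_schema(value):
        return None
    return f"spec.trainer.numProcPerNode={value!r} is not an int32 integer"


if __name__ == "__main__":
    old_style = {"spec": {"trainer": {"numProcPerNode": "auto"}}}
    new_style = {"spec": {"trainer": {"numProcPerNode": 4}}}
    print(check_trainjob(old_style))
    print(check_trainjob(new_style))
```

The check takes an already-parsed manifest as a dictionary, so it can sit behind whichever YAML loader a project already uses.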

charts/kubeflow-trainer/templates/runtimes/data-cache/torch-distributed-with-cache.yaml

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@ metadata:
 spec:
   mlPolicy:
     numNodes: 1
-    torch:
-      numProcPerNode: auto
+    torch: {}
   template:
     spec:
       replicatedJobs:

charts/kubeflow-trainer/templates/runtimes/torch-distributed.yaml

Lines changed: 1 addition & 2 deletions
@@ -25,8 +25,7 @@ metadata:
 spec:
   mlPolicy:
     numNodes: 1
-    torch:
-      numProcPerNode: auto
+    torch: {}
   template:
     spec:
       replicatedJobs:

charts/kubeflow-trainer/templates/runtimes/torchtune/llama3_2_1B.yaml

Lines changed: 1 addition & 2 deletions
@@ -25,8 +25,7 @@ metadata:
 spec:
   mlPolicy:
     numNodes: 1
-    torch:
-      numProcPerNode: auto
+    torch: {}
   template:
     spec:
       volumeClaimPolicies:
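
The three runtime templates above all receive the same edit: `numProcPerNode` is dropped from `spec.mlPolicy.torch`, leaving `torch: {}`. For custom runtimes maintained outside this repository, the same change could be applied to a parsed manifest roughly as follows. This is a sketch only; the function name `drop_torch_num_proc` is hypothetical, and the manifest is assumed to be already loaded into a Python dictionary.

```python
import copy


# Hypothetical helper, not part of the repository.
def drop_torch_num_proc(runtime):
    """Return a copy of `runtime` without spec.mlPolicy.torch.numProcPerNode."""
    migrated = copy.deepcopy(runtime)  # leave the caller's manifest untouched
    ml_policy = (migrated.get("spec") or {}).get("mlPolicy") or {}
    torch = ml_policy.get("torch")
    if isinstance(torch, dict):
        # Removing the key leaves `torch: {}`, matching the templates above.
        torch.pop("numProcPerNode", None)
    return migrated


if __name__ == "__main__":
    before = {
        "spec": {
            "mlPolicy": {"numNodes": 1, "torch": {"numProcPerNode": "auto"}},
        }
    }
    after = drop_torch_num_proc(before)
    print(after["spec"]["mlPolicy"])
```

Keeping the empty `torch: {}` mapping rather than deleting the key follows what the diff itself does in each template.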
