
Commit 81e950f

initial import of pytorchjob-generator

1 parent ca8e541 commit 81e950f

File tree

15 files changed: +2496 −0 lines changed

README.md

Lines changed: 9 additions & 0 deletions

@@ -32,3 +32,12 @@ To learn how to setup MLBatch on a cluster and onboard teams see

## Quick Start

To learn how to use MLBatch to run workloads see [USAGE.md](USAGE.md).

### PyTorchJobs via the MLBatch Helm Chart

Properly configuring a distributed `PyTorchJob` to make effective use of the
MLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To
automate this process, we provide a Helm chart that captures best practices and
common configuration options. Using this Helm chart helps eliminate common
mistakes. Please see [pytorchjob-generator](tools/pytorchjob-generator) for
detailed usage instructions.

USAGE.md

Lines changed: 9 additions & 0 deletions

@@ -18,6 +18,15 @@ git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
```

## PyTorchJobs via the MLBatch Helm Chart

Properly configuring a distributed `PyTorchJob` to make effective use of the
MLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To
automate this process, we provide a Helm chart that captures best practices and
common configuration options. Using this Helm chart helps eliminate common
mistakes. Please see [pytorchjob-generator](tools/pytorchjob-generator) for
detailed usage instructions.

## Queues

All workloads must target a local queue in their namespace. The local queue name
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# PyTorchJob Generator

The Helm chart provided in this folder facilitates the configuration of PyTorch
jobs for submission to an OpenShift cluster implementing MLBatch.

Invocations of this chart generate a `PyTorchJob` wrapped into an `AppWrapper`
for better traceability and fault tolerance.

## Obtaining the Chart

To start with, recursively clone and enter this repository:
```sh
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
```

## Configuring the Job

Create a `settings.yaml` file with the settings for the PyTorch job, for
example:
```yaml
namespace: my-namespace    # namespace to deploy to (required)
jobName: my-job            # name of the generated AppWrapper and PyTorchJob objects (required)
queueName: default-queue   # local queue to submit to (default: default-queue)

numPods: 4                 # total pod count including master and worker pods (default: 1)
numCpusPerPod: 500m        # requested number of cpus per pod (default: 1)
numGpusPerPod: 8           # requested number of gpus per pod (default: 0)
totalMemoryPerPod: 1Gi     # requested amount of memory per pod (default: 1Gi)

priority: default-priority # default-priority (default), low-priority, or high-priority

# container image for the pods (required)
containerImage: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126

# setup commands to run in each pod (optional)
setupCommands:
- git clone https://github.com/dbarnett/python-helloworld
- cd python-helloworld

# main program to invoke via torchrun (optional)
mainProgram: helloworld.py
```

To learn more about the available settings see [chart/README.md](chart/README.md).

## Submitting the Job

To submit the PyTorch job to the cluster using the `settings.yaml` file, run:
```sh
helm template -f settings.yaml tools/pytorchjob-generator/chart | oc create -f-
```

To optionally capture the generated `AppWrapper` specification as a
`generated.yaml` file, run instead:
```sh
helm template -f settings.yaml tools/pytorchjob-generator/chart | tee generated.yaml | oc create -f-
```

To remove the PyTorch job from the cluster, delete the generated `AppWrapper`
object:
```sh
oc delete appwrapper -n my-namespace my-job
```
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
tests
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
apiVersion: v2
name: pytorchjob-generator
description: An AppWrapper generator for PyTorchJobs
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# pytorchjob-generator

An AppWrapper generator for PyTorchJobs

![Version: 0.1.0](https://img.shields.io/badge/Version-0.1.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.16.0](https://img.shields.io/badge/AppVersion-1.16.0-informational?style=flat-square)

## Overview

This file documents the variables that may be set in a user's `settings.yaml` to
customize the Jobs generated by the tool.

## Values
### Job Metadata

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| namespace | string | must be provided by user | The Kubernetes namespace in which the Job will run. |
| jobName | string | must be provided by user | Name of the Job. Will be the name of the AppWrapper and the PyTorchJob. |
| queueName | string | `"default-queue"` | Name of the local queue to which the Job will be submitted. |
| priority | string | `"default-priority"` | Priority of the Job (one of "default-priority", "low-priority", or "high-priority"). WARNING: "high-priority" jobs need to be approved (We're watching you...)! |
| customLabels | array | `nil` | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by user | Image used for creating the Job's containers (needs to have all the applications your job may need). |
| imagePullSecrets | array | `nil` | List of image pull secrets to be used for pulling container images. |
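For illustration, a hedged sketch of these metadata settings in `settings.yaml`; the entry shapes of `customLabels` and `imagePullSecrets` below are assumptions, not confirmed by this chart, so consult [values.yaml](values.yaml) for the authoritative syntax:

```yaml
namespace: my-namespace   # required
jobName: my-job           # required; names both the AppWrapper and the PyTorchJob
queueName: default-queue
priority: low-priority

# Assumed shape: list of key/value label pairs applied to all created resources
customLabels:
  - key: project-name
    value: my-project

# Assumed shape: list of secret names, as in a Kubernetes pod spec
imagePullSecrets:
  - name: my-registry-secret
```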
### Resource Requirements

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| numPods | integer | `1` | Total number of pods (i.e. master + worker pods) to be created. |
| numCpusPerPod | integer or string | `1` | Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| numGpusPerPod | integer | `0` | Number of GPUs for each pod (all GPUs per node is currently recommended for distributed training). |
| totalMemoryPerPod | string | `"1Gi"` | Total memory for each pod expressed as a ResourceQuantity (e.g. 1Gi, 200M, etc.). |
| limitCpusPerPod | integer or string | numCpusPerPod | Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| limitGpusPerPod | integer | numGpusPerPod | Limit on the number of GPUs per pod for elastic jobs. |
| limitMemoryPerPod | string | totalMemoryPerPod | Limit on the total memory per pod for elastic jobs (e.g. 1Gi, 200M, etc.). |
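For example, a sketch of an elastic job whose resource requests sit below its limits; the values are illustrative, not recommendations:

```yaml
numPods: 4                 # master + worker pods
numCpusPerPod: 500m        # CPU request per pod (ResourceQuantity)
limitCpusPerPod: 2         # elastic jobs may burst up to this limit
numGpusPerPod: 8           # all GPUs per node recommended for distributed training
totalMemoryPerPod: 64Gi    # memory request per pod
limitMemoryPerPod: 128Gi   # memory limit per pod
```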
### Workload Specification

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of the supported syntaxes. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | `nil` | Private GitHub clone support. See [values.yaml](values.yaml) for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories. |
| mainProgram | string | `nil` | Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in `setupCommands`, as this Helm template provides the necessary `torchrun` arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in `setupCommands`. If no value is provided, then only `setupCommands` are executed and `torchrun` is elided. |
| volumes | array | no volumes are mounted | List of "(name, claimName, mountPath)" volumes, backed by persistentVolumeClaims, to be mounted to the infrastructure. |
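As a sketch only: the literal-value entries below mirror Kubernetes `env` entries, but the secret-reference shape is an assumption of mine; see [values.yaml](values.yaml) for the syntaxes this chart actually supports.

```yaml
environmentVariables:
  - name: EPOCHS               # literal value
    value: "10"
  - name: HF_TOKEN             # assumed secret-reference shape
    secret:
      name: my-secret
      key: token

setupCommands:                 # clone code, download data, change directories
  - git clone https://github.com/dbarnett/python-helloworld
  - cd python-helloworld

mainProgram: helloworld.py     # invoked via torchrun; relative to the cd above
```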
### Advanced Options

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration). |
| numRoceGdr | integer | `0` | Number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). |
| topologyFileConfigMap | string | `nil` | Name of the configmap containing /var/run/nvidia-topologyd/virtualTopology.xml for the system, e.g. nvidia-topo-gdr. |
| ncclGdrEnvConfigMap | string | `nil` | Name of the configmap containing NCCL networking environment variables for the system, e.g. nccl-netwk-env-vars. |
| multiNicNetworkName | string | `nil` | Name of the multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-NIC network instance should be specified here instead of the TCP multi-NIC network instance. Existing instance names can be listed with `oc get multinicnetwork`. |
| disableSharedMemory | boolean | `false` | Control whether or not a shared memory volume is added to the PyTorchJob. |
| mountNVMe | object | `nil` | Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path. |
| initContainers | array | `nil` | List of "(name, image, command[])" entries specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes documentation on initContainers for reference. |
| autopilotHealthChecks | array | no pre-flight checks are enabled | Autopilot health checks. List of labels enabling one or more system health pre-flight checks. |
| hostIgnoreList | array | `nil` | List of host names on which the Job must not be scheduled (to avoid faulty nodes). |
| bypassCoscheduler | boolean | `false` | If true, use the default Kubernetes scheduler instead of the co-scheduler. ***Setting this to true will result in GPU fragmentation on the cluster. It should only be set to true when explicitly directed to do so by a cluster admin!*** |
| serviceAccountName | string | the default service account for the namespace will be used | Service account to be used for running the Job. |
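A hedged sketch of enabling GDR over RoCE; every name below is either an example taken from the descriptions above or an assumed placeholder, and the correct values vary by cluster, so confirm them with your cluster admin:

```yaml
roceGdrResName: nvidia.com/roce_gdr      # resource name may differ on your cluster
numRoceGdr: 2                            # >0 enables GDR over RoCE
topologyFileConfigMap: nvidia-topo-gdr   # example configmap name from the table above
ncclGdrEnvConfigMap: nccl-netwk-env-vars # example configmap name from the table above
multiNicNetworkName: my-roce-network     # assumed name; list instances with `oc get multinicnetwork`
```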
### Fault Tolerance

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| admissionGracePeriodDuration | string | `"60s"` | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| warmupGracePeriodDuration | string | `"300s"` | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| failureGracePeriodDuration | string | `"60s"` | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryPausePeriodDuration | string | `"90s"` | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryLimit | integer | `3` | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| forcefulDeletionGracePeriodDuration | string | `"600s"` | Customize the forcefulDeletionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| deletionOnFailureGracePeriodDuration | string | `"0s"` | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| restartPolicy | string | `"Never"` | Set the Kubernetes policy for restarting failed containers "in place" (without restarting the Pod). |
| terminationGracePeriodSeconds | integer | Kubernetes's default value is used | Set a non-default pod termination grace period (in seconds). |
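For instance, a sketch of a settings fragment that fails fast instead of retrying; the duration strings follow the same format as the defaults above, and the values are illustrative only:

```yaml
retryLimit: 0                               # do not retry a failed job
failureGracePeriodDuration: 30s             # declare failure sooner than the 60s default
deletionOnFailureGracePeriodDuration: 120s  # keep failed pods around briefly for debugging
```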
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{{ template "chart.header" . }}
{{ template "chart.description" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

## Overview

This file documents the variables that may be set in a user's `settings.yaml` to
customize the Jobs generated by the tool.

{{ template "chart.valuesSection" . }}
