# pytorchjob-generator

An AppWrapper generator for PyTorchJobs

## Overview

This file documents the variables that may be set in a user's `settings.yaml` to
customize the Jobs generated by the tool.

## Values

### Job Metadata

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| namespace | string | must be provided by the user | The Kubernetes namespace in which the Job will run. |
| jobName | string | must be provided by the user | Name of the Job. Will be the name of the AppWrapper and the PyTorchJob. |
| queueName | string | `"default-queue"` | Name of the local queue to which the Job will be submitted. |
| priority | string | `"default-priority"` | Priority of the Job (choose from: "default-priority", "low-priority", or "high-priority"). WARNING: "high-priority" jobs need to be approved (We're watching you...)! |
| customLabels | array | `nil` | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by the user | Image used for creating the Job's containers (it needs to contain all the applications your Job may need). |
| imagePullSecrets | array | `nil` | List of image pull secrets to be used for pulling the containerImage. |

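For illustration, a `settings.yaml` fragment setting the required metadata keys might look like the following sketch; the namespace, job name, and image below are placeholders, not values shipped with the chart:

```yaml
# Required keys: namespace, jobName, and containerImage must be provided.
namespace: my-namespace                   # placeholder namespace
jobName: my-training-job                  # also names the AppWrapper and PyTorchJob
queueName: default-queue                  # optional; shown here at its default
priority: default-priority                # "high-priority" requires approval
containerImage: registry.example.com/pytorch:latest   # placeholder image
```

See [values.yaml](values.yaml) for the full set of keys and their exact syntax.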
### Resource Requirements

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| numPods | integer | `1` | Total number of pods (i.e., master + worker pods) to be created. |
| numCpusPerPod | integer or string | `1` | Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g., 500m). |
| numGpusPerPod | integer | `0` | Number of GPUs for each pod (requesting all GPUs on a node is currently recommended for distributed training). |
| totalMemoryPerPod | string | `"1Gi"` | Total memory for each pod expressed as a ResourceQuantity (e.g., 1Gi, 200M). |
| limitCpusPerPod | integer or string | numCpusPerPod | Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g., 500m). |
| limitGpusPerPod | integer | numGpusPerPod | Limit on the number of GPUs per pod for elastic jobs. |
| limitMemoryPerPod | string | totalMemoryPerPod | Limit on the total memory per pod for elastic jobs (e.g., 1Gi, 200M). |

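As a sketch, an 8-pod job requesting a full node's GPUs per pod might be configured as follows; the numbers are illustrative and should be sized to your cluster:

```yaml
numPods: 8                 # total pods: 1 master + 7 workers
numCpusPerPod: 8           # or a ResourceQuantity such as "500m"
numGpusPerPod: 8           # all GPUs per node is recommended for distributed training
totalMemoryPerPod: 128Gi   # ResourceQuantity
# The limit* keys default to the corresponding values above;
# override them only for elastic jobs.
```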
### Workload Specification

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of the supported syntax. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | `nil` | Private GitHub clone support. See [values.yaml](values.yaml) for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories. |
| mainProgram | string | `nil` | Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in `setupCommands`, as this helm template provides the necessary `torchrun` arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in `setupCommands`. If no value is provided, then only `setupCommands` are executed and `torchrun` is elided. |
| volumes | array | no volumes are mounted | List of "(name, claimName, mountPath)" tuples describing volumes, backed by persistentVolumeClaims, to be mounted to the infrastructure. |

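Putting these keys together, a hedged sketch of a workload section might look like this; the repository URL, program name, and claim name are placeholders:

```yaml
environmentVariables:
  - name: EPOCHS            # literal value; secret references are also supported
    value: "20"
setupCommands:
  - git clone https://github.com/org/repo.git   # placeholder repository
  - cd repo
mainProgram: train.py       # run by torchrun, relative to the cd above
volumes:
  - name: training-data
    claimName: my-pvc       # placeholder persistentVolumeClaim name
    mountPath: /data
```

Note that `train.py` is listed under `mainProgram`, not appended to `setupCommands`, so the template can supply the `torchrun` arguments.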
### Advanced Options

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration). |
| numRoceGdr | integer | `0` | Number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). |
| topologyFileConfigMap | string | `nil` | Name of the configmap containing /var/run/nvidia-topologyd/virtualTopology.xml for the system, e.g. nvidia-topo-gdr. |
| ncclGdrEnvConfigMap | string | `nil` | Name of the configmap containing NCCL networking environment variables for the system, e.g. nccl-netwk-env-vars. |
| multiNicNetworkName | string | `nil` | Name of the multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-NIC network instance should be specified here instead of the TCP multi-NIC network instance. Existing instance names can be listed with `oc get multinicnetwork`. |
| disableSharedMemory | boolean | `false` | Control whether or not a shared memory volume is added to the PyTorchJob. |
| mountNVMe | object | `nil` | Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path. |
| initContainers | array | `nil` | List of "(name, image, command[])" tuples specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes documentation on initContainers for reference. |
| autopilotHealthChecks | array | no pre-flight checks are enabled | Autopilot health checks. List of labels enabling one or more system health pre-flight checks. |
| hostIgnoreList | array | `nil` | List of host names on which the Job must not be scheduled (to avoid faulty nodes). |
| bypassCoscheduler | boolean | `false` | If true, use the default Kubernetes scheduler instead of the co-scheduler. ***Setting this to true will result in GPU fragmentation on the cluster. It should only be set to true when explicitly directed to do so by a cluster admin!*** |
| serviceAccountName | string | the default service account for the namespace will be used | Service account to be used for running the Job. |

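For example, GDR over RoCE might be enabled with a fragment like the one below. The configmap names are the example names given in the table above, and the network and host names are placeholders; confirm the actual values for your cluster with an admin:

```yaml
numRoceGdr: 2                              # >0 enables GDR over RoCE
topologyFileConfigMap: nvidia-topo-gdr     # example name from the table above
ncclGdrEnvConfigMap: nccl-netwk-env-vars   # example name from the table above
multiNicNetworkName: my-roce-network       # placeholder; list with `oc get multinicnetwork`
hostIgnoreList:
  - faulty-node-001                        # placeholder host name to avoid
```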
### Fault Tolerance

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| admissionGracePeriodDuration | string | `"60s"` | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| warmupGracePeriodDuration | string | `"300s"` | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| failureGracePeriodDuration | string | `"60s"` | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryPausePeriodDuration | string | `"90s"` | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryLimit | integer | `3` | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| forcefulDeletionGracePeriodDuration | string | `"600s"` | Customize the forcefulDeletionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| deletionOnFailureGracePeriodDuration | string | `"0s"` | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| restartPolicy | string | `"Never"` | Set the Kubernetes policy for restarting failed containers "in place" (without restarting the Pod). |
| terminationGracePeriodSeconds | integer | Kubernetes's default value is used | Set a non-default pod termination grace period (in seconds). |
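
The fault-tolerance keys are plain duration strings (and one integer). A fragment restating the documented defaults, so that only deviations need editing, might look like:

```yaml
# All values shown are the documented defaults.
admissionGracePeriodDuration: "60s"
warmupGracePeriodDuration: "300s"
failureGracePeriodDuration: "60s"
retryPausePeriodDuration: "90s"
retryLimit: 3
forcefulDeletionGracePeriodDuration: "600s"
deletionOnFailureGracePeriodDuration: "0s"
restartPolicy: "Never"
```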