# pytorchjob-generator

An AppWrapper generator for PyTorchJobs

## Overview

This file documents the variables that may be set in a user's `settings.yaml` to
customize the Jobs generated by the tool.

## Values

### Job Metadata

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| namespace | string | must be provided by the user | The Kubernetes namespace in which the Job will run. |
| jobName | string | must be provided by the user | Name of the Job. Will be the name of the AppWrapper and the PyTorchJob. |
| queueName | string | `"default-queue"` | Name of the local queue to which the Job will be submitted. |
| priority | string | `"default-priority"` | Priority of the Job (choose from: "default-priority", "low-priority", or "high-priority"). WARNING: "high-priority" jobs need to be approved (We're watching you...)! |
| customLabels | array | `nil` | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by the user | Image used for creating the Job's containers (it needs to contain all the applications your Job may need). |
| imagePullSecrets | array | `nil` | List of image pull secrets to be used for pulling the containerImage. |

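For illustration, a `settings.yaml` fragment setting the required metadata keys might look like the following sketch; the namespace, job name, and image below are placeholders, not values shipped with the chart:

```yaml
# Required keys: namespace, jobName, and containerImage must be provided.
namespace: my-namespace                   # placeholder namespace
jobName: my-training-job                  # also names the AppWrapper and PyTorchJob
queueName: default-queue                  # optional; shown here at its default
priority: default-priority                # "high-priority" requires approval
containerImage: registry.example.com/pytorch:latest   # placeholder image
```

See [values.yaml](values.yaml) for the full set of keys and their exact syntax.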
### Resource Requirements

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| numPods | integer | `1` | Total number of pods (i.e., master + worker pods) to be created. |
| numCpusPerPod | integer or string | `1` | Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g., 500m). |
| numGpusPerPod | integer | `0` | Number of GPUs for each pod (requesting all GPUs on a node is currently recommended for distributed training). |
| totalMemoryPerPod | string | `"1Gi"` | Total memory for each pod expressed as a ResourceQuantity (e.g., 1Gi, 200M). |
| limitCpusPerPod | integer or string | numCpusPerPod | Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g., 500m). |
| limitGpusPerPod | integer | numGpusPerPod | Limit on the number of GPUs per pod for elastic jobs. |
| limitMemoryPerPod | string | totalMemoryPerPod | Limit on the total memory per pod for elastic jobs (e.g., 1Gi, 200M). |

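As a sketch, an 8-pod job requesting a full node's GPUs per pod might be configured as follows; the numbers are illustrative and should be sized to your cluster:

```yaml
numPods: 8                 # total pods: 1 master + 7 workers
numCpusPerPod: 8           # or a ResourceQuantity such as "500m"
numGpusPerPod: 8           # all GPUs per node is recommended for distributed training
totalMemoryPerPod: 128Gi   # ResourceQuantity
# The limit* keys default to the corresponding values above;
# override them only for elastic jobs.
```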
### Workload Specification

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of the supported syntax. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | `nil` | Private GitHub clone support. See [values.yaml](values.yaml) for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories. |
| mainProgram | string | `nil` | Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in `setupCommands`, as this helm template provides the necessary `torchrun` arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in `setupCommands`. If no value is provided, then only `setupCommands` are executed and `torchrun` is elided. |
| volumes | array | no volumes are mounted | List of "(name, claimName, mountPath)" tuples describing volumes, backed by persistentVolumeClaims, to be mounted to the infrastructure. |

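Putting these keys together, a hedged sketch of a workload section might look like this; the repository URL, program name, and claim name are placeholders:

```yaml
environmentVariables:
  - name: EPOCHS            # literal value; secret references are also supported
    value: "20"
setupCommands:
  - git clone https://github.com/org/repo.git   # placeholder repository
  - cd repo
mainProgram: train.py       # run by torchrun, relative to the cd above
volumes:
  - name: training-data
    claimName: my-pvc       # placeholder persistentVolumeClaim name
    mountPath: /data
```

Note that `train.py` is listed under `mainProgram`, not appended to `setupCommands`, so the template can supply the `torchrun` arguments.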
### Advanced Options

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration). |
| numRoceGdr | integer | `0` | Number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). |
| topologyFileConfigMap | string | `nil` | Name of the configmap containing /var/run/nvidia-topologyd/virtualTopology.xml for the system, e.g. nvidia-topo-gdr. |
| ncclGdrEnvConfigMap | string | `nil` | Name of the configmap containing NCCL networking environment variables for the system, e.g. nccl-netwk-env-vars. |
| multiNicNetworkName | string | `nil` | Name of the multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-NIC network instance should be specified here instead of the TCP multi-NIC network instance. Existing instance names can be listed with `oc get multinicnetwork`. |
| disableSharedMemory | boolean | `false` | Control whether or not a shared memory volume is added to the PyTorchJob. |
| mountNVMe | object | `nil` | Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path. |
| initContainers | array | `nil` | List of "(name, image, command[])" tuples specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes documentation on initContainers for reference. |
| autopilotHealthChecks | array | no pre-flight checks are enabled | Autopilot health checks. List of labels enabling one or more system health pre-flight checks. |
| hostIgnoreList | array | `nil` | List of host names on which the Job must not be scheduled (to avoid faulty nodes). |
| bypassCoscheduler | boolean | `false` | If true, use the default Kubernetes scheduler instead of the co-scheduler. ***Setting this to true will result in GPU fragmentation on the cluster. It should only be set to true when explicitly directed to do so by a cluster admin!*** |
| serviceAccountName | string | the default service account for the namespace will be used | Service account to be used for running the Job. |

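For example, GDR over RoCE might be enabled with a fragment like the one below. The configmap names are the example names given in the table above, and the network and host names are placeholders; confirm the actual values for your cluster with an admin:

```yaml
numRoceGdr: 2                              # >0 enables GDR over RoCE
topologyFileConfigMap: nvidia-topo-gdr     # example name from the table above
ncclGdrEnvConfigMap: nccl-netwk-env-vars   # example name from the table above
multiNicNetworkName: my-roce-network       # placeholder; list with `oc get multinicnetwork`
hostIgnoreList:
  - faulty-node-001                        # placeholder host name to avoid
```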
### Fault Tolerance

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| admissionGracePeriodDuration | string | `"60s"` | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| warmupGracePeriodDuration | string | `"300s"` | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| failureGracePeriodDuration | string | `"60s"` | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryPausePeriodDuration | string | `"90s"` | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryLimit | integer | `3` | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| forcefulDeletionGracePeriodDuration | string | `"600s"` | Customize the forcefulDeletionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| deletionOnFailureGracePeriodDuration | string | `"0s"` | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| restartPolicy | string | `"Never"` | Set the Kubernetes policy for restarting failed containers "in place" (without restarting the Pod). |
| terminationGracePeriodSeconds | integer | Kubernetes's default value is used | Set a non-default pod termination grace period (in seconds). |
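
The fault-tolerance keys are plain duration strings (and one integer). A fragment restating the documented defaults, so that only deviations need editing, might look like:

```yaml
# All values shown are the documented defaults.
admissionGracePeriodDuration: "60s"
warmupGracePeriodDuration: "300s"
failureGracePeriodDuration: "60s"
retryPausePeriodDuration: "90s"
retryLimit: 3
forcefulDeletionGracePeriodDuration: "600s"
deletionOnFailureGracePeriodDuration: "0s"
restartPolicy: "Never"
```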