@@ -8,7 +8,11 @@ classes: wide
88
99The AppWrapper controller is designed to enhance and extend the fault
1010tolerance capabilities provided by the controllers of its wrapped
11- resources. Throughout the execution of a workload, the AppWrapper
11+ resources. If [Autopilot](https://github.com/ibm/autopilot) is deployed on the
12+ cluster, the AppWrapper controller can automate both the injection of
13+ Node anti-affinities to avoid scheduling workloads on unhealthy Nodes
14+ and the migration of running workloads away from unhealthy Nodes.
15+ Throughout the execution of a workload, the AppWrapper
1216controller monitors both the status of the contained top-level
1317resources and the status of all Pods created by the workload. If a
1418workload is determined to be *unhealthy*, the AppWrapper controller
@@ -21,7 +25,6 @@ engineered to ensure that it will always make progress and eventually
2125succeed in completely removing all Pods and other resources created by
2226a failed workload.
2327
24-
2528```mermaid!
2629---
2730title: Overview of AppWrapper Fault Tolerance Phase Transitions
@@ -89,6 +92,8 @@ following conditions are true:
8992 number of Pods to reach the `Pending` state.
9093 + It takes longer than the `WarmupGracePeriod` for the expected
9194 number of Pods to reach the `Running` state.
95+ + A non-zero number of `Running` Pods are using resources
96+ that Autopilot has tagged as unhealthy.
9297 + A top-level resource is missing.
9398 + The status information of a batch/v1 Job or PyTorchJob indicates
9499 that it has failed.
@@ -97,7 +102,7 @@ If a workload is determined to be unhealthy by one of the first three
97102Pod-level conditions above, the AppWrapper controller first waits for
98103a `FailureGracePeriod` to allow the primary resource controller an
99104opportunity to react and return the workload to a healthy state. The
100- `FailureGracePeriod` is elided by the last two conditions because the
105+ `FailureGracePeriod` is elided for the remaining conditions because the
101106primary resource controller is not expected to take any further
102107action. If the `FailureGracePeriod` passes and the workload is still
103108unhealthy, the AppWrapper controller will *reset* the workload by
@@ -112,7 +117,8 @@ then the AppWrapper moves into a `Failed` state and its resources are deleted
112117(thus finally releasing its quota). If at any time during this retry loop,
113118an AppWrapper is suspended (ie, Kueue decides to preempt the AppWrapper),
114119the AppWrapper controller will respect this request by proceeding to delete
115- the resources.
120+ the resources. Workload resets that are initiated in response to Autopilot
121+ are subject to the `RetryLimit` but do not increment the `retryCount`.
116122
117123To support debugging `Failed` workloads, an annotation can be added to an
118124AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the
@@ -121,6 +127,13 @@ begins. Since the AppWrapper continues to consume quota during this delayed dele
121127this annotation should be used sparingly and only when interactive debugging of
122128the failed workload is being actively pursued.
123129
130+ An AppWrapper can be annotated as `autopilotExempt` to disable the
131+ injection of Autopilot Node anti-affinities into its Pods and the
132+ automatic migration of its Pods away from Nodes with Autopilot-tagged
133+ unhealthy resources. This annotation should only be used for workloads
134+ that will be closely monitored by other means to identify and recover from
135+ unhealthy Nodes in the cluster.
136+
124137All child resources for an AppWrapper that successfully completed will be automatically
125138deleted once the `SuccessTTL` has elapsed after the AppWrapper entered the `Succeeded` state.
126139
@@ -141,7 +154,35 @@ can be used to customize them.
141154| DeletionOnFailureGracePeriod | 0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
142155| ForcefulDeletionGracePeriod | 10 Minutes | workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration |
143156| SuccessTTL | 7 Days | workload.codeflare.dev.appwrapper/successTTLDuration |
157+ | AutopilotExempt | false | workload.codeflare.dev.appwrapper/autopilotExempt |
144158| GracePeriodMaximum | 24 Hours | Not Applicable |
145159
146160The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to
147161limit the potential impact of user-added annotations on overall system utilization.
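
As an illustration of how an annotation key from the table is applied, the fragment below sketches the metadata of an AppWrapper that opts out of Autopilot handling. The `apiVersion` and the `name` are assumptions made for this example, as is the use of the string `"true"` as the annotation value; only the annotation key itself is taken from the table.

```yaml
# Illustrative fragment only; apiVersion and name are assumptions, not taken from this document.
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-job   # hypothetical name
  annotations:
    # Opt out of Autopilot anti-affinity injection and automatic migration.
    # The value is assumed to be a boolean expressed as a string.
    workload.codeflare.dev.appwrapper/autopilotExempt: "true"
```

An AppWrapper annotated this way should be monitored by other means, since the controller will no longer steer its Pods away from unhealthy Nodes.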
162+
163+ The set of resources monitored by Autopilot and the associated labels that identify unhealthy
164+ resources can be customized as part of the AppWrapper operator's configuration. The default
165+ Autopilot configuration used by the controller is:
166+ ```yaml
167+ autopilot:
168+   injectAntiAffinities: true
169+   migrateImpactedWorkloads: true
170+   resourceUnhealthyConfig:
171+     nvidia.com/gpu:
172+       autopilot.ibm.com/gpuhealth: ERR
173+ ```
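
A customized configuration might look like the sketch below, which keeps anti-affinity injection but turns off migration of running workloads. It uses only the keys shown in the default configuration; that the two flags can be toggled independently is an assumption of this example, not something stated above.

```yaml
# Hypothetical override; only keys from the default configuration are used.
autopilot:
  injectAntiAffinities: true        # still keep new Pods off unhealthy Nodes
  migrateImpactedWorkloads: false   # assumed to disable only the migration of running workloads
  resourceUnhealthyConfig:
    nvidia.com/gpu:
      autopilot.ibm.com/gpuhealth: ERR
```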
174+
175+ The `resourceUnhealthyConfig` is a map from resource names to labels. For this example
176+ configuration, for exactly those Pods that have a non-zero resource request for
177+ `nvidia.com/gpu`, the AppWrapper controller will automatically inject the stanza below
178+ into the `affinity` portion of their Spec.
179+ ```yaml
180+       nodeAffinity:
181+         requiredDuringSchedulingIgnoredDuringExecution:
182+           nodeSelectorTerms:
183+           - matchExpressions:
184+             - key: autopilot.ibm.com/gpuhealth
185+               operator: NotIn
186+               values:
187+               - ERR
188+ ```