Commit b10035f

Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md
Co-authored-by: John Belamaric <[email protected]>
1 parent 04188cd commit b10035f

1 file changed, +1 -1 lines changed
keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ Today it is difficult to know when a Pod is using a device that has failed or is

 Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively.

-Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing.
+Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing.

 People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly.

0 commit comments
