Commit 0c40ece

fasaxc, katcosgrove, and Tim Bannister committed
Add more suggestions for avoiding deadlocks
Webhooks can cause deadlocks in several ways, expand the list to cover more subtle cases.

Co-authored-by: Kat Cosgrove <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>
1 parent 3dfa81c commit 0c40ece

content/en/docs/reference/access-authn-authz/extensible-admission-controllers.md

Lines changed: 46 additions & 11 deletions
@@ -1236,17 +1236,52 @@ object.
12361236

12371237
### Avoiding deadlocks in self-hosted webhooks
12381238

1239-
A webhook running inside the cluster might cause deadlocks for its own deployment if it is configured
1240-
to intercept resources required to start its own pods.
1241-
1242-
For example, a mutating admission webhook is configured to admit `CREATE` pod requests only if a certain label is set in the
1243-
pod (e.g. `"env": "prod"`). The webhook server runs in a deployment which doesn't set the `"env"` label.
1244-
When a node that runs the webhook server pods
1245-
becomes unhealthy, the webhook deployment will try to reschedule the pods to another node. However the requests will
1246-
get rejected by the existing webhook server since the `"env"` label is unset, and the migration cannot happen.
1247-
1248-
It is recommended to exclude the namespace where your webhook is running with a
1249-
[namespaceSelector](#matching-requests-namespaceselector).
1239+
There are several ways that webhooks can cause deadlocks, where the cluster cannot make progress in
1240+
scheduling pods:
1241+
1242+
* A webhook running inside the cluster might cause deadlocks for its own deployment if it is configured
1243+
to intercept resources required to start its own pods.
1244+
1245+
For example, a mutating admission webhook is configured to admit **create** Pod requests only if a certain label is set in the
1246+
pod (such as `env: "prod"`). However, the webhook server runs as a Deployment that doesn't set the `env` label.
1247+
When a node that runs the webhook server pods
1248+
becomes unhealthy, the webhook deployment will try to reschedule the pods to another node. However, the requests will
1249+
get rejected by the existing webhook server since the `env` label is unset, and the replacement Pod
1250+
cannot be created. Eventually, the entire Deployment for the webhook server may become unhealthy.
1251+
1252+
If you use admission webhooks to check Pods, consider excluding the namespace where your webhook
1253+
listener is running, by specifying a
1254+
[namespaceSelector](#matching-requests-namespaceselector).
1255+
1256+
* If the cluster has multiple webhooks configured (possibly from independent applications deployed on
1257+
the cluster), they can form a cycle. Webhook A must be called to process startup of webhook B's
1258+
pods and vice versa. If both webhook A and webhook B ever become unavailable at the same time (for
1259+
example, due to a cluster-wide outage or a node failure where both pods run on the same node),
1260+
a deadlock occurs because neither webhook pod can be recreated without the other already running.
1261+
1262+
One way to prevent this is to exclude webhook A's pods from being acted on by webhook B. This
1263+
allows webhook A's pods to start, which in turn allows webhook B's pods to start. If you had a
1264+
third webhook, webhook C, you'd need to exclude both webhook A's and webhook B's pods from
1265+
webhook C. This ensures that webhook A can _always_ start, which then allows webhook B's pods
1266+
to start, which in turn allows webhook C's pods to start.
1267+
1268+
If you want protection that avoids these risks, [ValidatingAdmissionPolicies](/docs/reference/access-authn-authz/validating-admission-policy/)
1269+
can
1270+
provide many protection capabilities without introducing dependency cycles.
1271+
1272+
* Admission webhooks can intercept resources used by critical cluster add-ons, such as CoreDNS,
1273+
network plugins, or storage plugins. These add-ons may be required to schedule or successfully run the
1274+
pods for a particular admission webhook on the cluster. This can cause a deadlock if both the
1275+
webhook and the critical add-on are unavailable at the same time.
1276+
1277+
You may wish to exclude cluster infrastructure namespaces from webhooks, or make sure that
1278+
the webhook does not depend on the particular add-on that it acts on. For example, running
1279+
a webhook as a host-networked pod ensures that it does not depend on a networking plugin.
1280+
1281+
If you want to ensure protection for a core add-on or its namespace,
1282+
[ValidatingAdmissionPolicies](/docs/reference/access-authn-authz/validating-admission-policy/)
1283+
can
1284+
provide many protection capabilities without any dependency on worker nodes and Pods.
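
The namespace exclusions suggested in the list above could be sketched roughly as follows. This is a minimal illustration and not part of the committed change: the configuration name, webhook name, Service name, and the `webhook-system` namespace are all hypothetical, and the sketch assumes the cluster sets the `kubernetes.io/metadata.name` label on every namespace.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-pod-policy            # hypothetical name
webhooks:
  - name: pod-policy.example.com      # hypothetical name
    # Skip requests from the webhook's own namespace and from a cluster
    # infrastructure namespace, so their Pods can start even when this
    # webhook is unavailable.
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values:
            - webhook-system          # hypothetical: where the webhook server runs
            - kube-system             # example infrastructure namespace
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
        scope: "Namespaced"
    clientConfig:
      service:
        namespace: webhook-system     # hypothetical
        name: example-webhook         # hypothetical
      # caBundle omitted in this sketch
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
```

With several webhooks on the cluster, each configuration would need a comparable exclusion for the namespaces of the webhooks that must be able to start before it.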
12501285

12511286
### Side effects
12521287
