---
layout: blog
title: "Server Side Apply Is Great And You Should Be Using It"
date: 2022-10-20
slug: advanced-server-side-apply
---

**Author:** Daniel Smith (Google)

Server-side apply (SSA) has now been [GA for a few
releases](https://kubernetes.io/blog/2021/08/06/server-side-apply-ga/), and I
have found myself in a number of conversations recommending that people and
teams in various situations use it. So I’d like to write down some of those
reasons.

## Obvious (and not-so-obvious) benefits of SSA {#benefits}

A list of improvements and niceties you get by switching to server-side apply
from various other approaches!

* Versus client-side apply (that is, plain `kubectl apply`):
  * The system gives you conflicts when you accidentally fight with another
    actor over the value of a field!
  * When combined with `--dry-run`, there’s no chance of accidentally running a
    client-side dry run instead of a server-side dry run.
* Versus hand-rolling patches:
  * The SSA patch format is extremely natural to write, with no weird syntax.
    It’s just a regular object, but you can (and should) omit any field you
    don’t care about. (See the sketch after this list for an example.)
  * The old patch format (“strategic merge patch”) was ad-hoc and still has some
    bugs; JSON-patch and JSON merge-patch fail to handle some cases that are
    common in the Kubernetes API, namely lists with items that should be
    recursively merged based on a “name” or other identifying field.
  * There’s also now great [go-language library support](https://kubernetes.io/blog/2021/08/06/server-side-apply-ga/#using-server-side-apply-in-a-controller)
    for building apply calls programmatically!
  * You can use SSA to explicitly delete fields you don’t “own” by setting them
    to `null`, which makes it a feature-complete replacement for all of the old
    patch formats.
* Versus shelling out to kubectl:
  * You can use the **apply** API call from any language without shelling out to
    kubectl!
  * As stated above, the [Go library has dedicated mechanisms](https://kubernetes.io/blog/2021/08/06/server-side-apply-ga/#server-side-apply-support-in-client-go)
    to make this easy now.
* Versus GET-modify-PUT:
  * (This one is more complicated and you can skip it if you’ve never written a
    controller!)
  * To use GET-modify-PUT correctly, you have to handle and retry a write
    failure in the case that someone else has modified the object in any way
    between your GET and PUT. This is an “optimistic concurrency failure” when
    it happens.
  * SSA offloads this task to the server: you only have to retry if there’s a
    conflict, and the conflicts you can get are all meaningful, like when you’re
    actually trying to take a field away from another actor in the system.
  * To put it another way, if 10 actors do a GET-modify-PUT cycle at the same
    time, 9 will get an optimistic concurrency failure and have to retry, then
    8, and so on, for roughly 50 total GET-PUT attempts in the worst case
    (that’s about 0.5N^2 GET and PUT calls for N actors making simultaneous
    changes). If the actors are using SSA instead, and the changes don’t
    actually conflict over specific fields, then all the changes can go in, in
    any order. Additionally, SSA changes can often be done without a GET call at
    all. That’s only N **apply** requests for N actors, which is a drastic
    improvement!
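
To make the patch-format and “no kubectl required” points concrete, here is a
minimal sketch (not from the original post) of sending a server-side apply patch
from Go with client-go. The ConfigMap name, namespace, and the `example-manager`
field manager below are made-up values for illustration:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location; error handling is minimal.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The apply "patch" is just a regular (partial) object: only the fields
	// this manager cares about are present; other fields are left to other
	// managers.
	patch := []byte(`
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
  namespace: default
data:
  greeting: hello
`)

	// types.ApplyPatchType makes this a server-side apply rather than a
	// strategic-merge or JSON patch; FieldManager identifies the owner of the
	// fields being applied.
	_, err = client.CoreV1().ConfigMaps("default").Patch(
		context.TODO(), "example-config",
		types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: "example-manager"},
	)
	if err != nil {
		panic(err)
	}
}
```

The same request can be made from any language that can send an HTTP PATCH with
the `application/apply-patch+yaml` content type.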

## How can I use SSA?

### Users

Use `kubectl apply --server-side`! Soon we (SIG API Machinery) hope to make this
the default and remove the “client side” apply completely!

### Controller authors

There are two main categories here, but for both of them, **you should probably
_force conflicts_ when using SSA**. This is because your controller probably
doesn’t know what to do when some other entity in the system has a different
desire than your controller about a particular field. (See the [CI/CD
section](#ci-cd-systems), though!)

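For controllers written in Go, forcing conflicts is just a field on the apply
options. Here is a minimal sketch, assuming a typed clientset and a hypothetical
`my-controller` field manager name:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
)

// applyDeployment sends this controller's desired state, taking ownership of
// any fields that conflict with other managers.
func applyDeployment(ctx context.Context, client kubernetes.Interface,
	namespace string, desired *appsv1ac.DeploymentApplyConfiguration) error {
	_, err := client.AppsV1().Deployments(namespace).Apply(ctx, desired, metav1.ApplyOptions{
		FieldManager: "my-controller", // this controller's identity
		Force:        true,            // resolve conflicts in this controller's favor
	})
	return err
}
```
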
#### Controllers that use either a GET-modify-PUT sequence or a PATCH {#get-modify-put-patch-controllers}

This kind of controller GETs an object (possibly from a
[**watch**](https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes)),
modifies it, and then PUTs it back to write its changes. Sometimes it constructs
a custom PATCH, but the semantics are the same. Most existing controllers
(especially those in-tree) work like this.

If your controller is perfect, great! You don’t need to change it. But if you do
want to change it, you can take advantage of the new client library’s _extract_
workflow: that is, **get** the existing object, extract your existing desires,
make modifications, and re-**apply**. For many controllers that were computing
the smallest API changes possible, this will be a minor update to the existing
implementation.

This workflow avoids the failure mode of accidentally trying to own every field
in the object, which is what happens if you just GET the object, make changes,
and then **apply**. (Note that the server will notice you did this and reject
your change!)
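
Here is a hedged sketch of that extract workflow; the Deployment name,
namespace, and the `my-controller` field manager are illustrative assumptions,
not values from this post:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
)

// bumpReplicas gets the live object, extracts only the fields this manager
// already owns, modifies them, and re-applies the result.
func bumpReplicas(ctx context.Context, client kubernetes.Interface,
	namespace, name string, replicas int32) error {
	// GET the current object (this could also come from a watch or a cache).
	live, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Extract just the fields owned by "my-controller" into an apply configuration.
	desired, err := appsv1ac.ExtractDeployment(live, "my-controller")
	if err != nil {
		return err
	}

	// Modify only what we care about; everything else stays owned by others.
	if desired.Spec == nil {
		desired.WithSpec(appsv1ac.DeploymentSpec())
	}
	desired.Spec.WithReplicas(replicas)

	// Re-apply, forcing conflicts as recommended above for controllers.
	_, err = client.AppsV1().Deployments(namespace).Apply(ctx, desired, metav1.ApplyOptions{
		FieldManager: "my-controller",
		Force:        true,
	})
	return err
}
```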

#### Reconstructive controllers

This kind of controller wasn’t really possible prior to SSA. The idea here is to
(whenever something changes, etc.) reconstruct from scratch the fields of the
object as the controller wishes them to be, and then **apply** the change to the
server, letting it figure out the result. I now recommend that new controllers
start out this way: it’s less fiddly to say what you want an object to look like
than it is to say how you want it to change.

The client library supports this method of operation by default.
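
As a minimal sketch of that pattern (the object name, labels, and image below
are illustrative assumptions), a reconstructive controller builds the full
desired state with the apply-configuration builders every time and lets the
server work out what, if anything, actually changed:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	metav1ac "k8s.io/client-go/applyconfigurations/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcile reconstructs the desired Deployment from scratch and applies it.
func reconcile(ctx context.Context, client kubernetes.Interface) error {
	labels := map[string]string{"app": "example-app"}
	desired := appsv1ac.Deployment("example-app", "default").
		WithLabels(labels).
		WithSpec(appsv1ac.DeploymentSpec().
			WithReplicas(3).
			WithSelector(metav1ac.LabelSelector().WithMatchLabels(labels)).
			WithTemplate(corev1ac.PodTemplateSpec().
				WithLabels(labels).
				WithSpec(corev1ac.PodSpec().
					WithContainers(corev1ac.Container().
						WithName("app").
						WithImage("registry.example.com/app:v1")))))

	_, err := client.AppsV1().Deployments("default").Apply(ctx, desired, metav1.ApplyOptions{
		FieldManager: "my-controller",
		Force:        true,
	})
	return err
}
```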

The only downside is that you may end up sending unneeded **apply** requests to
the API server, even if the object already matches your controller’s desires.
This doesn’t matter if it happens once in a while, but for extremely
high-throughput controllers, it might cause a performance problem for the
cluster, specifically the API server. That said, no-op writes are not written to
storage (etcd) or broadcast to any watchers, so it’s not really that big of a
deal. If you’re worried about this anyway, today you could use the method
explained in the previous section, or you could keep doing it this way for now
and wait for an additional client-side mechanism to suppress zero-change
applies.

To get around this downside, why not GET the object and only send your **apply**
if the object needs it? Surprisingly, it doesn’t help much: a no-op **apply** is
not much more work for the API server than an extra GET, and an **apply** that
changes things is cheaper than that same **apply** with a preceding GET. Worse,
since it is a distributed system, something could change between your GET and
**apply**, invalidating your computation. Instead, you can use this optimization
on an object retrieved from a cache (as sketched below); then it legitimately
will reduce load on the system (at the cost of a delay when a change is needed
and the cache is a bit behind).
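
Here is a hedged sketch of that cache-based check, assuming an informer-backed
Deployment lister and the hypothetical `my-controller` field manager: it
compares the fields this manager already owns (extracted from the cached copy)
against the freshly reconstructed desired state, and skips the **apply** when
nothing would change.

```go
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
)

// applyIfNeeded skips the apply when the cached copy shows that the fields
// owned by this manager already match the desired state. The comparison
// assumes `desired` was built with the standard apply-configuration
// constructors, so its type and object metadata line up with what Extract
// produces.
func applyIfNeeded(ctx context.Context, client kubernetes.Interface,
	lister appslisters.DeploymentLister, namespace, name string,
	desired *appsv1ac.DeploymentApplyConfiguration) error {

	if cached, err := lister.Deployments(namespace).Get(name); err == nil {
		// Extract the fields this manager currently owns from the cached copy.
		current, extractErr := appsv1ac.ExtractDeployment(cached, "my-controller")
		if extractErr == nil && equality.Semantic.DeepEqual(current, desired) {
			return nil // already as desired; don't send a no-op apply
		}
	}

	// The cache is missing the object, or something differs: apply.
	_, err := client.AppsV1().Deployments(namespace).Apply(ctx, desired, metav1.ApplyOptions{
		FieldManager: "my-controller",
		Force:        true,
	})
	return err
}
```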

#### CI/CD systems {#ci-cd-systems}

Continuous integration (CI) and/or continuous deployment (CD) systems are a
special kind of controller that reads manifests from source control (such as a
Git repo) and automatically pushes them into the cluster. Perhaps the CI/CD
process first generates manifests from a template, then runs some tests, and
then deploys a change. Typically, users are the entities pushing changes into
source control, although that’s not necessarily always the case.

Some systems like this continuously reconcile with the cluster; others may only
operate when a change is pushed to the source control system. The following
considerations are important for both, but more so for the continuously
reconciling kind.

CI/CD systems are literally controllers, but for the purpose of **apply**, they
are more like users, and unlike other controllers, they need to pay attention to
conflicts. Reasoning:
* Abstractly, CI/CD systems can change anything, which means they could conflict
  with **any** controller out there. The recommendation that controllers force
  conflicts assumes that controllers change a limited number of things and that
  you can be reasonably sure they won’t fight with other controllers about
  those things; that’s clearly not the case for CI/CD controllers.
* Concrete example: imagine the CI/CD system wants `.spec.replicas` for some
  Deployment to be 3, because that is the value that is checked into source
  control; however, there is also a HorizontalPodAutoscaler (HPA) that targets
  the same Deployment. The HPA computes a target scale and decides that there
  should be 10 replicas. Which should win? I just said that most controllers,
  including the HPA, should ignore conflicts. The HPA has no idea whether it has
  been enabled incorrectly, and it has no convenient way of informing users of
  errors.
* The other common cause of a CI/CD system getting a conflict is probably when
  it is trying to overwrite a hot-fix (hand-rolled patch) placed there by a
  system admin, SRE, or dev-on-call. You almost certainly don’t want to override
  that automatically.
* Of course, sometimes an SRE makes an accidental change, or a dev makes an
  unauthorized change; those you do want to notice and overwrite. However, the
  CI/CD system can’t tell the difference between these last two cases.

Hopefully this convinces you that CI/CD systems need error paths: a way to
back-propagate these conflict errors to humans. In fact, they should have this
already; continuous integration systems certainly need some way to report that
tests are failing. But maybe I can also say something about how _humans_ can
deal with errors:
* Reject the hotfix: the (human) administrator of the CI/CD system observes the
  error, and manually force-applies the manifest in question. Then the CI/CD
  system will be able to apply the manifest successfully and become a co-owner.

  Optional: then the administrator applies a blank manifest (just the object
  type / namespace / name) to relinquish any fields they became a manager for.
  If this step is omitted, there's some chance the administrator will end up
  owning fields and causing an unwanted future conflict.

  Note: why an administrator? We're assuming that developers who ordinarily
  push to the CI/CD system and/or its source control system may not have
  permissions to push directly to the cluster.
* Accept the hotfix: the author of the change in question sees the conflict, and
  edits their change to accept the value running in production.
* Accept then reject: as in the accept option, but after that manifest is
  applied, and the CI/CD queue owns everything again (so no conflicts), re-apply
  the original manifest.
* I can also imagine the CI/CD system permitting you to mark a manifest as
  “force conflicts” somehow; if there’s demand for this, we could consider
  making a more standardized way to do it. A rigorous version of this, which
  would let you declare exactly which conflicts you intend to force, would
  require support from the API server; in lieu of that, you can make a second
  manifest with only that subset of fields.
* Future work: we could imagine an especially advanced CI/CD system that could
  parse `metadata.managedFields` data to see who or what it is conflicting
  with, over what fields, and decide whether or not to ignore the conflict. (A
  small sketch of reading `managedFields` follows this list.) In fact, this
  information is also presented in any conflict errors, though perhaps not in an
  easily machine-parseable format. We (SIG API Machinery) mostly didn't expect
  that people would want to take this approach, so we would love to know if in
  fact people want or need the features implied by this approach, such as the
  ability, when **apply**ing, to request that certain conflicts be overridden
  but not others.

  If this sounds like an approach you'd want to take for your own controller,
  come talk to SIG API Machinery!
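
For the “future work” idea above, here is a small sketch of what inspecting
`metadata.managedFields` looks like in Go. It only prints which manager owns
which fields; interpreting the `FieldsV1` data and deciding what to do about a
conflict is left to the hypothetical advanced CI/CD system (the Deployment name
and namespace are made up for illustration):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	deploy, err := client.AppsV1().Deployments("default").Get(
		context.TODO(), "example-app", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each entry records a manager, the operation it used (Apply or Update),
	// and a FieldsV1 blob describing exactly which fields it owns.
	for _, mf := range deploy.GetManagedFields() {
		fmt.Printf("manager=%q operation=%q\n", mf.Manager, mf.Operation)
		if mf.FieldsV1 != nil {
			fmt.Println(string(mf.FieldsV1.Raw))
		}
	}
}
```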

Happy **apply**ing!