---
layout: blog
title: "Spotlight on SIG API Machinery"
slug: sig-api-machinery-spotlight-2024
canonicalUrl: https://www.kubernetes.dev/blog/2024/08/07/sig-api-machinery-spotlight-2024
date: 2024-08-07
author: "Frederico Muñoz (SAS Institute)"
---

We recently talked with [Federico Bongiovanni](https://github.com/fedebongio) (Google) and [David
Eads](https://github.com/deads2k) (Red Hat), Chairs of SIG API Machinery, to know a bit more about
this Kubernetes Special Interest Group.

## Introductions

**Frederico (FSM): Hello, and thank you for your time. To start with, could you tell us about
yourselves and how you got involved in Kubernetes?**

**David**: I started working on
[OpenShift](https://www.redhat.com/en/technologies/cloud-computing/openshift) (the Red Hat
distribution of Kubernetes) in the fall of 2014 and got involved pretty quickly in API Machinery. My
first PRs were fixing kube-apiserver error messages, and from there I branched out to `kubectl`
(_kubeconfigs_ are my fault!), `auth` ([RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) and `*Review` APIs are ports
from OpenShift), and `apps` (_workqueues_ and _sharedinformers_, for example). Don’t tell the others,
but API Machinery is still my favorite :)

**Federico**: I was not as early in Kubernetes as David, but now it's been more than six years. At
my previous company we were starting to use Kubernetes for our own products, and when I came across
the opportunity to work directly with Kubernetes I left everything and boarded the ship (no pun
intended). I joined Google and Kubernetes in early 2018, and have been involved since.

## SIG API Machinery's scope

**FSM: It only takes a quick look at the SIG API Machinery charter to see that it has quite a
significant scope, nothing less than the Kubernetes control plane. Could you describe this scope in
your own words?**

**David**: We own the `kube-apiserver` and how to efficiently use it. On the backend, that includes
its contract with backend storage and how it allows API schema evolution over time. On the
frontend, that includes schema best practices, serialization, client patterns, and controller
patterns on top of all of it.

**Federico**: Kubernetes has a lot of different components, but the control plane has a really
critical mission: it's your communication layer with the cluster and also owns all the extensibility
mechanisms that make Kubernetes so powerful. We can't make mistakes like a regression, or an
incompatible change, because the blast radius is huge.

**FSM: Given this breadth, how do you manage the different aspects of it?**

**Federico**: We try to organize the large amount of work into smaller areas. The working groups and
subprojects are part of it. Different people on the SIG have their own areas of expertise, and if
everything fails, we are really lucky to have people like David, Joe, and Stefan who really are "all
terrain", in a way that keeps impressing me even after all these years. But on the other hand this
is the reason why we need more people to help us carry the quality and excellence of Kubernetes from
release to release.

## An evolving collaboration model

**FSM: Was the existing model always like this, or did it evolve with time - and if so, what would
you consider the main changes and the reason behind them?**

**David**: API Machinery has evolved over time, both growing and contracting in scope. When trying
to satisfy client access patterns it’s very easy to add scope, both in terms of features and in
applying them.

A good example of growing scope is the way that we identified a need to reduce memory utilization by
clients writing controllers and developed shared informers. In developing shared informers and the
controller patterns that use them (workqueues, error handling, and listers), we greatly reduced
memory utilization and eliminated many expensive lists. The downside: we grew a new set of
capabilities to support, and effectively took ownership of that area from sig-apps.

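For readers who have not written a controller before, the shape of that pattern is roughly this: a
shared informer watches the kube-apiserver once on behalf of every controller in the process, event
handlers enqueue small keys onto a workqueue, and workers read objects back through a lister backed
by the shared cache. A minimal client-go sketch (the resource, resync period, and handler logic are
all illustrative) might look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Assumes a kubeconfig in the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared factory per process: every controller reuses the same watch
	// and the same in-memory cache instead of issuing its own LISTs.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	pods := factory.Core().V1().Pods()

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	pods.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Controllers enqueue small keys, not whole objects.
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	ctx := context.Background()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// A worker drains the queue and reads state back through the shared lister.
	item, _ := queue.Get()
	defer queue.Done(item)
	if ns, name, err := cache.SplitMetaNamespaceKey(item.(string)); err == nil {
		if pod, err := pods.Lister().Pods(ns).Get(name); err == nil {
			fmt.Println("observed pod:", pod.Name)
		}
	}
}
```

This factory, workqueue, and lister trio is the same pattern that higher-level tooling wraps for
controller authors today.
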
For an example of more shared ownership: building out cooperative resource management (the goal of
server-side apply), `kubectl` expanded to take ownership of leveraging the server-side apply
capability. The transition isn’t yet complete, but [SIG
CLI](https://github.com/kubernetes/community/tree/master/sig-cli) manages that usage and owns it.

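Server-side apply is the mechanism behind that cooperation: each client (`kubectl`, a controller, a
GitOps tool) declares only the fields it cares about under its own field manager, and the
kube-apiserver merges them and tracks ownership. A rough client-go sketch (the object, namespace,
and field-manager name are illustrative, and it assumes the Deployment already exists with its other
fields owned by someone else) might look like this:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Declare only the fields this manager cares about (here, just replicas);
	// the server merges them with fields owned by other field managers
	// instead of overwriting the whole object.
	desired := appsv1ac.Deployment("web", "default").
		WithSpec(appsv1ac.DeploymentSpec().WithReplicas(3))

	applied, err := client.AppsV1().Deployments("default").Apply(
		context.TODO(),
		desired,
		metav1.ApplyOptions{FieldManager: "example-scaler"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("applied deployment:", applied.Name)
}
```

From the command line, `kubectl apply --server-side` drives the same merge logic.
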
**FSM: And for the boundary between approaches, do you have any guidelines?**

**David**: I think much depends on the impact. If the impact is local in immediate effect, we advise
other SIGs and let them move at their own pace. If the impact is global in immediate effect without
a natural incentive, we’ve found a need to press for adoption directly.

**FSM: Still on that note, SIG Architecture has an API Governance subproject: is it mostly
independent from SIG API Machinery, or are there important connection points?**

**David**: The projects have similar sounding names and carry some impacts on each other, but have
different missions and scopes. API Machinery owns the how and API Governance owns the what. API
conventions, the API approval process, and the final say on individual k8s.io APIs belong to API
Governance. API Machinery owns the REST semantics and non-API specific behaviors.

**Federico**: I really like how David put it: *"API Machinery owns the how and API Governance owns
the what"*: we don't own the actual APIs, but the actual APIs live through us.

## The challenges of Kubernetes popularity

**FSM: With the growth in Kubernetes adoption we have certainly seen increased demands on the
control plane: how is this felt and how does it influence the work of the SIG?**

**David**: It’s had a massive influence on API Machinery. Over the years we have often responded to
and many times enabled the evolutionary stages of Kubernetes. As the central orchestration hub of
nearly all capability on Kubernetes clusters, we both lead and follow the community. In broad
strokes I see a few evolution stages for API Machinery over the years, with constantly high
activity.

1. **Finding purpose**: `pre-1.0` up until `v1.3` (up to our first 1000+ nodes/namespaces) or
   so. This time was characterized by rapid change. We went through five different versions of our
   schemas and rose to meet the need. We optimized for quick, in-tree API evolution (sometimes to
   the detriment of longer term goals), and defined patterns for the first time.

2. **Scaling to meet the need**: `v1.3-1.9` (up to shared informers in controllers) or so. When we
   started trying to meet customer needs as we gained adoption, we found severe scale limitations in
   terms of CPU and memory. This was where we broadened API machinery to include access patterns, but
   were still heavily focused on in-tree types. We built the watch cache, protobuf serialization,
   and shared caches.

3. **Fostering the ecosystem**: `v1.8-1.21` (up to CRD v1) or so. This was when we designed and wrote
   CRDs (the considered replacement for third-party-resources), the immediate needs we knew were
   coming (admission webhooks), and evolution to best practices we knew we needed (API schemas).
   This enabled an explosion of early adopters willing to work very carefully within the constraints
   to enable their use-cases for servicing pods. The adoption was very fast, sometimes outpacing
   our capability, and creating new problems.

4. **Simplifying deployments**: `v1.22+`. In the relatively recent past, we’ve been responding to
   pressures of running kube clusters at scale with large numbers of sometimes-conflicting ecosystem
   projects using our extensions mechanisms. Lots of effort is now going into making platform
   extensions easier to write and safer to manage by people who don't hold PhDs in Kubernetes. This
   started with things like server-side apply and continues today with features like webhook match
   conditions and validating admission policies.

Work in API Machinery has a broad impact across the project and the ecosystem. It’s an exciting
area to work in for those able to make a significant time investment on a long time horizon.

## The road ahead

**FSM: With those different evolutionary stages in mind, what would you pinpoint as the top
priorities for the SIG at this time?**

**David:** **Reliability, efficiency, and capability** in roughly that order.

With the increased usage of our `kube-apiserver` and extension mechanisms, we find that our first
set of extension mechanisms, while fairly complete in terms of capability, carries significant risks
of potential misuse with a large blast radius. To mitigate these risks, we’re investing in
features that reduce the blast radius for accidents (webhook match conditions) and which provide
alternative mechanisms with lower risk profiles for most actions (validating admission policy).

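To give a feel for the second mechanism, a validating admission policy expresses the check as a CEL
expression evaluated in-process by the kube-apiserver, so there is no webhook round-trip to fail or
misbehave. A sketch using the `admissionregistration.k8s.io/v1` Go types (the policy name, rule, and
expression are illustrative, and a separate `ValidatingAdmissionPolicyBinding` is still needed to
put the policy into effect) could look like this:

```go
package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	fail := admissionregistrationv1.Fail

	// Require every newly created Deployment to carry a "team" label.
	// The expression runs inside the kube-apiserver, so a buggy policy
	// cannot take the cluster down the way an unavailable webhook can.
	policy := admissionregistrationv1.ValidatingAdmissionPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "require-team-label"},
		Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
			FailurePolicy: &fail,
			MatchConstraints: &admissionregistrationv1.MatchResources{
				ResourceRules: []admissionregistrationv1.NamedRuleWithOperations{{
					RuleWithOperations: admissionregistrationv1.RuleWithOperations{
						Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Create},
						Rule: admissionregistrationv1.Rule{
							APIGroups:   []string{"apps"},
							APIVersions: []string{"v1"},
							Resources:   []string{"deployments"},
						},
					},
				}},
			},
			Validations: []admissionregistrationv1.Validation{{
				Expression: "has(object.metadata.labels) && 'team' in object.metadata.labels",
				Message:    "every Deployment must have a 'team' label",
			}},
		},
	}
	fmt.Println("policy:", policy.Name)
}
```
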
At the same time, the increased usage has made us more aware of scaling limitations that we can
improve on, both server-side and client-side. Efforts here include more efficient serialization
(CBOR), reduced etcd load (consistent reads from cache), and reduced peak memory usage (streaming
lists).

And finally, the increased usage has highlighted some long-existing gaps that we’re closing. Things
like field selectors for CRDs, which the
[Batch Working Group](https://github.com/kubernetes/community/blob/master/wg-batch/README.md)
is eager to leverage and which will eventually form the basis for a new way to prevent trampoline
pod attacks from exploited nodes.

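Field selectors let a client ask the server to filter a list or watch by field values instead of
pulling everything and filtering locally; extending them to CRDs brings that efficiency to custom
resources. A hedged sketch with the dynamic client (the group/version/resource and the
`spec.queueName` field are made up, and the CRD would have to declare that field as selectable)
might look like this:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Illustrative custom resource; the CRD must mark spec.queueName selectable.
	gvr := schema.GroupVersionResource{Group: "batch.example.com", Version: "v1", Resource: "jobsets"}
	list, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{
		// The server filters the list, so the client never receives
		// (or caches) objects it does not care about.
		FieldSelector: "spec.queueName=high-priority",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("matched %d objects\n", len(list.Items))
}
```
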
## Joining the fun

**FSM: For anyone wanting to start contributing, what are your suggestions?**

**Federico**: SIG API Machinery is not an exception to the Kubernetes motto: **Chop Wood and Carry
Water**. There are multiple weekly meetings that are open to everybody, and there is always more
work to be done than people to do it.

I acknowledge that API Machinery is not easy, and the ramp-up will be steep. The bar is high,
because of the reasons we've been discussing: we carry a huge responsibility. But of course, with
passion and perseverance many people have ramped up through the years, and we hope more will come.

In terms of concrete opportunities, there is the SIG meeting every two weeks. Everyone is welcome to
attend and listen, see what the group talks about, see what's going on in this release, etc.

Also, twice a week (Tuesday and Thursday), we have the public Bug Triage, where we go through
everything new from the last meeting. We've been keeping this practice for more than 7 years
now. It's a great opportunity to volunteer to review code, fix bugs, improve documentation,
etc. On Tuesdays it's at 1 PM (PST), and Thursdays are at an EMEA-friendly time (9:30 AM PST). We
are always looking to improve, and we hope to be able to provide more concrete opportunities to
join and participate in the future.

**FSM: Excellent, thank you! Any final comments you would like to share with our readers?**

**Federico**: As I mentioned, the first steps might be hard, but the rewards are also greater.
Working on API Machinery is working on an area of huge impact (millions of users?), and your
contributions will have a direct effect on the way that Kubernetes works and the way that it's
used. For me that's enough reward and motivation!