---
layout: blog
title: "Spotlight on SIG API Machinery"
slug: sig-api-machinery-spotlight-2024
canonicalUrl: https://www.kubernetes.dev/blog/2024/08/07/sig-api-machinery-spotlight-2024
date: 2024-08-07
author: "Frederico Muñoz (SAS Institute)"
---

We recently talked with [Federico Bongiovanni](https://github.com/fedebongio) (Google) and [David Eads](https://github.com/deads2k) (Red Hat), Chairs of SIG API Machinery, to learn a bit more about this Kubernetes Special Interest Group.

## Introductions

**Frederico (FSM): Hello, and thank you for your time. To start with, could you tell us about yourselves and how you got involved in Kubernetes?**

**David**: I started working on [OpenShift](https://www.redhat.com/en/technologies/cloud-computing/openshift) (the Red Hat distribution of Kubernetes) in the fall of 2014 and got involved pretty quickly in API Machinery. My first PRs fixed kube-apiserver error messages, and from there I branched out to `kubectl` (_kubeconfigs_ are my fault!), `auth` ([RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) and the `*Review` APIs are ports from OpenShift), and `apps` (_workqueues_ and _sharedinformers_, for example). Don't tell the others, but API Machinery is still my favorite :)

**Federico**: I was not as early to Kubernetes as David, but it's now been more than six years. At my previous company we were starting to use Kubernetes for our own products, and when I came across the opportunity to work directly with Kubernetes I left everything and boarded the ship (no pun intended). I joined Google and Kubernetes in early 2018, and have been involved ever since.

## SIG API Machinery's scope

**FSM: It only takes a quick look at the SIG API Machinery charter to see that it has quite a significant scope, nothing less than the Kubernetes control plane. Could you describe this scope in your own words?**

**David**: We own the `kube-apiserver` and how to use it efficiently. On the backend, that includes its contract with backend storage and how it allows API schema evolution over time. On the frontend, that includes schema best practices, serialization, client patterns, and controller patterns on top of all of it.

**Federico**: Kubernetes has a lot of different components, but the control plane has a really critical mission: it's your communication layer with the cluster, and it also owns all the extensibility mechanisms that make Kubernetes so powerful. We can't afford mistakes like a regression or an incompatible change, because the blast radius is huge.

**FSM: Given this breadth, how do you manage its different aspects?**

**Federico**: We try to organize the large amount of work into smaller areas. The working groups and subprojects are part of that. Different people on the SIG have their own areas of expertise, and if everything fails, we are really lucky to have people like David, Joe, and Stefan who truly are "all-terrain", in a way that keeps impressing me even after all these years. But it is also why we need more people to help us carry the quality and excellence of Kubernetes from release to release.

## An evolving collaboration model

**FSM: Was the existing model always like this, or did it evolve over time? If so, what would you consider the main changes and the reasons behind them?**

**David**: API Machinery has evolved over time, both growing and contracting in scope. When trying to satisfy client access patterns it's very easy to add scope, both in terms of features and of applying them.

A good example of growing scope is the way we identified a need to reduce the memory utilization of clients writing controllers, and developed shared informers. In developing shared informers and the controller patterns that use them (workqueues, error handling, and listers), we greatly reduced memory utilization and eliminated many expensive lists. The downside: we grew a new set of capabilities to support, and effectively took ownership of that area from sig-apps.

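The shared-informer idea David describes can be sketched, very loosely, as a single watch stream feeding one shared cache that fans events out to many controllers. This is a toy illustration in plain Go, not the real client-go `cache.SharedInformer` API; all names below are made up:

```go
package main

import "fmt"

// Event is a minimal stand-in for a watch event from the apiserver.
type Event struct {
	Key, Value string
}

// SharedInformer holds one cache shared by every registered controller,
// so the apiserver is watched once rather than once per controller.
type SharedInformer struct {
	cache    map[string]string // shared local store, read by all controllers
	handlers []func(Event)     // fan-out callbacks, one per controller
}

func NewSharedInformer() *SharedInformer {
	return &SharedInformer{cache: map[string]string{}}
}

// AddEventHandler registers a controller callback (loosely mirroring
// client-go's AddEventHandler).
func (s *SharedInformer) AddEventHandler(h func(Event)) {
	s.handlers = append(s.handlers, h)
}

// Receive applies one watch event to the shared cache and notifies
// every handler.
func (s *SharedInformer) Receive(e Event) {
	s.cache[e.Key] = e.Value
	for _, h := range s.handlers {
		h(e)
	}
}

// Get is a stand-in for a lister: reads hit local memory, not the server.
func (s *SharedInformer) Get(key string) (string, bool) {
	v, ok := s.cache[key]
	return v, ok
}

func main() {
	inf := NewSharedInformer()
	var seenA, seenB []string
	inf.AddEventHandler(func(e Event) { seenA = append(seenA, e.Key) })
	inf.AddEventHandler(func(e Event) { seenB = append(seenB, e.Key) })

	inf.Receive(Event{"default/pod-1", "Running"})
	inf.Receive(Event{"default/pod-2", "Pending"})

	v, _ := inf.Get("default/pod-1")
	fmt.Println(len(seenA), len(seenB), v) // prints "2 2 Running"
}
```

The point of the pattern is visible even in the toy: two controllers observe every event, yet there is one upstream watch and one copy of the data in memory, instead of each controller maintaining its own list/watch and cache.
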
For an example of more shared ownership: in building out cooperative resource management (the goal of server-side apply), `kubectl` expanded to take ownership of leveraging the server-side apply capability. The transition isn't yet complete, but [SIG CLI](https://github.com/kubernetes/community/tree/master/sig-cli) manages that usage and owns it.

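As a brief illustration of the cooperative model (the manifest and field-manager names below are made-up examples; the flags shown are real `kubectl apply` flags), server-side apply lets multiple actors each own a subset of an object's fields:

```shell
# First manager applies its manifest; the server records field ownership.
kubectl apply --server-side --field-manager=my-controller -f deploy.yaml

# A second manager applying overlapping fields gets a conflict error
# unless it explicitly forces ownership of those fields.
kubectl apply --server-side --field-manager=other-tool --force-conflicts -f deploy.yaml
```

Because the server, not the client, tracks who owns which field, tools no longer need the client-side three-way merge heuristics that classic `kubectl apply` relied on.
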
**FSM: And for the boundary between approaches, do you have any guidelines?**

**David**: I think much depends on the impact. If the impact is local in its immediate effect, we advise other SIGs and let them move at their own pace. If the impact is global in its immediate effect, without a natural incentive, we've found a need to press for adoption directly.

**FSM: Still on that note, SIG Architecture has an API Governance subproject: is it mostly independent from SIG API Machinery, or are there important connection points?**

**David**: The projects have similar-sounding names and carry some impact on each other, but they have different missions and scopes. API Machinery owns the how and API Governance owns the what. API conventions, the API approval process, and the final say on individual k8s.io APIs belong to API Governance. API Machinery owns the REST semantics and non-API-specific behaviors.

**Federico**: I really like how David put it: *"API Machinery owns the how and API Governance owns the what"*. We don't own the actual APIs, but the actual APIs live through us.

## The challenges of Kubernetes popularity

**FSM: With the growth in Kubernetes adoption, we have certainly seen increased demands on the control plane: how is this felt, and how does it influence the work of the SIG?**

**David**: It's had a massive influence on API Machinery. Over the years we have often responded to, and many times enabled, the evolutionary stages of Kubernetes. As the central orchestration hub of nearly all capability on Kubernetes clusters, we both lead and follow the community. In broad strokes, I see a few evolution stages for API Machinery over the years, with constantly high activity:

1. **Finding purpose**: `pre-1.0` up until `v1.3` (up to our first 1000+ nodes/namespaces) or so. This time was characterized by rapid change. We went through five different versions of our schemas and rose to meet the need. We optimized for quick, in-tree API evolution (sometimes to the detriment of longer-term goals), and defined patterns for the first time.

2. **Scaling to meet the need**: `v1.3-1.9` (up to shared informers in controllers) or so. When we started trying to meet customer needs as we gained adoption, we found severe scale limitations in terms of CPU and memory. This was where we broadened API Machinery to include access patterns, but we were still heavily focused on in-tree types. We built the watch cache, protobuf serialization, and shared caches.

3. **Fostering the ecosystem**: `v1.8-1.21` (up to CRD v1) or so. This was when we designed and wrote CRDs (the considered replacement for third-party resources), the immediate needs we knew were coming (admission webhooks), and the evolution to best practices we knew we needed (API schemas). This enabled an explosion of early adopters willing to work very carefully within the constraints to enable their use cases for servicing pods. The adoption was very fast, sometimes outpacing our capability and creating new problems.

4. **Simplifying deployments**: `v1.22+`. In the relatively recent past, we've been responding to the pressures of running kube clusters at scale with large numbers of sometimes-conflicting ecosystem projects using our extension mechanisms. A lot of effort is now going into making platform extensions easier to write and safer to manage for people who don't hold PhDs in Kubernetes. This started with things like server-side apply, and continues today with features like webhook match conditions and validating admission policies.

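As a sketch of the validating admission policy mechanism mentioned above (the policy name and replica limit here are made-up examples; the shape follows the upstream `admissionregistration.k8s.io/v1` API), a policy expresses a validation as an in-process CEL expression instead of a webhook call:

```yaml
# Hypothetical example: reject Deployments with more than 100 replicas.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: demo-replica-limit
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 100"
    message: "replica count must not exceed 100"
```

A separate `ValidatingAdmissionPolicyBinding` object then selects where the policy applies. Because the expression runs inside the apiserver, there is no webhook endpoint to operate and a much smaller failure blast radius.
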
Work in API Machinery has a broad impact across the project and the ecosystem. It's an exciting area to work in for those able to make a significant time investment over a long time horizon.

## The road ahead

**FSM: With those different evolutionary stages in mind, what would you pinpoint as the top priorities for the SIG at this time?**

**David:** **Reliability, efficiency, and capability**, in roughly that order.

With the increased usage of our `kube-apiserver` and extension mechanisms, we find that our first set of extension mechanisms, while fairly complete in terms of capability, carries significant risks in terms of potential misuse with a large blast radius. To mitigate these risks, we're investing in features that reduce the blast radius of accidents (webhook match conditions) and that provide alternative mechanisms with lower risk profiles for most actions (validating admission policy).

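The webhook match conditions mentioned here let a webhook be skipped for requests it does not care about, shrinking its blast radius. A sketch, with made-up names for the webhook and service and a `matchConditions` stanza following the upstream `admissionregistration.k8s.io/v1` API:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: demo-webhook   # hypothetical name
webhooks:
- name: demo.hooks.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: demo-webhook-svc
      namespace: default
      path: /validate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  # CEL expressions evaluated in the apiserver; if any is false,
  # the request never reaches the webhook at all.
  matchConditions:
  - name: exclude-kubelet-requests
    expression: '!("system:nodes" in request.userInfo.groups)'
```

If the webhook backend is down, only the narrow slice of requests matching the conditions is affected, rather than every request the rules cover.
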
At the same time, the increased usage has made us more aware of scaling limitations that we can improve on both the server and client side. Efforts here include more efficient serialization (CBOR), reduced etcd load (consistent reads from cache), and reduced peak memory usage (streaming lists).

And finally, the increased usage has highlighted some long-standing gaps that we're closing: things like field selectors for CRDs, which the [Batch Working Group](https://github.com/kubernetes/community/blob/master/wg-batch/README.md) is eager to leverage, and which will eventually form the basis for a new way to prevent trampoline pod attacks from exploited nodes.

## Joining the fun

**FSM: For anyone wanting to start contributing, what are your suggestions?**

**Federico**: SIG API Machinery is no exception to the Kubernetes motto: **Chop Wood and Carry Water**. There are multiple weekly meetings that are open to everybody, and there is always more work to be done than people to do it.

I acknowledge that API Machinery is not easy, and the ramp-up will be steep. The bar is high, because of the reasons we've been discussing: we carry a huge responsibility. But of course, with passion and perseverance many people have ramped up through the years, and we hope more will come.

In terms of concrete opportunities, there is the SIG meeting every two weeks. Everyone is welcome to attend and listen, see what the group talks about, see what's going on in this release, and so on.

Also, twice a week, on Tuesday and Thursday, we have the public Bug Triage, where we go through everything new since the last meeting. We've kept up this practice for more than 7 years now. It's a great opportunity to volunteer to review code, fix bugs, improve documentation, and so on. On Tuesdays it's at 1 PM (PST), and on Thursdays it's at an EMEA-friendly time (9:30 AM PST). We are always looking to improve, and we hope to be able to provide more concrete opportunities to join and participate in the future.

**FSM: Excellent, thank you! Any final comments you would like to share with our readers?**

**Federico**: As I mentioned, the first steps might be hard, but the reward is also greater. Working on API Machinery means working in an area of huge impact (millions of users?), and your contributions will have a direct effect on the way Kubernetes works and the way it's used. For me that's enough reward and motivation!
