Commit f259bae

Merge pull request #43474 from fsmunoz/sig-arch-prod-readiness-spotlight
Add SIG Architecture Production Readiness spotlight
2 parents 6face83 + f18aa90 commit f259bae

1 file changed: 139 additions, 0 deletions

@@ -0,0 +1,139 @@
---
layout: blog
title: "Spotlight on SIG Architecture: Production Readiness"
slug: sig-architecture-production-readiness-spotlight-2023
date: 2023-11-02
canonicalUrl: https://www.k8s.dev/blog/2023/11/02/sig-architecture-production-readiness-spotlight-2023/
---

**Author**: Frederico Muñoz (SAS Institute)

_This is the second interview of a SIG Architecture Spotlight series that will cover the different
subprojects. In this blog, we will cover the [SIG Architecture: Production Readiness
subproject](https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#production-readiness-1)_.

In this SIG Architecture spotlight, we talked with [Wojciech Tyczynski](https://github.com/wojtek-t)
(Google), lead of the Production Readiness subproject.

## About SIG Architecture and the Production Readiness subproject

**Frederico (FSM)**: Hello Wojciech, could you tell us a bit about yourself, your role and how you
got involved in Kubernetes?

**Wojciech Tyczynski (WT)**: I started contributing to Kubernetes in January 2015. At that time,
Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office
(in addition to already existing teams in California and Seattle). I was lucky enough to be one of
the seeding engineers for that team.

After two months of onboarding and helping with different tasks across the project towards 1.0
launch, I took ownership of the scalability area and led the work for Kubernetes to support clusters
with 5000 nodes. I’m still involved in [SIG Scalability](https://github.com/kubernetes/community/blob/master/sig-scalability/README.md)
as its Technical Lead. That was the start of a journey: since scalability is such a cross-cutting topic,
I started contributing to many other areas, including, over time, SIG Architecture.

**FSM**: In SIG Architecture, why specifically the Production Readiness subproject? Was it something
you had in mind from the start, or was it an unexpected consequence of your initial involvement in
scalability?

**WT**: After reaching that milestone of [Kubernetes supporting 5000-node clusters](https://kubernetes.io/blog/2017/03/scalability-updates-in-kubernetes-1-6/),
one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While
a non-scalable implementation is always fixable, designing non-scalable APIs or contracts is
problematic. I was looking for a way to ensure that people are thinking about
scalability when they create new features and capabilities without introducing too much overhead.

This is when I joined forces with [John Belamaric](https://github.com/johnbelamaric) and
[David Eads](https://github.com/deads2k) and created a Production Readiness subproject within SIG
Architecture. While setting the bar for scalability was only one of a few motivations for it, it
ended up fitting quite well. At the same time, I was already involved in the overall reliability of
the system internally, so other goals of Production Readiness were also close to my heart.

**FSM**: To anyone new to how SIG Architecture works, how would you describe the main goals and
areas of intervention of the Production Readiness subproject?

**WT**: The goal of the Production Readiness subproject is to ensure that any feature that is added
to Kubernetes can be reliably used in production clusters. This primarily means that those features
are observable, scalable, and supportable, and that they can always be safely enabled and, in case of
production issues, also disabled.

## Production readiness and the Kubernetes project

**FSM**: Architectural consistency being one of the goals of the SIG, is this made more challenging
by the [distributed and open nature of Kubernetes](https://www.cncf.io/reports/kubernetes-project-journey-report/)?
Do you feel this impacts the approach that Production Readiness has to take?

**WT**: The distributed nature of Kubernetes certainly impacts Production Readiness, because it
makes thinking about aspects like enablement/disablement or scalability more challenging. To be more
precise, when enabling or disabling features that span multiple components you need to think about
version skew between them and design for it. For scalability, changes in one component may actually
result in problems for a completely different one, so it requires a good understanding of the whole
system, not just individual components. But it’s also what makes this project so interesting.

**FSM**: Those running Kubernetes in production will have their own perspective on things. How do
you capture this feedback?

**WT**: Fortunately, we aren’t talking about _"them"_ here, we’re talking about _"us"_: all of us are
working for companies that are managing large fleets of Kubernetes clusters and we’re involved in
that too, so we suffer from those problems ourselves.

So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely
reveals completely new problems - it rather shows the scale of them. And we try to react to it -
changes like "Beta APIs off by default" happen in reaction to the data that we observe.

**FSM**: On the topic of reaction, that made me think of how the [Kubernetes Enhancement Proposal (KEP)](https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md)
template has a Production Readiness Review (PRR) section, which is tied to the graduation
process. Was this something born out of identified insufficiencies? How would you describe the
results?

**WT**: As mentioned above, the overall goal of the Production Readiness subproject is to ensure
that every newly added feature can be reliably used in production. It’s not possible for a central
team to enforce that - we need to make it everyone's problem.

To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe
enablement, scalability, observability, supportability, etc. from the very beginning: not when the
implementation starts, but during the design. Given that KEPs are effectively
Kubernetes design docs, making it part of the KEP template was the way to achieve the goal.

**FSM**: So, in a way making sure that feature owners have thought about the implications of their
proposal.

**WT**: Exactly. We already observed that just by forcing feature owners to think through the PRR
aspects (by having them fill in the PRR questionnaire), many of the original issues go
away. Sure - as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are
better now than they were a couple of years ago when it comes to
productionisation aspects, which is exactly what we wanted to achieve - spreading the culture of
thinking about reliability in its widest possible meaning.

**FSM**: We've been talking about the PRR process. Could you describe it for our readers?

**WT**: The [PRR process](https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md)
is fairly simple - we just want to ensure that you think through the productionisation aspects of
your feature early enough. If you do your job, it’s just a matter of answering some questions in the
KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you
didn’t think about those aspects earlier, it may require spending more time and potentially revising
some decisions, but that’s exactly what we need to make the Kubernetes project reliable.

## Helping with Production Readiness

**FSM**: Production Readiness seems to be one area where a good deal of prior exposure is required
in order to be an effective contributor. Are there also ways for someone newer to the project to
contribute?

**WT**: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch
potential issues. Kubernetes is such a large project now with so many nuances that people who are
new to the project can simply miss the context, no matter how senior they are.

That said, there are many ways that you may implicitly help. Increasing the reliability of
particular areas of the project by improving their observability and debuggability, increasing test
coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note
that the PRR subproject is focused on keeping the bar at the design level, but we should also care
equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so
having people there who are aware of productionisation aspects, and who deeply care about them, will
help the project a lot.

**FSM**: Thank you! Any final comments you would like to share with our readers?

**WT**: I would like to highlight and thank all contributors for their cooperation. While the PRR
adds some additional work for them, we see that people care about it, and what’s even more
encouraging is that with every release the quality of the answers improves, and questions like "do I
really need a metric reflecting if my feature works" or "is downgrade really that important" don’t
really appear anymore.