Merge pull request #43474 from fsmunoz/sig-arch-prod-readiness-spotlight

k8s-ci-robot · web-flow · commit f259baeba2f6 · 2023-10-27T16:40:43.000+02:00
Add SIG Architecture Production Readiness spotlight
diff --git a/content/en/blog/_posts/2023-11-02-sig-architecture-prod-readiness-spotlight.md b/content/en/blog/_posts/2023-11-02-sig-architecture-prod-readiness-spotlight.md
@@ -0,0 +1,139 @@
+---
+layout: blog
+title: "Spotlight on SIG Architecture: Production Readiness"
+slug: sig-architecture-production-readiness-spotlight-2023
+date: 2023-11-02
+canonicalUrl: https://www.k8s.dev/blog/2023/11/02/sig-architecture-production-readiness-spotlight-2023/
+---
+
+**Author**: Frederico Muñoz (SAS Institute)
+
+_This is the second interview of a SIG Architecture Spotlight series that will cover the different
+subprojects. In this blog, we will cover the [SIG Architecture: Production Readiness
+subproject](https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#production-readiness-1)_.
+
+In this SIG Architecture spotlight, we talked with [Wojciech Tyczynski](https://github.com/wojtek-t)
+(Google), lead of the Production Readiness subproject.
+
+## About SIG Architecture and the Production Readiness subproject
+
+**Frederico (FSM)**: Hello Wojciech, could you tell us a bit about yourself, your role and how you
+got involved in Kubernetes?
+
+**Wojciech Tyczynski (WT)**: I started contributing to Kubernetes in January 2015. At that time,
+Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office
+(in addition to already existing teams in California and Seattle). I was lucky enough to be one of
+the seeding engineers for that team.
+
+After two months of onboarding and helping with different tasks across the project towards 1.0
+launch, I took ownership of the scalability area and I was leading Kubernetes to support clusters
+with 5000 nodes. I’m still involved in [SIG Scalability](https://github.com/kubernetes/community/blob/master/sig-scalability/README.md)
+as its Technical Lead. That was the start of a journey since scalability is such a cross-cutting topic,
+and I started contributing to many other areas including, over time, to SIG Architecture.
+
+**FSM**: In SIG Architecture, why specifically the Production Readiness subproject? Was it something
+you had in mind from the start, or was it an unexpected consequence of your initial involvement in
+scalability?
+
+**WT**: After reaching that milestone of [Kubernetes supporting 5000-node clusters](https://kubernetes.io/blog/2017/03/scalability-updates-in-kubernetes-1-6/),
+one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While
+non-scalable implementation is always fixable, designing non-scalable APIs or contracts is
+problematic. I was looking for a way to ensure that people are thinking about
+scalability when they create new features and capabilities without introducing too much overhead.
+
+This is when I joined forces with [John Belamaric](https://github.com/johnbelamaric) and
+[David Eads](https://github.com/deads2k) and created a Production Readiness subproject within SIG
+Architecture. While setting the bar for scalability was only one of a few motivations for it, it
+ended up fitting quite well. At the same time, I was already involved in the overall reliability of
+the system internally, so other goals of Production Readiness were also close to my heart.
+
+**FSM**: To anyone new to how SIG Architecture works, how would you describe the main goals and
+areas of intervention of the Production Readiness subproject?
+
+**WT**: The goal of the Production Readiness subproject is to ensure that any feature that is added
+to Kubernetes can be reliably used in production clusters. This primarily means that those features
+are observable, scalable, supportable, can always be safely enabled and in case of production issues
+also disabled.
+
+## Production readiness and the Kubernetes project
+
+**FSM**: Architectural consistency being one of the goals of the SIG, is this made more challenging
+by the [distributed and open nature of Kubernetes](https://www.cncf.io/reports/kubernetes-project-journey-report/)?
+Do you feel this impacts the approach that Production Readiness has to take?
+
+**WT**: The distributed nature of Kubernetes certainly impacts Production Readiness, because it
+makes thinking about aspects like enablement/disablement or scalability more challenging. To be more
+precise, when enabling or disabling features that span multiple components you need to think about
+version skew between them and design for it. For scalability, changes in one component may actually
+result in problems for a completely different one, so it requires a good understanding of the whole
+system, not just individual components. But it’s also what makes this project so interesting.
+
+**FSM**: Those running Kubernetes in production will have their own perspective on things, how do
+you capture this feedback?
+
+**WT**: Fortunately, we aren’t talking about _"them"_ here, we’re talking about _"us"_: all of us are
+working for companies that are managing large fleets of Kubernetes clusters and we’re involved in
+that too, so we suffer from those problems ourselves.
+
+So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely
+reveals completely new problems - it rather shows the scale of them. And we try to react to it -
+changes like "Beta APIs off by default" happen in reaction to the data that we observe.
+
+**FSM**: On the topic of reaction, that made me think of how the [Kubernetes Enhancement Proposal (KEP)](https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md)
+template has a Production Readiness Review (PRR) section, which is tied to the graduation
+process. Was this something born out of identified insufficiencies? How would you describe the
+results?
+
+**WT**: As mentioned above, the overall goal of the Production Readiness subproject is to ensure
+that every newly added feature can be reliably used in production. It’s not possible to enforce that
+by a central team - we need to make it everyone's problem.
+
+To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe
+enablement, scalability, observability, supportability, etc. from the very beginning. Which means
+not when the implementation starts, but rather during the design. Given that KEPs are effectively
+Kubernetes design docs, making it part of the KEP template was the way to achieve the goal.
+
+**FSM**: So, in a way making sure that feature owners have thought about the implications of their
+proposal.
+
+**WT**: Exactly. We already observed that just by forcing feature owners to think through the PRR
+aspects (via forcing them to fill in the PRR questionnaire) many of the original issues are going
+away. Sure - as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are
+better now than they used to be a couple of years ago in what concerns thinking about
+productionisation aspects, which is exactly what we wanted to achieve - spreading the culture of
+thinking about reliability in its widest possible meaning.
+
+**FSM**: We've been talking about the PRR process, could you describe it for our readers?
+
+**WT**: The [PRR process](https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md)
+is fairly simple - we just want to ensure that you think through the productionisation aspects of
+your feature early enough. If you do your job, it’s just a matter of answering some questions in the
+KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you
+didn’t think about those aspects earlier, it may require spending more time and potentially revising
+some decisions, but that’s exactly what we need to make the Kubernetes project reliable.
+
+## Helping with Production Readiness
+
+**FSM**: Production Readiness seems to be one area where a good deal of prior exposure is required
+in order to be an effective contributor. Are there also ways for someone newer to the project to
+contribute?
+
+**WT**: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch
+potential issues. Kubernetes is such a large project now with so many nuances that people who are
+new to the project can simply miss the context, no matter how senior they are.
+
+That said, there are many ways that you may implicitly help. Increasing the reliability of
+particular areas of the project by improving its observability and debuggability, increasing test
+coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note
+that the PRR subproject is focused on keeping the bar at the design level, but we should also care
+equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so
+having people there who are aware of productionisation aspects, and who deeply care about it, will
+help the project a lot.
+
+**FSM**: Thank you! Any final comments you would like to share with our readers?
+
+**WT**: I would like to highlight and thank all contributors for their cooperation. While the PRR
+adds some additional work for them, we see that people care about it, and what’s even more
+encouraging is that with every release the quality of the answers improves, and questions "do I
+really need a metric reflecting if my feature works" or "is downgrade really that important" don’t
+really appear anymore.