| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Spotlight on SIG Architecture: Production Readiness" |
| 4 | +slug: sig-architecture-production-readiness-spotlight-2023 |
| 5 | +date: 2023-11-02 |
| 6 | +canonicalUrl: https://www.k8s.dev/blog/2023/11/02/sig-architecture-production-readiness-spotlight-2023/ |
| 7 | +--- |
| 8 | + |
| 9 | +**Author**: Frederico Muñoz (SAS Institute) |
| 10 | + |
| 11 | +_This is the second interview in a SIG Architecture Spotlight series that will cover the different |
| 12 | +subprojects. In this blog, we will cover the [SIG Architecture: Production Readiness |
| 13 | +subproject](https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#production-readiness-1)_. |
| 14 | + |
| 15 | +In this SIG Architecture spotlight, we talked with [Wojciech Tyczynski](https://github.com/wojtek-t) |
| 16 | +(Google), lead of the Production Readiness subproject. |
| 17 | + |
| 18 | +## About SIG Architecture and the Production Readiness subproject |
| 19 | + |
| 20 | +**Frederico (FSM)**: Hello Wojciech, could you tell us a bit about yourself, your role and how you |
| 21 | +got involved in Kubernetes? |
| 22 | + |
| 23 | +**Wojciech Tyczynski (WT)**: I started contributing to Kubernetes in January 2015. At that time, |
| 24 | +Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office |
| 25 | +(in addition to already existing teams in California and Seattle). I was lucky enough to be one of |
| 26 | +the seeding engineers for that team. |
| 27 | + |
| 28 | +After two months of onboarding and helping with different tasks across the project towards the 1.0 |
| 29 | +launch, I took ownership of the scalability area and led the effort to make Kubernetes support clusters |
| 30 | +with 5000 nodes. I’m still involved in [SIG Scalability](https://github.com/kubernetes/community/blob/master/sig-scalability/README.md) |
| 31 | +as its Technical Lead. That was the start of a journey: scalability is such a cross-cutting topic |
| 32 | +that I started contributing to many other areas including, over time, SIG Architecture. |
| 33 | + |
| 34 | +**FSM**: In SIG Architecture, why specifically the Production Readiness subproject? Was it something |
| 35 | +you had in mind from the start, or was it an unexpected consequence of your initial involvement in |
| 36 | +scalability? |
| 37 | + |
| 38 | +**WT**: After reaching that milestone of [Kubernetes supporting 5000-node clusters](https://kubernetes.io/blog/2017/03/scalability-updates-in-kubernetes-1-6/), |
| 39 | +one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While |
| 40 | +a non-scalable implementation is always fixable, designing non-scalable APIs or contracts is |
| 41 | +problematic. I was looking for a way to ensure that people are thinking about |
| 42 | +scalability when they create new features and capabilities without introducing too much overhead. |
| 43 | + |
| 44 | +This is when I joined forces with [John Belamaric](https://github.com/johnbelamaric) and |
| 45 | +[David Eads](https://github.com/deads2k) and created a Production Readiness subproject within SIG |
| 46 | +Architecture. While setting the bar for scalability was only one of a few motivations for it, it |
| 47 | +ended up fitting quite well. At the same time, I was already involved in the overall reliability of |
| 48 | +the system internally, so other goals of Production Readiness were also close to my heart. |
| 49 | + |
| 50 | +**FSM**: To anyone new to how SIG Architecture works, how would you describe the main goals and |
| 51 | +areas of intervention of the Production Readiness subproject? |
| 52 | + |
| 53 | +**WT**: The goal of the Production Readiness subproject is to ensure that any feature that is added |
| 54 | +to Kubernetes can be reliably used in production clusters. This primarily means that those features |
| 55 | +are observable, scalable, and supportable, and that they can always be safely enabled and, in case of |
| 56 | +production issues, also disabled. |
| 57 | + |
| 58 | +## Production readiness and the Kubernetes project |
| 59 | + |
| 60 | +**FSM**: Architectural consistency being one of the goals of the SIG, is this made more challenging |
| 61 | +by the [distributed and open nature of Kubernetes](https://www.cncf.io/reports/kubernetes-project-journey-report/)? |
| 62 | +Do you feel this impacts the approach that Production Readiness has to take? |
| 63 | + |
| 64 | +**WT**: The distributed nature of Kubernetes certainly impacts Production Readiness, because it |
| 65 | +makes thinking about aspects like enablement/disablement or scalability more challenging. To be more |
| 66 | +precise, when enabling or disabling features that span multiple components you need to think about |
| 67 | +version skew between them and design for it. For scalability, changes in one component may actually |
| 68 | +result in problems for a completely different one, so it requires a good understanding of the whole |
| 69 | +system, not just individual components. But it’s also what makes this project so interesting. |
| 70 | + |
| 71 | +**FSM**: Those running Kubernetes in production will have their own perspective on things. How do |
| 72 | +you capture this feedback? |
| 73 | + |
| 74 | +**WT**: Fortunately, we aren’t talking about _"them"_ here, we’re talking about _"us"_: all of us are |
| 75 | +working for companies that are managing large fleets of Kubernetes clusters and we’re involved in |
| 76 | +that too, so we suffer from those problems ourselves. |
| 77 | + |
| 78 | +So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely |
| 79 | +reveals completely new problems - it rather shows the scale of them. And we try to react to it - |
| 80 | +changes like "Beta APIs off by default" happen in reaction to the data that we observe. |
| 81 | + |
| 82 | +**FSM**: On the topic of reaction, that made me think of how the [Kubernetes Enhancement Proposal (KEP)](https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md) |
| 83 | +template has a Production Readiness Review (PRR) section, which is tied to the graduation |
| 84 | +process. Was this something born out of identified insufficiencies? How would you describe the |
| 85 | +results? |
| 86 | + |
| 87 | +**WT**: As mentioned above, the overall goal of the Production Readiness subproject is to ensure |
| 88 | +that every newly added feature can be reliably used in production. It’s not possible to enforce that |
| 89 | +by a central team - we need to make it everyone's problem. |
| 90 | + |
| 91 | +To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe |
| 92 | +enablement, scalability, observability, supportability, etc. from the very beginning. That means |
| 93 | +not when the implementation starts, but rather during the design. Given that KEPs are effectively |
| 94 | +Kubernetes design docs, making it part of the KEP template was the way to achieve the goal. |
| 95 | + |
| 96 | +**FSM**: So, in a way making sure that feature owners have thought about the implications of their |
| 97 | +proposal. |
| 98 | + |
| 99 | +**WT**: Exactly. We have already observed that just by requiring feature owners to think through the PRR |
| 100 | +aspects (by having them fill in the PRR questionnaire), many of the original issues go |
| 101 | +away. Sure - as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are |
| 102 | +better now than they were a couple of years ago when it comes to thinking about |
| 103 | +productionisation aspects, which is exactly what we wanted to achieve - spreading the culture of |
| 104 | +thinking about reliability in its widest possible meaning. |
| 105 | + |
| 106 | +**FSM**: We've been talking about the PRR process. Could you describe it for our readers? |
| 107 | + |
| 108 | +**WT**: The [PRR process](https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md) |
| 109 | +is fairly simple - we just want to ensure that you think through the productionisation aspects of |
| 110 | +your feature early enough. If you do your job, it’s just a matter of answering some questions in the |
| 111 | +KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you |
| 112 | +didn’t think about those aspects earlier, it may require spending more time and potentially revising |
| 113 | +some decisions, but that’s exactly what we need to make the Kubernetes project reliable. |
| 114 | + |
| 115 | +## Helping with Production Readiness |
| 116 | + |
| 117 | +**FSM**: Production Readiness seems to be one area where a good deal of prior exposure is required |
| 118 | +in order to be an effective contributor. Are there also ways for someone newer to the project to |
| 119 | +contribute? |
| 120 | + |
| 121 | +**WT**: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch |
| 122 | +potential issues. Kubernetes is such a large project now with so many nuances that people who are |
| 123 | +new to the project can simply miss the context, no matter how senior they are. |
| 124 | + |
| 125 | +That said, there are many ways in which you can help indirectly. Increasing the reliability of |
| 126 | +particular areas of the project by improving their observability and debuggability, increasing test |
| 127 | +coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note |
| 128 | +that the PRR subproject is focused on keeping the bar at the design level, but we should also care |
| 129 | +equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so |
| 130 | +having people there who are aware of productionisation aspects, and who deeply care about it, will |
| 131 | +help the project a lot. |
| 132 | + |
| 133 | +**FSM**: Thank you! Any final comments you would like to share with our readers? |
| 134 | + |
| 135 | +**WT**: I would like to thank all contributors for their cooperation. While the PRR |
| 136 | +adds some work for them, we see that people care about it, and what’s even more |
| 137 | +encouraging is that with every release the quality of the answers improves, and questions such as "do I |
| 138 | +really need a metric reflecting whether my feature works" or "is downgrade really that important" don’t |
| 139 | +really appear anymore. |