|
| 1 | +# WG Reliability Charter |
| 2 | + |
| 3 | +This charter adheres to the conventions described in the [Kubernetes Charter README] |
| 4 | +and uses the Roles and Organization Management outlined in [sig-governance]. |
| 5 | + |
| 6 | +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md |
| 7 | +[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md |
| 8 | + |
| 9 | +## Scope |
| 10 | + |
| 11 | +The Reliability Working Group (WG Reliability) is organized with the goal of |
| 12 | +allowing users to safely use Kubernetes for managing production workloads by |
| 13 | +ensuring Kubernetes is stable and reliable. |
| 14 | + |
| 15 | +### In Scope |
| 16 | + |
| 17 | +- Defining what reliability means for Kubernetes and how to measure it |
| 18 | +- Measuring Kubernetes reliability in tests |
| 19 | +- Introducing criteria for blocking the release if the reliability is |
| 20 | + below the bar |
| 21 | +- Building a list of end-user outages and reliability issues |
| 22 | + (if applicable with mitigations and/or workarounds) |
| 23 | +- Creating and prioritizing a list of areas that require reliability |
| 24 | + investments |
| 25 | +- Working with relevant SIGs on delivering necessary infrastructure |
| 26 | + (e.g. test frameworks) to unblock further steps |
| 27 | +- Initiating and driving cross-SIG reliability improvements |
| 28 | + |
| 29 | +For all of the above, we will focus on core Kubernetes components and addons. |
| 30 | +Other SIG subprojects/components (e.g. SIG Scheduling descheduler) are out of |
| 31 | +scope. |
| 32 | + |
| 33 | +### Out of Scope |
| 34 | + |
| 35 | +- Designing and executing improvements that clearly fall under an |
| 36 | + individual SIG's responsibilities. |
| 37 | + |
| 38 | +## Special Powers |
| 39 | + |
| 40 | +The Reliability WG will create a proposal that will allow blocking |
| 41 | +feature-oriented contributions from any SIG if requested reliability-related |
| 42 | +improvements are not being addressed. The exact criteria will have to be |
| 43 | +approved by SIG Architecture, SIG Release, and SIG Testing, and will be |
| 44 | +automatically enforced. |
| 45 | + |
| 46 | +The exact scope of blocking hasn't yet been decided. There are at least two |
| 47 | +high-level options: blocking PRs and blocking graduation of features. |
| 48 | +Conformance vs. everything enabled by default also has to be explicitly defined. |
| 49 | +As a result, the mechanics of blocking haven't been decided either, as they will |
| 50 | +heavily depend on the exact scope. As noted above, all of this will have |
| 51 | +to be explicitly approved by those SIGs. |
| 52 | + |
| 53 | +The blocking criteria (once approved) will be passed to the SIG Architecture |
| 54 | +Production Readiness subproject, or to SIG Architecture generally, for |
| 55 | +reassignment at the lead's discretion. |
| 56 | + |
| 57 | +Note that ideally the criteria should be extensible to other areas (e.g. |
| 58 | +security), but that is not a goal in itself. |
| 59 | + |
| 60 | +## Stakeholders |
| 61 | + |
| 62 | +Stakeholders in this working group span multiple SIGs. |
| 63 | + |
| 64 | +In the first phase (defining reliability for Kubernetes and building a list of |
| 65 | +reliability gaps and areas for investment), the following SIGs will be |
| 66 | +involved: |
| 67 | + |
| 68 | +- SIG Architecture |
| 69 | + High-level input on requirements. |
| 70 | +- SIG Scalability |
| 71 | + Input on scale test gaps and reliability issues at scale. |
| 72 | +- SIG Cluster Lifecycle |
| 73 | + Input on cluster setup and upgrade mechanics. |
| 74 | +- SIG Release |
| 75 | + Input on blocking and soak requirements. |
| 76 | +- SIG Testing |
| 77 | + Input on testing mechanics, missing frameworks, etc. |
| 78 | +- SIG * |
| 79 | + Input on reliability gaps in their areas. |
| 80 | + |
| 81 | +The group will also reach out to users and cluster operators |
| 82 | +(e.g. via surveys) to build the full picture. We will likely leverage |
| 83 | +the CNCF end-user group for this purpose. |
| 84 | + |
| 85 | +In the later phase (improving reliability), any SIG may be involved, |
| 86 | +depending on the findings from the initial phase. |
| 87 | + |
| 88 | +## Deliverables |
| 89 | + |
| 90 | +The artifacts the group is expected to deliver include: |
| 91 | +- A document defining what reliability means for Kubernetes and how to measure it |
| 92 | +- A list of known user outages and potential failure modes |
| 93 | +- A list of specific investments that should happen to improve reliability |
| 94 | +- A set of processes to introduce in Kubernetes to avoid degradation of |
| 95 | + reliability over time |
| 96 | + |
| 97 | +The actual investments will be owned by the corresponding SIGs. |
| 98 | + |
| 99 | +## Roles and Organization Management |
| 100 | + |
| 101 | +This working group adheres to the Roles and Organization Management outlined in |
| 102 | +[sig-governance] and opts in to updates and modifications to [sig-governance]. |
| 103 | + |
| 104 | +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md |
| 105 | + |
| 106 | +## Timelines and Disbanding |
| 107 | + |
| 108 | +The exact timeline for the existence of this working group is hard to predict at |
| 109 | +this time. |
| 110 | + |
| 111 | +The group will start working on the deliverables mentioned above. Once we |
| 112 | +are satisfied with their shape and no additional coordination on their |
| 113 | +execution is needed, we will retire the Working Group |
| 114 | +and pass oversight of reliability to the SIG Architecture PRR subproject. |