Commit 8ae06f6

Merge pull request #5170 from wojtek-t/wg-reliability-charter
Create Reliability WG charter
2 parents 293186a + 5013172 commit 8ae06f6

3 files changed: +117 -0 lines changed

sigs.yaml

Lines changed: 1 addition & 0 deletions
@@ -2775,6 +2775,7 @@ workinggroups:
 Allow users to safely use Kubernetes for managing production workloads by ensuring
 Kubernetes is stable and reliable.
+charter_link: charter.md
 stakeholder_sigs:
 - Architecture
 - Cluster Lifecycle

wg-reliability/README.md

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ To understand how this file is generated, see https://git.k8s.io/community/gener
 
 Allow users to safely use Kubernetes for managing production workloads by ensuring Kubernetes is stable and reliable.
 
+The [charter](charter.md) defines the scope and governance of the Reliability Working Group.
+
 ## Stakeholder SIGs
 * SIG Architecture
 * SIG Cluster Lifecycle

wg-reliability/charter.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# WG Reliability Charter

This charter adheres to the conventions described in the [Kubernetes Charter README]
and uses the Roles and Organization Management outlined in [sig-governance].

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md

## Scope

The Reliability Working Group (WG Reliability) is organized with the goal of
allowing users to safely use Kubernetes for managing production workloads by
ensuring Kubernetes is stable and reliable.

### In Scope

- Defining what reliability means for Kubernetes and how to measure it
- Measuring Kubernetes reliability in tests
- Introducing criteria for blocking the release if the reliability is
  below the bar
- Building a list of end-user outages and reliability issues
  (if applicable, with mitigations and/or workarounds)
- Creating and prioritizing a list of areas that require reliability
  investments
- Working with relevant SIGs on delivering necessary infrastructure
  (e.g. test frameworks) to unblock further steps
- Initiating and driving cross-SIG reliability improvements

For all of the above, we will focus on core Kubernetes components and addons.
Other SIG subprojects/components (e.g. SIG Scheduling descheduler) are out of
scope.

### Out of Scope

- Designing and executing improvements that clearly fall within individual SIG
  responsibilities.

## Special Powers

The Reliability WG will create a proposal that will allow blocking
feature-oriented contributions from any SIG if requested reliability-related
improvements are not being addressed. The exact criteria will have to be
approved by SIG Architecture, SIG Release, and SIG Testing, and be
automatically enforced.

The exact scope of blocking hasn't yet been decided. There are at least two
high-level options: blocking PRs and blocking graduation of features.
Conformance vs. everything enabled by default has to be explicitly defined.
As a result, the mechanics of blocking haven't been decided either, as they
will depend heavily on the exact scope. All of these will have to be
explicitly approved by the SIGs mentioned above.

The blocking criteria (once approved) will be passed to the SIG Architecture
Production Readiness subproject or to SIG Architecture generally for
reassignment at the lead's discretion.

Note that ideally the criteria should be extensible to other areas (e.g.
security), but that is not a goal in itself.

## Stakeholders

Stakeholders in this working group span multiple SIGs.

In the first phase (defining reliability for Kubernetes and building a list of
reliability gaps and areas for investment), the following SIGs will be
involved:

- SIG Architecture
  High-level input on requirements.
- SIG Scalability
  Input on scale test gaps and reliability issues at scale.
- SIG Cluster Lifecycle
  Input on cluster setup and upgrade mechanics.
- SIG Release
  Input on blocking and soak requirements.
- SIG Testing
  Input on testing mechanics, missing frameworks, etc.
- SIG *
  Input on reliability gaps in their areas.

The group will also reach out to users and cluster operators
(e.g. via surveys) to build the full picture. We will likely leverage
the CNCF end-user group for this purpose.

In the later phase (improving reliability), every SIG may potentially
be involved, depending on the findings from the initial phase.

## Deliverables

The artifacts the group is supposed to deliver include:
- Document defining what reliability means for Kubernetes and how to measure it
- List of known user outages and potential failure modes
- List of specific investments that should happen to improve reliability
- Set of processes to introduce in Kubernetes to avoid degradation of
  reliability over time

The actual investments will be owned by the corresponding SIGs.

## Roles and Organization Management

This working group adheres to the Roles and Organization Management outlined in
[sig-governance] and opts in to updates and modifications to [sig-governance].

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md

## Timelines and Disbanding

The exact lifetime of this working group is hard to predict at
this time.

The group will start working on the deliverables mentioned above. Once we
are satisfied with their shape and no additional coordination on their
execution is needed, we will retire the working group and pass oversight of
reliability to the SIG Architecture PRR subproject.
