Commit ed9a8d1

SIG Node CI Subproject Celebrates Two Years of Test Improvements
1 parent 996ac4d commit ed9a8d1

3 files changed

+190
-0
lines changed

@@ -0,0 +1,190 @@
1+
---
2+
layout: blog
3+
title: 'SIG Node CI Subproject Celebrates Two Years of Test Improvements'
4+
date: 2022-02-16
5+
---
6+
7+
**Authors:** Sergey Kanzhelev (Google), Elana Hashman (Red Hat)
8+
9+
Ensuring the reliability of SIG Node upstream code is a continuous effort
10+
that takes a lot of behind-the-scenes work from many contributors.
11+
There are frequent releases of Kubernetes, base operating systems,
12+
container runtimes, and test infrastructure that result in a complex matrix that
13+
requires attention and steady investment to "keep the lights on."
14+
In May 2020, the Kubernetes node special interest group ("SIG Node") organized a new
15+
subproject for continuous integration (CI) for node-related code and tests. Since its
16+
inauguration, the SIG Node CI subproject has run a weekly meeting, and even the full hour
17+
is often not enough to complete triage of all bugs, test-related PRs and issues, and discuss all
18+
related ongoing work within the subgroup.
19+
20+
Over the past two years, we've fixed merge-blocking and release-blocking tests, reducing time to merge Kubernetes contributors' pull requests thanks to reduced test flakes. When we started, Node test jobs only passed 42% of the time, and through our efforts, we now ensure a consistent >90% job pass rate. We've closed 144 test failure issues and merged 176 pull requests just in kubernetes/kubernetes. And we've helped subproject participants ascend the Kubernetes contributor ladder, with 3 new org members, 6 new reviewers, and 2 new approvers.
21+
22+
The Node CI subproject is an approachable first stop to help new contributors
23+
get started with SIG Node. There is a low barrier to entry for new contributors
24+
to address high-impact bugs and test fixes, although there is a long
25+
road before contributors can climb the entire contributor ladder:
26+
it took over a year to establish two new approvers for the group.
27+
The complexity of all the different components that power Kubernetes nodes
28+
and its test infrastructure requires a sustained investment over a long period
29+
for developers to deeply understand the entire system,
30+
both at high and low levels of detail.
31+
32+
We have several regular contributors at our meetings; however, our reviewers
33+
and approvers pool is still small. It is our goal to continue to grow
34+
contributors to ensure a sustainable distribution of work
35+
that does not just fall to a few key approvers.
36+
37+
It's not always obvious how subprojects within SIGs are formed, operate,
38+
and work. Each is unique to its sponsoring SIG and tailored to the projects
39+
that the group is intended to support. As a group that has welcomed many
40+
first-time SIG Node contributors, we'd like to share some of the details and
41+
accomplishments over the past two years,
42+
helping to demystify our inner workings and celebrate the hard work
43+
of all our dedicated contributors!
44+
45+
## Timeline
46+
47+
***May 2020.*** The SIG Node CI group was formed on May 11, 2020, with more than
48+
[30 volunteers](https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#bookmark=id.vsb8pqnf4gib)
49+
signed up, to improve SIG Node CI signal and overall observability.
50+
Victor Pickard was focused on getting
51+
[testgrid jobs](https://testgrid.k8s.io/sig-node) passing
52+
when Ning Liao suggested forming a group around this effort and came up with
53+
the [original group charter document](https://docs.google.com/document/d/1yS-XoUl6GjZdjrwxInEZVHhxxLXlTIX2CeWOARmD8tY/edit#heading=h.te6sgum6s8uf).
54+
The SIG Node chairs sponsored group creation with Victor as a subproject lead.
55+
Sergey Kanzhelev joined Victor shortly after as a co-lead.
56+
57+
At the kick-off meeting, we discussed which tests to concentrate on fixing first
58+
and reviewed the merge-blocking and release-blocking tests, many of which were failing due
59+
to infrastructure issues or buggy test code.
60+
61+
The subproject launched weekly hour-long meetings for ongoing work
62+
discussion and triage.
63+
64+
***June 2020.*** Morgan Bauer, Karan Goel, and Jorge Alarcon Ochoa were
65+
recognized as reviewers for the SIG Node CI group for their contributions,
66+
helping significantly with the early stages of the subproject.
67+
David Porter and Roy Yang also joined the SIG test failures GitHub team.
68+
69+
***August 2020.*** All merge-blocking and release-blocking tests were passing,
70+
with some flakes. However, only 42% of all SIG Node test jobs were green, as there
71+
were many flakes and failing tests.
72+
73+
***October 2020.*** Amim Knabben became a Kubernetes org member for his
74+
contributions to the subproject.
75+
76+
***January 2021.*** With healthy presubmit and critical periodic jobs passing,
77+
the subproject discussed its goal for cleaning up the rest of periodic tests
78+
and ensuring they passed without flakes.
79+
80+
Elana Hashman joined the subproject, stepping up to help lead it after
81+
Victor's departure.
82+
83+
***February 2021.*** Artyom Lukianov became a Kubernetes org member for his
84+
contributions to the subproject.
85+
86+
***August 2021.*** After SIG Node successfully ran a [bug scrub](https://groups.google.com/g/kubernetes-dev/c/w2ghO4ihje0/m/VeEql1LJBAAJ)
87+
to clean up its bug backlog, the scope of the meeting was extended to
88+
include bug triage to increase overall reliability, anticipating issues
89+
before they affect the CI signal.
90+
91+
Subproject leads Elana Hashman and Sergey Kanzhelev were both recognized as
92+
approvers on all node test code, supported by SIG Node and SIG Testing.
93+
94+
***September 2021.*** After significant deflaking progress with serial tests in
95+
the 1.22 release spearheaded by Francesco Romani, the subproject set a goal
96+
for getting the serial job fully passing by the 1.23 release date.
97+
98+
Mike Miranda became a Kubernetes org member for his contributions
99+
to the subproject.
100+
101+
***November 2021.*** Throughout 2021, SIG Node had no merge-blocking or
102+
release-blocking test failures. Many flaky tests from past releases were removed
103+
from release-blocking dashboards as they had been fully cleaned up.
104+
105+
Danielle Lancashire was recognized as a reviewer for the SIG Node subgroup's test code.
106+
107+
The final node serial tests were completely fixed. The serial tests consist of
108+
many disruptive and slow tests which tend to be flaky and are hard
109+
to troubleshoot. By the 1.23 release freeze, the last serial tests were
110+
fixed and the job was passing without flakes.
111+
112+
[![Slack announcement that Serial tests are green](serial-tests-green.png)](https://kubernetes.slack.com/archives/C0BP8PW9G/p1638211041322900)
113+
114+
The 1.23 release received a special shout-out for its test quality and CI signal.
115+
The SIG Node CI subproject was proud to have helped contribute to such
116+
a high-quality release, in part due to our efforts in identifying
117+
and fixing flakes in Node and beyond.
118+
119+
[![Slack shoutout that release was mostly green](release-mostly-green.png)](https://kubernetes.slack.com/archives/C92G08FGD/p1637175755023200)
120+
121+
***December 2021.*** An estimated 90% of test jobs were passing at the time of
122+
the 1.23 release (up from 42% in August 2020).
123+
124+
Dockershim code was removed from Kubernetes. This affected nearly half of SIG Node's
125+
test jobs, and the SIG Node CI subproject reacted quickly and retargeted all the
126+
tests. SIG Node was the first SIG to complete test migrations off dockershim,
127+
providing examples for other affected SIGs. The vast majority of new jobs passed
128+
at the time of introduction without further fixes required. The [effort of
129+
removing dockershim](https://k8s.io/dockershim) from Kubernetes is ongoing.
130+
There are still some wrinkles from the dockershim removal as we uncover more
131+
dependencies on dockershim, but we plan to stabilize all test jobs
132+
by the 1.24 release.
133+
134+
## Statistics
135+
136+
Our regular meeting attendees and subproject participants for the past few months:
137+
138+
- Aditi Sharma
139+
- Artyom Lukianov
140+
- Arnaud Meukam
141+
- Danielle Lancashire
142+
- David Porter
143+
- Davanum Srinivas
144+
- Elana Hashman
145+
- Francesco Romani
146+
- Matthias Bertschy
147+
- Mike Miranda
148+
- Paco Xu
149+
- Peter Hunt
150+
- Ruiwen Zhao
151+
- Ryan Phillips
152+
- Sergey Kanzhelev
153+
- Skyler Clark
154+
- Swati Sehgal
155+
- Wenjun Wu
156+
157+
The [kubernetes/test-infra](https://github.com/kubernetes/test-infra/) source code repository contains test definitions. The number of
158+
Node PRs just in that repository:
159+
- 2020 PRs (since May): [183](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2020-05-01..2020-12-31+-author%3Ak8s-infra-ci-robot+)
160+
- 2021 PRs: [264](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2021-01-01..2021-12-31+-author%3Ak8s-infra-ci-robot+)
161+
162+
Triaged issues and PRs on CI board (including triaging away from the subgroup scope):
163+
164+
- 2020 (since May): [132](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2020-05-01..2020-12-31)
165+
- 2021: [532](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2021-01-01..2021-12-31+)
166+
167+
## Future
168+
169+
Just "keeping the lights on" is a bold task and we are committed to improving this experience.
170+
We are working to simplify the triage and review processes for SIG Node.
171+
172+
Specifically, we are working on better test organization, naming,
173+
and tracking:
174+
175+
- https://github.com/kubernetes/enhancements/pull/3042
176+
- https://github.com/kubernetes/test-infra/issues/24641
177+
- [Kubernetes SIG-Node CI Testgrid Tracker](https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit#gid=0)
178+
179+
We are also constantly making progress on improving test debuggability and de-flaking.
180+
181+
If any of this interests you, we'd love for you to join us!
182+
There's plenty to learn in debugging test failures, and it will help you gain
183+
familiarity with the code that SIG Node maintains.
184+
185+
You can always find information about the group on the
186+
[SIG Node](https://github.com/kubernetes/community/tree/master/sig-node) page.
187+
We give group updates at our maintainer track sessions, such as
188+
[KubeCon + CloudNativeCon Europe 2021](https://kccnceu2021.sched.com/event/iE8E/kubernetes-sig-node-intro-and-deep-dive-elana-hashman-red-hat-sergey-kanzhelev-google) and
189+
[KubeCon + CloudNativeCon North America 2021](https://kccncna2021.sched.com/event/lV9D/kubenetes-sig-node-intro-and-deep-dive-elana-hashman-derek-carr-red-hat-sergey-kanzhelev-dawn-chen-google?iframe=no&w=100%&sidebar=yes&bg=no).
190+
Join us in our mission to keep the kubelet and other SIG Node components reliable and ensure smooth and uneventful releases!