Commit e60da5b

Merge pull request #31655 from SergeyKanzhelev/ci-group
SIG Node CI Subproject celebrates two years of test improvements
2 parents e3d0384 + 1c0b4eb commit e60da5b

3 files changed

+192
-0
lines changed

@@ -0,0 +1,192 @@
1+
---
2+
layout: blog
3+
title: 'SIG Node CI Subproject Celebrates Two Years of Test Improvements'
4+
date: 2022-02-16
5+
slug: sig-node-ci-subproject-celebrates
6+
canonicalUrl: https://www.kubernetes.dev/blog/2022/02/16/sig-node-ci-subproject-celebrates-two-years-of-test-improvements/
7+
---
8+
9+
**Authors:** Sergey Kanzhelev (Google), Elana Hashman (Red Hat)
10+
11+
Ensuring the reliability of SIG Node upstream code is a continuous effort
12+
that takes a lot of behind-the-scenes work from many contributors.
13+
There are frequent releases of Kubernetes, base operating systems,
14+
container runtimes, and test infrastructure that result in a complex matrix that
15+
requires attention and steady investment to "keep the lights on."
16+
In May 2020, the Kubernetes node special interest group ("SIG Node") organized a new
17+
subproject for continuous integration (CI) for node-related code and tests. Since its
18+
inauguration, the SIG Node CI subproject has run a weekly meeting, and even the full hour
19+
is often not enough to triage all bugs, test-related PRs, and issues, and to discuss all
20+
related ongoing work within the subgroup.
21+
22+
Over the past two years, we've fixed merge-blocking and release-blocking tests, reducing the time it takes to merge Kubernetes contributors' pull requests by cutting down on test flakes. When we started, Node test jobs passed only 42% of the time; through our efforts, we now maintain a consistent job pass rate above 90%. We've closed 144 test failure issues and merged 176 pull requests just in kubernetes/kubernetes. And we've helped subproject participants ascend the Kubernetes contributor ladder, with 3 new org members, 6 new reviewers, and 2 new approvers.
23+
24+
The Node CI subproject is an approachable first stop to help new contributors
25+
get started with SIG Node. There is a low barrier to entry for new contributors
26+
to address high-impact bugs and test fixes, although there is a long
27+
road before contributors can climb the entire contributor ladder:
28+
it took over a year to establish two new approvers for the group.
29+
The complexity of all the different components that power Kubernetes nodes
30+
and its test infrastructure requires a sustained investment over a long period
31+
for developers to deeply understand the entire system,
32+
both at high and low levels of detail.
33+
34+
We have several regular contributors at our meetings; however, our reviewers
35+
and approvers pool is still small. It is our goal to continue to grow
36+
contributors to ensure a sustainable distribution of work
37+
that does not just fall to a few key approvers.
38+
39+
It's not always obvious how subprojects within SIGs are formed, operate,
40+
and work. Each is unique to its sponsoring SIG and tailored to the projects
41+
that the group is intended to support. As a group that has welcomed many
42+
first-time SIG Node contributors, we'd like to share some of the details and
43+
accomplishments over the past two years,
44+
helping to demystify our inner workings and celebrate the hard work
45+
of all our dedicated contributors!
46+
47+
## Timeline
48+
49+
***May 2020.*** The SIG Node CI group was formed on May 11, 2020, with more than
50+
[30 volunteers](https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#bookmark=id.vsb8pqnf4gib)
51+
signed up, to improve SIG Node CI signal and overall observability.
52+
Victor Pickard was focused on getting
53+
[testgrid jobs](https://testgrid.k8s.io/sig-node) passing
54+
when Ning Liao suggested forming a group around this effort and came up with
55+
the [original group charter document](https://docs.google.com/document/d/1yS-XoUl6GjZdjrwxInEZVHhxxLXlTIX2CeWOARmD8tY/edit#heading=h.te6sgum6s8uf).
56+
The SIG Node chairs sponsored group creation with Victor as a subproject lead.
57+
Sergey Kanzhelev joined Victor shortly after as a co-lead.
58+
59+
At the kick-off meeting, we discussed which tests to concentrate on fixing first
60+
and reviewed merge-blocking and release-blocking tests, many of which were failing due
61+
to infrastructure issues or buggy test code.
62+
63+
The subproject launched weekly hour-long meetings for ongoing work
64+
discussion and triage.
65+
66+
***June 2020.*** Morgan Bauer, Karan Goel, and Jorge Alarcon Ochoa were
67+
recognized as reviewers for the SIG Node CI group for their contributions,
68+
helping significantly with the early stages of the subproject.
69+
David Porter and Roy Yang also joined the SIG test failures GitHub team.
70+
71+
***August 2020.*** All merge-blocking and release-blocking tests were passing,
72+
with some flakes. However, only 42% of all SIG Node test jobs were green, as there
73+
were many flakes and failing tests.
74+
75+
***October 2020.*** Amim Knabben became a Kubernetes org member for his
76+
contributions to the subproject.
77+
78+
***January 2021.*** With healthy presubmit and critical periodic jobs passing,
79+
the subproject discussed its goal for cleaning up the rest of periodic tests
80+
and ensuring they passed without flakes.
81+
82+
Elana Hashman joined the subproject, stepping up to help lead it after
83+
Victor's departure.
84+
85+
***February 2021.*** Artyom Lukianov became a Kubernetes org member for his
86+
contributions to the subproject.
87+
88+
***August 2021.*** After SIG Node successfully ran a [bug scrub](https://groups.google.com/g/kubernetes-dev/c/w2ghO4ihje0/m/VeEql1LJBAAJ)
89+
to clean up its bug backlog, the scope of the meeting was extended to
90+
include bug triage to increase overall reliability, anticipating issues
91+
before they affect the CI signal.
92+
93+
Subproject leads Elana Hashman and Sergey Kanzhelev were both recognized as
94+
approvers on all node test code, supported by SIG Node and SIG Testing.
95+
96+
***September 2021.*** After significant deflaking progress with serial tests in
97+
the 1.22 release spearheaded by Francesco Romani, the subproject set a goal
98+
for getting the serial job fully passing by the 1.23 release date.
99+
100+
Mike Miranda became a Kubernetes org member for his contributions
101+
to the subproject.
102+
103+
***November 2021.*** Throughout 2021, SIG Node had no merge-blocking or
104+
release-blocking test failures. Many flaky tests from past releases were removed
105+
from release-blocking dashboards as they had been fully cleaned up.
106+
107+
Danielle Lancashire was recognized as a reviewer for the SIG Node subgroup's test code.
108+
109+
The final node serial tests were completely fixed. The serial tests consist of
110+
many disruptive and slow tests which tend to be flaky and are hard
111+
to troubleshoot. By the 1.23 release freeze, the last serial tests were
112+
fixed and the job was passing without flakes.
113+
114+
[![Slack announcement that Serial tests are green](serial-tests-green.png)](https://kubernetes.slack.com/archives/C0BP8PW9G/p1638211041322900)
115+
116+
The 1.23 release got a special shout-out for its test quality and CI signal.
117+
The SIG Node CI subproject was proud to have helped contribute to such
118+
a high-quality release, in part due to our efforts in identifying
119+
and fixing flakes in Node and beyond.
120+
121+
[![Slack shoutout that release was mostly green](release-mostly-green.png)](https://kubernetes.slack.com/archives/C92G08FGD/p1637175755023200)
122+
123+
***December 2021.*** An estimated 90% of test jobs were passing at the time of
124+
the 1.23 release (up from 42% in August 2020).
125+
126+
Dockershim code was removed from Kubernetes. This affected nearly half of SIG Node's
127+
test jobs and the SIG Node CI subproject reacted quickly and retargeted all the
128+
tests. SIG Node was the first SIG to complete test migrations off dockershim,
129+
providing examples for other affected SIGs. The vast majority of new jobs passed
130+
at the time of introduction without further fixes required. The [effort of
131+
removing dockershim](https://k8s.io/dockershim) from Kubernetes is ongoing.
132+
There are still some wrinkles from the dockershim removal as we uncover more
133+
dependencies on dockershim, but we plan to stabilize all test jobs
134+
by the 1.24 release.
135+
136+
## Statistics
137+
138+
Our regular meeting attendees and subproject participants for the past few months:
139+
140+
- Aditi Sharma
141+
- Artyom Lukianov
142+
- Arnaud Meukam
143+
- Danielle Lancashire
144+
- David Porter
145+
- Davanum Srinivas
146+
- Elana Hashman
147+
- Francesco Romani
148+
- Matthias Bertschy
149+
- Mike Miranda
150+
- Paco Xu
151+
- Peter Hunt
152+
- Ruiwen Zhao
153+
- Ryan Phillips
154+
- Sergey Kanzhelev
155+
- Skyler Clark
156+
- Swati Sehgal
157+
- Wenjun Wu
158+
159+
The [kubernetes/test-infra](https://github.com/kubernetes/test-infra/) source code repository contains test definitions. The number of
160+
Node PRs just in that repository:
161+
- 2020 PRs (since May): [183](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2020-05-01..2020-12-31+-author%3Ak8s-infra-ci-robot+)
162+
- 2021 PRs: [264](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2021-01-01..2021-12-31+-author%3Ak8s-infra-ci-robot+)
163+
164+
Triaged issues and PRs on the CI board (including triaging away from the subgroup scope):
165+
166+
- 2020 (since May): [132](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2020-05-01..2020-12-31)
167+
- 2021: [532](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2021-01-01..2021-12-31+)
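
The counts above come from GitHub search links. As a minimal sketch of how such links can be assembled, the snippet below joins the search qualifiers that appear in those URLs and URL-encodes them. The helper names (`search_url`, `node_pr_search`, `ci_board_search`) are hypothetical and used only for illustration; the qualifiers and base URLs are copied from the links in this post.

```python
from urllib.parse import quote_plus


def search_url(base: str, *qualifiers: str) -> str:
    """Join GitHub search qualifiers with spaces and URL-encode them as ?q=..."""
    query = " ".join(qualifiers)
    # quote_plus encodes ':' as %3A, '/' as %2F, and spaces as '+'
    return f"{base}?q={quote_plus(query)}"


def node_pr_search(start: str, end: str) -> str:
    """Closed sig/node PRs in kubernetes/test-infra created in [start, end]."""
    return search_url(
        "https://github.com/kubernetes/test-infra/pulls",
        "is:pr",
        "is:closed",
        "label:sig/node",
        f"created:{start}..{end}",
        "-author:k8s-infra-ci-robot",  # exclude the automation account
    )


def ci_board_search(start: str, end: str) -> str:
    """Issues and PRs on the kubernetes/43 project board created in [start, end]."""
    return search_url(
        "https://github.com/issues",
        "project:kubernetes/43",
        f"created:{start}..{end}",
    )


if __name__ == "__main__":
    print(node_pr_search("2020-05-01", "2020-12-31"))
    print(ci_board_search("2021-01-01", "2021-12-31"))
```

Changing the date range is enough to reuse the same queries for a later period. The totals themselves still have to be read from the GitHub search results page.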
168+
169+
## Future
170+
171+
Just "keeping the lights on" is a bold task and we are committed to improving this experience.
172+
We are working to simplify the triage and review processes for SIG Node.
173+
174+
Specifically, we are working on better test organization, naming,
175+
and tracking:
176+
177+
- https://github.com/kubernetes/enhancements/pull/3042
178+
- https://github.com/kubernetes/test-infra/issues/24641
179+
- [Kubernetes SIG-Node CI Testgrid Tracker](https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit#gid=0)
180+
181+
We are also making steady progress on improving test debuggability and de-flaking.
182+
183+
If any of this interests you, we'd love for you to join us!
184+
There's plenty to learn in debugging test failures, and it will help you gain
185+
familiarity with the code that SIG Node maintains.
186+
187+
You can always find information about the group on the
188+
[SIG Node](https://github.com/kubernetes/community/tree/master/sig-node) page.
189+
We give group updates at our maintainer track sessions, such as
190+
[KubeCon + CloudNativeCon Europe 2021](https://kccnceu2021.sched.com/event/iE8E/kubernetes-sig-node-intro-and-deep-dive-elana-hashman-red-hat-sergey-kanzhelev-google) and
191+
[KubeCon + CloudNativeCon North America 2021](https://kccncna2021.sched.com/event/lV9D/kubenetes-sig-node-intro-and-deep-dive-elana-hashman-derek-carr-red-hat-sergey-kanzhelev-dawn-chen-google?iframe=no&w=100%&sidebar=yes&bg=no).
192+
Join us in our mission to keep the kubelet and other SIG Node components reliable and ensure smooth and uneventful releases!