|
| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: 'SIG Node CI Subproject Celebrates Two Years of Test Improvements' |
| 4 | +date: 2022-02-16 |
| 5 | +slug: sig-node-ci-subproject-celebrates |
| 6 | +canonicalUrl: https://www.kubernetes.dev/blog/2022/02/16/sig-node-ci-subproject-celebrates-two-years-of-test-improvements/ |
| 7 | +--- |
| 8 | + |
| 9 | +**Authors:** Sergey Kanzhelev (Google), Elana Hashman (Red Hat) |
| 10 | + |
| 11 | +Ensuring the reliability of SIG Node upstream code is a continuous effort |
| 12 | +that takes a lot of behind-the-scenes work from many contributors. |
| 13 | +There are frequent releases of Kubernetes, base operating systems, |
| 14 | +container runtimes, and test infrastructure that result in a complex matrix that |
| 15 | +requires attention and steady investment to "keep the lights on." |
| 16 | +In May 2020, the Kubernetes node special interest group ("SIG Node") organized a new |
| 17 | +subproject for continuous integration (CI) for node-related code and tests. Since its |
| 18 | +inauguration, the SIG Node CI subproject has run a weekly meeting, and even the full hour |
| 19 | +is often not enough to triage all bugs, test-related PRs and issues, and to discuss all |
| 20 | +related ongoing work within the subgroup. |
| 21 | + |
| 22 | +Over the past two years, we've fixed merge-blocking and release-blocking tests, reducing time to merge Kubernetes contributors' pull requests thanks to reduced test flakes. When we started, Node test jobs only passed 42% of the time, and through our efforts, we now ensure a consistent >90% job pass rate. We've closed 144 test failure issues and merged 176 pull requests just in kubernetes/kubernetes. And we've helped subproject participants ascend the Kubernetes contributor ladder, with 3 new org members, 6 new reviewers, and 2 new approvers. |
| 23 | + |
| 24 | +The Node CI subproject is an approachable first stop to help new contributors |
| 25 | +get started with SIG Node. There is a low barrier to entry for new contributors |
| 26 | +to address high-impact bugs and test fixes, although there is a long |
| 27 | +road before contributors can climb the entire contributor ladder: |
| 28 | +it took over a year to establish two new approvers for the group. |
| 29 | +The complexity of all the different components that power Kubernetes nodes |
| 30 | +and its test infrastructure requires a sustained investment over a long period |
| 31 | +for developers to deeply understand the entire system, |
| 32 | +both at high and low levels of detail. |
| 33 | + |
| 34 | +We have several regular contributors at our meetings; however, our pool of reviewers |
| 35 | +and approvers is still small. It is our goal to continue to grow |
| 36 | +contributors to ensure a sustainable distribution of work |
| 37 | +that does not just fall to a few key approvers. |
| 38 | + |
| 39 | +It's not always obvious how subprojects within SIGs are formed, operate, |
| 40 | +and work. Each is unique to its sponsoring SIG and tailored to the projects |
| 41 | +that the group is intended to support. As a group that has welcomed many |
| 42 | +first-time SIG Node contributors, we'd like to share some of the details and |
| 43 | +accomplishments over the past two years, |
| 44 | +helping to demystify our inner workings and celebrate the hard work |
| 45 | +of all our dedicated contributors! |
| 46 | + |
| 47 | +## Timeline |
| 48 | + |
| 49 | +***May 2020.*** SIG Node CI group was formed on May 11, 2020, with more than |
| 50 | +[30 volunteers](https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#bookmark=id.vsb8pqnf4gib) |
| 51 | +signed up, to improve SIG Node CI signal and overall observability. |
| 52 | +Victor Pickard was focused on getting |
| 53 | +[testgrid jobs](https://testgrid.k8s.io/sig-node) passing |
| 54 | +when Ning Liao suggested forming a group around this effort and came up with |
| 55 | +the [original group charter document](https://docs.google.com/document/d/1yS-XoUl6GjZdjrwxInEZVHhxxLXlTIX2CeWOARmD8tY/edit#heading=h.te6sgum6s8uf). |
| 56 | +The SIG Node chairs sponsored group creation with Victor as a subproject lead. |
| 57 | +Sergey Kanzhelev joined Victor shortly after as a co-lead. |
| 58 | + |
| 59 | +At the kick-off meeting, we discussed which tests to concentrate on fixing first |
| 60 | +and reviewed merge-blocking and release-blocking tests, many of which were failing due |
| 61 | +to infrastructure issues or buggy test code. |
| 62 | + |
| 63 | +The subproject launched weekly hour-long meetings for ongoing work |
| 64 | +discussion and triage. |
| 65 | + |
| 66 | +***June 2020.*** Morgan Bauer, Karan Goel, and Jorge Alarcon Ochoa were |
| 67 | +recognized as reviewers for the SIG Node CI group for their contributions, |
| 68 | +helping significantly with the early stages of the subproject. |
| 69 | +David Porter and Roy Yang also joined the SIG test failures GitHub team. |
| 70 | + |
| 71 | +***August 2020.*** All merge-blocking and release-blocking tests were passing, |
| 72 | +with some flakes. However, only 42% of all SIG Node test jobs were green, as there |
| 73 | +were many flakes and failing tests. |
| 74 | + |
| 75 | +***October 2020.*** Amim Knabben became a Kubernetes org member for his |
| 76 | +contributions to the subproject. |
| 77 | + |
| 78 | +***January 2021.*** With healthy presubmit and critical periodic jobs passing, |
| 79 | +the subproject discussed its goal for cleaning up the rest of periodic tests |
| 80 | +and ensuring they passed without flakes. |
| 81 | + |
| 82 | +Elana Hashman joined the subproject, stepping up to help lead it after |
| 83 | +Victor's departure. |
| 84 | + |
| 85 | +***February 2021.*** Artyom Lukianov became a Kubernetes org member for his |
| 86 | +contributions to the subproject. |
| 87 | + |
| 88 | +***August 2021.*** After SIG Node successfully ran a [bug scrub](https://groups.google.com/g/kubernetes-dev/c/w2ghO4ihje0/m/VeEql1LJBAAJ) |
| 89 | +to clean up its bug backlog, the scope of the meeting was extended to |
| 90 | +include bug triage to increase overall reliability, anticipating issues |
| 91 | +before they affect the CI signal. |
| 92 | + |
| 93 | +Subproject leads Elana Hashman and Sergey Kanzhelev were both recognized as |
| 94 | +approvers on all node test code, supported by SIG Node and SIG Testing. |
| 95 | + |
| 96 | +***September 2021.*** After significant deflaking progress with serial tests in |
| 97 | +the 1.22 release spearheaded by Francesco Romani, the subproject set a goal |
| 98 | +for getting the serial job fully passing by the 1.23 release date. |
| 99 | + |
| 100 | +Mike Miranda became a Kubernetes org member for his contributions |
| 101 | +to the subproject. |
| 102 | + |
| 103 | +***November 2021.*** Throughout 2021, SIG Node had no merge-blocking or |
| 104 | +release-blocking test failures. Many flaky tests from past releases were removed |
| 105 | +from release-blocking dashboards as they had been fully cleaned up. |
| 106 | + |
| 107 | +Danielle Lancashire was recognized as a reviewer for the SIG Node subgroup's test code. |
| 108 | + |
| 109 | +The final node serial tests were completely fixed. The serial tests consist of |
| 110 | +many disruptive and slow tests which tend to be flaky and are hard |
| 111 | +to troubleshoot. By the 1.23 release freeze, the last serial tests were |
| 112 | +fixed and the job was passing without flakes. |
| 113 | + |
| 114 | +[Slack message](https://kubernetes.slack.com/archives/C0BP8PW9G/p1638211041322900) |
| 115 | + |
| 116 | +The 1.23 release got a special shout-out for its test quality and CI signal. |
| 117 | +The SIG Node CI subproject was proud to have helped contribute to such |
| 118 | +a high-quality release, in part due to our efforts in identifying |
| 119 | +and fixing flakes in Node and beyond. |
| 120 | + |
| 121 | +[Slack message](https://kubernetes.slack.com/archives/C92G08FGD/p1637175755023200) |
| 122 | + |
| 123 | +***December 2021.*** An estimated 90% of test jobs were passing at the time of |
| 124 | +the 1.23 release (up from 42% in August 2020). |
| 125 | + |
| 126 | +Dockershim code was removed from Kubernetes. This affected nearly half of SIG Node's |
| 127 | +test jobs, and the SIG Node CI subproject reacted quickly and retargeted all the |
| 128 | +tests. SIG Node was the first SIG to complete test migrations off dockershim, |
| 129 | +providing examples for other affected SIGs. The vast majority of new jobs passed |
| 130 | +at the time of introduction without further fixes required. The [effort of |
| 131 | +removing dockershim](https://k8s.io/dockershim) from Kubernetes is ongoing. |
| 132 | +There are still some wrinkles from the dockershim removal as we uncover more |
| 133 | +dependencies on dockershim, but we plan to stabilize all test jobs |
| 134 | +by the 1.24 release. |
| 135 | + |
| 136 | +## Statistics |
| 137 | + |
| 138 | +Our regular meeting attendees and subproject participants for the past few months: |
| 139 | + |
| 140 | +- Aditi Sharma |
| 141 | +- Artyom Lukianov |
| 142 | +- Arnaud Meukam |
| 143 | +- Danielle Lancashire |
| 144 | +- David Porter |
| 145 | +- Davanum Srinivas |
| 146 | +- Elana Hashman |
| 147 | +- Francesco Romani |
| 148 | +- Matthias Bertschy |
| 149 | +- Mike Miranda |
| 150 | +- Paco Xu |
| 151 | +- Peter Hunt |
| 152 | +- Ruiwen Zhao |
| 153 | +- Ryan Phillips |
| 154 | +- Sergey Kanzhelev |
| 155 | +- Skyler Clark |
| 156 | +- Swati Sehgal |
| 157 | +- Wenjun Wu |
| 158 | + |
| 159 | +The [kubernetes/test-infra](https://github.com/kubernetes/test-infra/) source code repository contains test definitions. The number of |
| 160 | +Node PRs just in that repository: |
| 161 | +- 2020 PRs (since May): [183](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2020-05-01..2020-12-31+-author%3Ak8s-infra-ci-robot+) |
| 162 | +- 2021 PRs: [264](https://github.com/kubernetes/test-infra/pulls?q=is%3Apr+is%3Aclosed+label%3Asig%2Fnode+created%3A2021-01-01..2021-12-31+-author%3Ak8s-infra-ci-robot+) |
| 163 | + |
| 164 | +Triaged issues and PRs on the CI board (including triaging away from the subgroup scope): |
| 165 | + |
| 166 | +- 2020 (since May): [132](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2020-05-01..2020-12-31) |
| 167 | +- 2021: [532](https://github.com/issues?q=project%3Akubernetes%2F43+created%3A2021-01-01..2021-12-31+) |
| 168 | + |
| 169 | +## Future |
| 170 | + |
| 171 | +Just "keeping the lights on" is a bold task and we are committed to improving this experience. |
| 172 | +We are working to simplify the triage and review processes for SIG Node. |
| 173 | + |
| 174 | +Specifically, we are working on better test organization, naming, |
| 175 | +and tracking: |
| 176 | + |
| 177 | +- https://github.com/kubernetes/enhancements/pull/3042 |
| 178 | +- https://github.com/kubernetes/test-infra/issues/24641 |
| 179 | +- [Kubernetes SIG-Node CI Testgrid Tracker](https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit#gid=0) |
| 180 | + |
| 181 | +We are also constantly making progress on improved test debuggability and de-flaking. |
| 182 | + |
| 183 | +If any of this interests you, we'd love for you to join us! |
| 184 | +There's plenty to learn in debugging test failures, and it will help you gain |
| 185 | +familiarity with the code that SIG Node maintains. |
| 186 | + |
| 187 | +You can always find information about the group on the |
| 188 | +[SIG Node](https://github.com/kubernetes/community/tree/master/sig-node) page. |
| 189 | +We give group updates at our maintainer track sessions, such as |
| 190 | +[KubeCon + CloudNativeCon Europe 2021](https://kccnceu2021.sched.com/event/iE8E/kubernetes-sig-node-intro-and-deep-dive-elana-hashman-red-hat-sergey-kanzhelev-google) and |
| 191 | +[KubeCon + CloudNativeCon North America 2021](https://kccncna2021.sched.com/event/lV9D/kubenetes-sig-node-intro-and-deep-dive-elana-hashman-derek-carr-red-hat-sergey-kanzhelev-dawn-chen-google?iframe=no&w=100%&sidebar=yes&bg=no). |
| 192 | +Join us in our mission to keep the kubelet and other SIG Node components reliable and ensure smooth and uneventful releases! |