| 1 | +--- |
| 2 | +title: "Monitoring Kubernetes Health" |
| 3 | +weight: 14 |
| 4 | +description: | |
| 5 | + Guidelines for finding and reporting failing tests in Kubernetes. |
| 6 | +--- |
| 7 | + |
| 8 | +## Monitoring Kubernetes Health |
| 9 | + |
| 10 | +### Table of Contents |
| 11 | + |
| 12 | +- [Monitoring the health of Kubernetes CI Jobs with TestGrid](#monitoring-the-health-of-kubernetes-ci-jobs-with-testgrid) |
| 13 | +- [What dashboards should I monitor?](#what-dashboards-should-i-monitor) |
| 14 | +- [Test failures that block my Pull Request](#pull-request-test-failures-caused-by-tests-unrelated-to-your-change) |
| 15 | +- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert) |
| 16 | +- [Communicate your findings](#communicate-your-findings) |
| 17 | +- [Filling out an issue](#filling-out-an-issue) |
| 18 | +- [Iterate](#iterate) |
| 19 | + |
| 20 | +## Overview |
| 21 | + |
| 22 | +This document describes the tools used to monitor CI jobs that check the |
| 23 | +correctness of changes made to core Kubernetes. |
| 24 | + |
| 25 | +## Monitoring the health of Kubernetes CI Jobs with TestGrid |
| 26 | + |
| 27 | +TestGrid is a highly configurable, interactive dashboard for viewing your test |
| 28 | +results in a grid. TestGrid's back-end components are open sourced and can be |
| 29 | +viewed in the [TestGrid repo]. The front-end code |
| 30 | +that renders the dashboard is not currently open sourced. |
| 31 | + |
| 32 | +The Kubernetes community has its own [TestGrid instance] which we use to monitor |
| 33 | +and observe the health of the project. |
| 34 | + |
| 35 | +Each Special Interest Group, or [SIG], has its own set of dashboards. Each |
| 36 | +dashboard is composed of different jobs (build, unit test, integration test, |
| 37 | +end-to-end (e2e) test, etc.). These dashboards allow different teams to monitor |
| 38 | +and understand how their areas are doing. |
| 39 | + |
| 40 | +End-to-end (e2e) test jobs are in turn made up of test stages (e.g., |
| 41 | +bootstrapping a Kubernetes cluster, tearing down a Kubernetes cluster), and e2e |
| 42 | +tests are organized hierarchically per Component and Subcategory within that |
| 43 | +component. For example, the [Kubectl client component tests] |
| 44 | +have tests that describe the expected behavior of [Kubectl logs], |
| 45 | +one of which is described as [should be able to retrieve and filter logs]. |
| 46 | + |
| 47 | +This hierarchy is not currently reflected in TestGrid, so a test row will contain |
| 48 | +a flattened name that concatenates all of these strings into a single string. |
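
As a rough illustration of that flattening, the sketch below joins the levels of the hierarchy into one string. The function name `flatten_test_name`, the level strings, and the single-space separator are assumptions made for illustration; the exact way TestGrid assembles row names is not described in this document.

```python
def flatten_test_name(levels):
    """Join hierarchy levels (component, subcategory, test) into one row name.

    The single-space separator is an assumption for illustration only.
    """
    return " ".join(level.strip() for level in levels if level.strip())


# Hypothetical level strings, loosely based on the Kubectl example above.
row_name = flatten_test_name([
    "Kubectl client",
    "Kubectl logs",
    "should be able to retrieve and filter logs",
])
print(row_name)
```

Because the levels are merged into one string, searching a TestGrid row for any one level (for example, "Kubectl logs") is usually the quickest way to locate related tests.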
| 49 | + |
| 50 | +We highly encourage SIGs to periodically monitor the dashboards related to the |
| 51 | +sub-projects that they own. If you see that a job or test has been failing, |
| 52 | +please raise an issue with the corresponding SIG in either their mailing list or |
| 53 | +in Slack. |
| 54 | + |
| 55 | +In particular, we always welcome the following contributions: |
| 56 | + |
| 57 | +- [Triage Flaky Tests] |
| 58 | +- [Fix Flaky Tests] |
| 59 | + |
| 60 | +**Note**: It is important that all SIGs periodically monitor their jobs and |
| 61 | +tests. Furthermore, if jobs or tests are failing or flaking, then pull requests |
| 62 | +will take a lot longer to be merged. For more information on how flaking tests |
| 63 | +disrupt PR merging and how to eliminate them, see [Flaky Tests]. |
| 64 | + |
| 65 | +### What dashboards should I monitor? |
| 66 | + |
| 67 | +This depends on what areas of Kubernetes you want to contribute to. You should |
| 68 | +monitor the dashboards owned by the SIG you are working with. |
| 69 | + |
| 70 | +Additionally, you should check: |
| 71 | + |
| 72 | +- [sig-release-master-blocking] |
| 73 | +- [sig-release-master-informing] |
| 74 | + |
| 75 | +since these jobs run tests that are used by SIG Release to determine the overall |
| 76 | +quality of Kubernetes and whether or not the commit on master can be considered |
| 77 | +suitable for release. Failing tests on a job in sig-release-master-blocking |
| 78 | +block a release from taking place. |
| 79 | + |
| 80 | +If your contributions involve code for past releases of Kubernetes (e.g. |
| 81 | +cherry-picks or backports), we recommend you periodically check on the |
| 82 | +*blocking* and *informing* dashboards for [past releases]. |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## Pull request test failures caused by tests unrelated to your change |
| 87 | + |
| 88 | +If a test fails on your Pull Request, and it's clearly not related to the code |
| 89 | +you wrote, this presents an opportunity to improve the signal delivered by CI. |
| 90 | + |
| 91 | +Find any open issues that appear related (have the name of the test in them, |
| 92 | +describe a similar error, etc.). You can link the open issue in a comment you |
| 93 | +use to retrigger jobs, either calling the job out specifically: |
| 94 | + |
| 95 | +```markdown |
| 96 | +./test pull-kubernetes-foo |
| 97 | +https://github.com/kubernetes/kubernetes/issues/foo |
| 98 | +``` |
| 99 | + |
| 100 | +or even when just invoking retest: |
| 101 | + |
| 102 | +```markdown |
| 103 | +./retest |
| 104 | +https://github.com/kubernetes/kubernetes/issues/foo |
| 105 | +``` |
| 106 | + |
| 107 | +(Note: the leading `.` in these examples is there so that the examples do not actually trigger Prow.) |
| 108 | + |
| 109 | +You can back-link from the issue to your PR that encountered it, to bump the |
| 110 | +issue's last updated date. |
| 111 | + |
| 112 | +When you do this you are adding evidence to support the need to fix the issue by |
| 113 | +documenting the pain contributors are experiencing. |
| 114 | + |
| 115 | +## What do I do when I see a TestGrid alert? |
| 116 | + |
| 117 | +If you are part of a SIG's mailing list, occasionally you may see emails from |
| 118 | +TestGrid reporting that a job or a test has recently failed. |
| 119 | + |
| 120 | +Alerts are also displayed on the Summary Page of TestGrid dashboards when you |
| 121 | +click on the Show All Alerts button at the top of the Summary or Show Alerts |
| 122 | +for an individual Job. |
| 123 | + |
| 124 | +However, if you are viewing the summary page of a TestGrid dashboard, alerts are |
| 125 | +only of secondary interest, as the current status of the jobs that are part of |
| 126 | +the dashboard is displayed more prominently, as follows: |
| 127 | + |
| 128 | +- Passing jobs look like this |
| 129 | +<img src="./testgrid-images/testgrid-summary-passing-job.png"> |
| 130 | +- Flaky jobs look like this |
| 131 | +<img src="./testgrid-images/testgrid-summary-flaking-job.png"> |
| 132 | +- Failing jobs look like this, with the alert shown |
| 133 | +<img src="./testgrid-images/testgrid-summary-failing-job.png"> |
| 134 | + |
| 135 | +Taken from [sig-release-master-blocking] |
| 136 | + |
| 137 | +Note the metadata on the right-hand side showing job run times, the commit id of |
| 138 | +the last green (passing) job run, and the time at which the summary page was |
| 139 | +loaded (refreshing the browser updates the page and the update time). |
| 140 | + |
| 141 | +### Communicate your findings |
| 142 | + |
| 143 | +The number one thing to do is to communicate your findings: a test or job has |
| 144 | +been flaking or failing. If you saw a TestGrid alert on a mailing list, please |
| 145 | +reply to the thread and mention that you are looking into it. |
| 146 | + |
| 147 | +First, check GitHub to see if an issue has already been logged by checking the |
| 148 | +following: |
| 149 | + |
| 150 | +- [Issues logged as Flaky Tests - not triaged] |
| 151 | +- [Issues logged as Flaky Tests - triaged] |
| 152 | +- [CI Signal Board] (flaky test issues segmented by problem resolution workflow) |
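
To narrow such a search to one test, a search URL can be built by appending the test name to the query used in the links above. This is a minimal sketch: the helper name `flake_search_url` is hypothetical, and quoting the test name for an exact phrase match is an assumption about how you may want to search.

```python
from urllib.parse import urlencode

ISSUES_URL = "https://github.com/kubernetes/kubernetes/issues"


def flake_search_url(test_name):
    """Build a search URL for open flake issues that mention test_name."""
    # Same base query as the flaky test links in this document, plus the
    # test name in double quotes to request an exact phrase match.
    query = 'is:issue is:open kind/flake "' + test_name + '"'
    return ISSUES_URL + "?" + urlencode({"q": query})


print(flake_search_url("should be able to retrieve and filter logs"))
```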
| 153 | + |
| 154 | +If an issue has already been opened for the test, you can add any new findings |
| 155 | +that are not already documented in the issue. |
| 156 | + |
| 157 | +For example, if a test is flaking intermittently and you have found another |
| 158 | +incident where the test has failed that has not been recorded in the issue, then |
| 159 | +add the new information to the existing issue. |
| 160 | + |
| 161 | +You can: |
| 162 | + |
| 163 | +- Add a link to the Prow job where the latest test failure has occurred, and |
| 164 | +- Note the error message |
| 165 | + |
| 166 | +New evidence is especially useful if the root cause of the problem with the test |
| 167 | +has not yet been determined and the issue still has a *needs-triage* label. |
| 168 | + |
| 169 | +If the issue has not already been logged, please [create a new issue] in the |
| 170 | +kubernetes/kubernetes repo, and choose the appropriate issue template. |
| 171 | + |
| 172 | +You can jump directly to either test issue template using the following links: |
| 173 | + |
| 174 | +- [create a new issue - Failing Test] |
| 175 | +- [create a new issue - Flaking Test] |
| 176 | + |
| 177 | +#### Filling out an issue |
| 178 | + |
| 179 | +Both test issue templates are reasonably self-explanatory; what follows are |
| 180 | +guidelines and tips on filling out the templates. |
| 181 | + |
| 182 | +When logging a Flaking or Failing test, please: |
| 183 | + |
| 184 | +- use plain text when referring to test names and job names. Inconsistent |
| 185 | + formatting of names makes it harder to process issues via automation. |
| 186 | +- keep an eye out for test names that contain markdown parse-able formatting. |
| 187 | + |
| 188 | +If you are a test maintainer, refrain from including markdown in strings that |
| 189 | +are used to name your tests and test components. |
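
One way to spot such names is to check them for characters that Markdown may interpret as formatting. The sketch below is illustrative only: the character set and the helper name `markdown_chars_in` are assumptions, not part of any Kubernetes tooling.

```python
# Characters that Markdown may treat as formatting; this set is an
# illustrative assumption, not an exhaustive list.
MARKDOWN_CHARS = set("*_`[]<>#~|")


def markdown_chars_in(name):
    """Return the sorted Markdown-sensitive characters found in a test name."""
    return sorted(set(name) & MARKDOWN_CHARS)


name = "[sig-node] Summary API [NodeConformance] when querying /stats/summary"
print(markdown_chars_in(name))
```

A non-empty result is a hint to paste the name as plain text and check how it renders in the issue preview.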
| 190 | + |
| 191 | +#### Fill out the issue for a Flaking Test |
| 192 | + |
| 193 | +1. **Which jobs are flaking** |
| 194 | + |
| 195 | +The example below was taken from the SIG Release dashboard: |
| 196 | + |
| 197 | +<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="100%"> |
| 198 | + |
| 199 | +We can see that the following jobs were flaky at the time this screenshot was taken: |
| 200 | + |
| 201 | +- [conformance-ga-only] |
| 202 | +- [skew-cluster-latest-kubectl-stable1-gce] |
| 203 | +- [gci-gce-ingress] |
| 204 | +- [kind-master-parallel] |
| 205 | + |
| 206 | +2. **Which tests are flaking** |
| 207 | + |
| 208 | +Let's grab an example from the SIG Release dashboards and look at the |
| 209 | +`node-kubelet-master` job in sig-release-master-blocking ([node-kubelet-master]). |
| 210 | + |
| 211 | +<img src="./testgrid-images/test-grid-job.svg" height="70%" width="100%"> |
| 212 | + |
| 213 | +Here we see that at 07:19 IST the tests |
| 214 | + |
| 215 | +```text |
| 216 | +E2eNode Suite.[sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api [cos-stable2] |
| 217 | +kubetest.Node Tests [runner] |
| 218 | +``` |
| 219 | + |
| 220 | +failed for Kubernetes commit `d8f9e4587`. |
| 221 | +The corresponding test-infra commit was `fe9c22dc8`. |
| 222 | + |
| 223 | +3. **Since when has it been flaking** |
| 224 | + |
| 225 | +You can get the start time of a flake from the header of the TestGrid page |
| 226 | +showing you all the tests. The red numbers in the screenshot above annotate the |
| 227 | +grid headings. |
| 228 | + |
| 229 | +They are: |
| 230 | + |
| 231 | +- 1 This row has the times each Prow job was started; each column on the grid |
| 232 | +  represents a single run of the Prow job |
| 233 | +- 2 This row is the Prow job run id number |
| 234 | +- 3 This is the kubernetes/kubernetes commit id that was tested |
| 235 | +- 4 These are the kubernetes/test-infra commit ids that were used to build and |
| 236 | +  run the Prow job; kubernetes/test-infra contains CI job definition YAML, builds |
| 237 | +  for container images used in CI on the Kubernetes project, and also code that |
| 238 | +  implements a lot of the components used to deliver CI, such as Prow, SpyGlass |
| 239 | +  and other components. |
| 240 | + |
| 241 | +Click on a cell in the grid to take you to SpyGlass which displays the Prow job |
| 242 | +results. |
| 243 | + |
| 244 | +You can also find this data in Triage (see below). |
| 245 | + |
| 246 | +4. **Reason for failure** |
| 247 | + |
| 248 | +Logging an issue brings the flake or failure to the attention of the wider |
| 249 | +community. As the issue reporter, you do not have to find the reason for failure |
| 250 | +right away (nor the solution). You can just log the error reported by the test |
| 251 | +when the job was run. |
| 252 | + |
| 253 | +Click on the failed runs (the red cells in the grid) to see the results in |
| 254 | +SpyGlass. |
| 255 | + |
| 256 | +For `node-kubelet-master`, we see the following: |
| 257 | + |
| 258 | + |
| 259 | + |
| 260 | +Here we see that 2 tests failed (both related to the node problem detector) and |
| 261 | +the `e2e.go: Node Tests` stage was marked as failed (because the node problem |
| 262 | +detector tests failed). |
| 263 | + |
| 264 | +You will often see "stages" (steps in an e2e job) mixed in with the tests |
| 265 | +themselves. The stages tell you what was going on in the e2e job when an error |
| 266 | +occurred. |
| 267 | + |
| 268 | +If we click on the first test error, we will see logs that will (hopefully) help |
| 269 | +us figure out why the test failed. |
| 270 | + |
| 271 | + |
| 272 | +Further down the page you will see all the logs for the entire test run. |
| 273 | +Please copy any information you think may be useful from here into the issue. |
| 274 | + |
| 275 | +You can reference a specific line in the logs by clicking on the line number and |
| 276 | +then copying the URL which will now include an anchor to the specific line. |
| 277 | + |
| 278 | +5. **Anything else we need to know** |
| 279 | + |
| 280 | +It is important to review the behavior of the flaking test across a range of |
| 281 | +jobs using [Triage]. |
| 282 | + |
| 283 | +We can use the Triage tool to see if a test we see failing in a given job has |
| 284 | +been failing in others and to understand how jobs are behaving. |
| 285 | + |
| 286 | +For example, we can see how the job we have been looking at has been behaving |
| 287 | +recently. |
| 288 | + |
| 289 | +One important detail is that the job names you see on tabs in TestGrid are often |
| 290 | +aliases. Job definition details including the job name, the job definition |
| 291 | +configuration file and a description of the job can be found below the tab name |
| 292 | +in TestGrid with a URL pointing to the YAML file where the job is configured. |
| 293 | + |
| 294 | +For example, when we clicked on a test run for [node-kubelet-master], |
| 295 | +the job name, `ci-kubernetes-node-kubelet-features`, can be found at the top left |
| 296 | +corner of the Spyglass page (notice the "ci-kubernetes-" prefix). |
| 297 | + |
| 298 | +Then we can run a query on Triage using [ci-kubernetes-node-kubelet-features in |
| 299 | +the job field]. Note that the Triage query can be bookmarked and can be used as a |
| 300 | +deep link that can be added to GitHub issues to assist test maintainers in |
| 301 | +understanding what is wrong with a test. |
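
Such a deep link can also be assembled programmatically, for example when preparing links for several jobs. This is a minimal sketch: the helper name `triage_job_url` is hypothetical, and the `pr` and `job` parameters are simply copied from the Triage link used in this document.

```python
from urllib.parse import urlencode

TRIAGE_URL = "https://go.k8s.io/triage"


def triage_job_url(job_name):
    """Build a Triage deep link that filters on the given job name."""
    # pr=1 and job=... mirror the query string of this document's Triage link.
    params = urlencode({"pr": 1, "job": job_name})
    return TRIAGE_URL + "?" + params


print(triage_job_url("ci-kubernetes-node-kubelet-features"))
```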
| 302 | + |
| 303 | +At the time of this writing we saw the following: |
| 304 | + |
| 305 | +<img src="./testgrid-images/triage.png" height="50%" width="100%"> |
| 306 | + |
| 307 | +**Note**: you can also refine your query by filtering or excluding |
| 308 | +results based on test name or failure text. |
| 309 | + |
| 310 | +Sometimes, Triage will help you find patterns to figure out the root cause of |
| 311 | +the problem. In this instance, we can also see that this job has been failing |
| 312 | +about 2 times per hour. |
| 313 | + |
| 314 | +### Iterate |
| 315 | + |
| 316 | +Once you have filled out the issue, please mention it in the appropriate mailing |
| 317 | +list thread (if you see an email from TestGrid mentioning a job or test failure) |
| 318 | +and share it with the appropriate SIG in the Kubernetes Slack. |
| 319 | + |
| 320 | +Don't worry if you are not sure how to debug further or how to resolve the |
| 321 | +issue! All issues are unique and require a bit of experience to figure out how |
| 322 | +to work on them. For the time being, reach out to people in Slack or the mailing |
| 323 | +list. |
| 324 | + |
| 325 | +[TestGrid repo]: https://github.com/GoogleCloudPlatform/testgrid |
| 326 | +[TestGrid instance]: https://testgrid.k8s.io/ |
| 327 | + |
| 328 | +[SIG]: https://github.com/kubernetes/community/blob/master/sig-list.md |
| 329 | + |
| 330 | +[Kubectl client component tests]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L229 |
| 331 | +[Kubectl logs]:https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1389 |
| 332 | +[should be able to retrieve and filter logs]:https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1412 |
| 333 | + |
| 334 | +[Triage Flaky Tests]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake |
| 335 | +[Fix Flaky Tests]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+ |
| 336 | + |
| 337 | +[Flaky Tests]:https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#flaky-tests |
| 338 | + |
| 339 | +[sig-release-master-blocking]:https://testgrid.k8s.io/sig-release-master-blocking |
| 340 | +[sig-release-master-informing]:https://testgrid.k8s.io/sig-release-master-informing |
| 341 | + |
| 342 | +[past releases]:https://testgrid.k8s.io/sig-release |
| 343 | + |
| 344 | +[create a new issue]:https://github.com/kubernetes/kubernetes/issues/new/choose |
| 345 | +[create a new issue - Failing Test]:https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffailing-test&template=failing-test.md |
| 346 | +[create a new issue - Flaking Test]:https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Fflake&template=flaking-test.md |
| 347 | + |
| 348 | +[Issues logged as Flaky Tests - not triaged]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake |
| 349 | +[Issues logged as Flaky Tests - triaged]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+ |
| 350 | + |
| 351 | +[CI Signal Board]:https://github.com/orgs/kubernetes/projects/11 |
| 352 | + |
| 353 | +[conformance-ga-only]:https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only |
| 354 | +[skew-cluster-latest-kubectl-stable1-gce]:https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce |
| 355 | +[gci-gce-ingress]:https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress |
| 356 | +[kind-master-parallel]:https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel |
| 357 | +[node-kubelet-master]:https://testgrid.k8s.io/sig-release-master-blocking#node-kubelet-master |
| 358 | + |
| 359 | +[Triage]:https://go.k8s.io/triage |
| 360 | +[ci-kubernetes-node-kubelet-features in the job field]:https://go.k8s.io/triage?pr=1&job=ci-kubernetes-node-kubelet-features |