---
title: "Monitoring Kubernetes Health"
weight: 14
description: |
  Guidelines for finding and reporting failing tests in Kubernetes.
---

## Monitoring Kubernetes Health

### Table of Contents

- [Monitoring the health of Kubernetes CI Jobs with TestGrid](#monitoring-the-health-of-kubernetes-ci-jobs-with-testgrid)
- [What dashboards should I monitor?](#what-dashboards-should-i-monitor)
- [Pull request test failures caused by tests unrelated to your change](#pull-request-test-failures-caused-by-tests-unrelated-to-your-change)
- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert)
- [Communicate your findings](#communicate-your-findings)
- [Filling out an issue](#filling-out-an-issue)
- [Iterate](#iterate)
## Overview

This document describes the tools used to monitor CI jobs that check the
correctness of changes made to core Kubernetes.

## Monitoring the health of Kubernetes CI Jobs with TestGrid

TestGrid is a highly configurable, interactive dashboard for viewing your test
results in a grid. TestGrid's back-end components are open sourced and can be
viewed in the [TestGrid repo]. The front-end code that renders the dashboard is
not currently open sourced.

The Kubernetes community has its own [TestGrid instance], which we use to
monitor and observe the health of the project.

Each Special Interest Group, or [SIG], has its own set of dashboards. Each
dashboard is composed of different jobs (build, unit test, integration test,
end-to-end (e2e) test, etc.). These dashboards allow different teams to monitor
and understand how their areas are doing.

End-to-end (e2e) jobs are in turn made up of test stages (e.g., bootstrapping a
Kubernetes cluster, tearing down a Kubernetes cluster), and e2e tests are
organized hierarchically by component and subcategory within that component.
For example, the [Kubectl client component tests] have tests that describe the
expected behavior of [Kubectl logs], one of which is described as
[should be able to retrieve and filter logs].

This hierarchy is not currently reflected in TestGrid, so a test row will
contain a flattened name that concatenates all of these strings into a single
string.
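For example, the Kubectl test above shows up in TestGrid as a single flattened
row name along the lines of the following (illustrative; the exact suite prefix
may differ):

```text
Kubernetes e2e suite.[sig-cli] Kubectl client Kubectl logs should be able to retrieve and filter logs
```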
We highly encourage SIGs to periodically monitor the dashboards related to the
sub-projects that they own. If you see that a job or test has been failing,
please raise an issue with the corresponding SIG, either on their mailing list
or in Slack.

In particular, we always welcome the following contributions:

- [Triage Flaky Tests]
- [Fix Flaky Tests]

**Note**: It is important that all SIGs periodically monitor their jobs and
tests. If jobs or tests are failing or flaking, pull requests will take much
longer to be merged. For more information on how flaky tests disrupt PR merging
and how to eliminate them, see [Flaky Tests].

### What dashboards should I monitor?

This depends on what areas of Kubernetes you want to contribute to. You should
monitor the dashboards owned by the SIG you are working with.

Additionally, you should check:

- [sig-release-master-blocking]
- [sig-release-master-informing]

since these jobs run tests that SIG Release uses to determine the overall
quality of Kubernetes and whether the commit on master can be considered
suitable for release. Failing tests on a job in sig-release-master-blocking
block a release from taking place.

If your contributions involve code for past releases of Kubernetes (e.g.,
cherry-picks or backports), we recommend you periodically check the *blocking*
and *informing* dashboards for [past releases].

---
## Pull request test failures caused by tests unrelated to your change

If a test fails on your Pull Request and it's clearly not related to the code
you wrote, this presents an opportunity to improve the signal delivered by CI.

Find any open issues that appear related (they have the name of the test in
them, describe a similar error, etc.). You can link the open issue in a comment
you use to retrigger jobs, either calling the job out specifically:

```markdown
./test pull-kubernetes-foo
https://github.com/kubernetes/kubernetes/issues/foo
```

or when simply invoking a retest:

```markdown
./retest
https://github.com/kubernetes/kubernetes/issues/foo
```

(Note: the `.` prefixes are there so you don't actually trigger Prow.)

You can back-link from the issue to the PR that encountered it, to bump the
issue's last updated date.

When you do this, you are adding evidence to support the need to fix the issue
by documenting the pain contributors are experiencing.
## What do I do when I see a TestGrid alert?

If you are part of a SIG's mailing list, occasionally you may see emails from
TestGrid reporting that a job or a test has recently failed.

Alerts are also displayed on the summary page of TestGrid dashboards when you
click the Show All Alerts button at the top of the summary, or Show Alerts for
an individual job.

However, if you are viewing the summary page of a TestGrid dashboard, alerts
are only of secondary interest, since the current status of the jobs that are
part of the dashboard is displayed more prominently, as follows:

- Passing jobs look like this
<img src="./testgrid-images/testgrid-summary-passing-job.png">
- Flaky jobs look like this
<img src="./testgrid-images/testgrid-summary-flaking-job.png">
- Failing jobs look like this (with an alert shown)
<img src="./testgrid-images/testgrid-summary-failing-job.png">

These examples are taken from [sig-release-master-blocking].

Note the metadata on the right-hand side showing job run times, the commit id of
the last green (passing) job run, and the time at which the summary page was
loaded (refreshing the browser updates the page and this timestamp).

### Communicate your findings

The number one thing to do is to communicate your findings: a test or job has
been flaking or failing. If you saw a TestGrid alert on a mailing list, please
reply to the thread and mention that you are looking into it.

First, check GitHub to see if an issue has already been logged, using the
following (a narrowed search example follows this list):

- [Issues logged as Flaky Tests - not triaged]
- [Issues logged as Flaky Tests - triaged]
- [CI Signal Board]: flaky test issues segmented by problem resolution workflow.
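If those queries are too broad, you can narrow the search to a specific test by
quoting part of its name in GitHub's issue search; for example (an illustrative
query):

```text
is:issue is:open label:kind/flake "should be able to retrieve and filter logs"
```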
If an issue has already been opened for the test, you can add any new findings
that are not already documented in the issue.

For example, if a test is flaking intermittently and you have found another
incident of the test failing that has not been recorded in the issue, add the
new information to the existing issue.

You can:

- Add a link to the Prow job where the latest test failure has occurred, and
- Note the error message

New evidence is especially useful if the root cause of the problem with the test
has not yet been determined and the issue still has a *needs-triage* label.

If the issue has not already been logged, please [create a new issue] in the
kubernetes/kubernetes repo, and choose the appropriate issue template.

You can jump directly to either test issue type using the following links:

- [create a new issue - Failing Test]
- [create a new issue - Flaking Test]

#### Filling out an issue

Both test issue templates are reasonably self-explanatory; what follows are
guidelines and tips on filling them out.

When logging a flaking or failing test, please:

- Use plain text when referring to test names and job names. Inconsistent
formatting of names makes it harder to process issues via automation.
- Keep an eye out for test names that contain markdown-parseable formatting.

If you are a test maintainer, refrain from including markdown in strings that
are used to name your tests and test components.
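For example, a plain-text entry in the issue body might look like this
(an illustrative sketch using the job and test names that appear later in this
document):

```text
Which jobs are flaking: ci-kubernetes-node-kubelet-features
Which tests are flaking:
E2eNode Suite.[sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api [cos-stable2]
```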
#### Fill out the issue for a Flaking Test

1. **Which jobs are flaking**

The example below was taken from the SIG Release dashboard:

<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="100%">

We can see that the following jobs were flaky at the time this screenshot was
taken:

- [conformance-ga-only]
- [skew-cluster-latest-kubectl-stable1-gce]
- [gci-gce-ingress]
- [kind-master-parallel]

2. **Which tests are flaking**

Let's grab an example from the SIG Release dashboards and look at the
[node-kubelet-master] job in the sig-release-master-blocking dashboard.

<img src="./testgrid-images/test-grid-job.svg" height="70%" width="100%">

Here we see that at 07:19 IST the tests

```text
E2eNode Suite.[sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api [cos-stable2]
kubetest.Node Tests [runner]
```

failed for Kubernetes commit `d8f9e4587`. The corresponding test-infra commit
was `fe9c22dc8`.

3. **Since when has it been flaking**

You can get the start time of a flake from the header of the TestGrid page
showing you all the tests. The red numbers in the screenshot above annotate the
grid headings.

They are:

- 1: This row has the time each Prow job was started; each column on the grid
represents a single run of the Prow job.
- 2: This row is the Prow job run id number.
- 3: This is the kubernetes/kubernetes commit id that was tested.
- 4: These are the kubernetes/test-infra commit ids that were used to build and
run the Prow job; kubernetes/test-infra contains the CI job definition YAML,
builds for container images used in CI on the Kubernetes project, and code that
implements many of the components used to deliver CI, such as Prow and
Spyglass.

Click on a cell in the grid to go to Spyglass, which displays the Prow job
results.

You can also find this data in Triage (see below).

4. **Reason for failure**

Logging an issue brings the flake or failure to the attention of the wider
community. As the issue reporter, you do not have to find the reason for the
failure right away (nor the solution). You can just log the error reported by
the test when the job was run.

Click on the failed runs (the red cells in the grid) to see the results in
Spyglass.

For `node-kubelet-master`, we see the following:

![Spyglass Prow job results for node-kubelet-master](./testgrid-images/spyglass-summary-node-kubelet-master.png "Spyglass Prow job results viewer")

Here we see that 2 tests failed (both related to the node problem detector) and
the `e2e.go: Node Tests` stage was marked as failed (because the node problem
detector tests failed).

You will often see "stages" (steps in an e2e job) mixed in with the tests
themselves. The stages tell you what was going on in the e2e job when an error
occurred.

If we click on the first test error, we will see logs that will (hopefully) help
us figure out why the test failed.

![Spyglass - Prow job results viewer](./testgrid-images/spyglass-result.png "Spyglass Prow job results viewer")

Further down the page you will see all the logs for the entire test run.
Please copy any information you think may be useful from here into the issue.

You can reference a specific line in the logs by clicking on the line number and
then copying the URL, which will now include an anchor to that specific line.

5. **Anything else we need to know**

It is important to review the behavior of the flaking test across a range of
jobs using [Triage].

We can use the Triage tool to see whether a test that is failing in a given job
has been failing in others, and to understand how jobs are behaving.

For example, we can see how the job we have been looking at has been behaving
recently.

One important detail is that the job names you see on tabs in TestGrid are often
aliases. Job definition details, including the job name, the job definition
configuration file, and a description of the job, can be found below the tab
name in TestGrid, with a URL pointing to the YAML file where the job is
configured.

For example, when we click on a test run for [node-kubelet-master], the job
name, `ci-kubernetes-node-kubelet-features`, can be found at the top-left
corner of the Spyglass page (notice the `ci-kubernetes-` prefix).

Then we can run a query on Triage using [ci-kubernetes-node-kubelet-features in
the job field]. Note that the Triage query can be bookmarked and used as a deep
link in GitHub issues to assist test maintainers in understanding what is wrong
with a test.
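For example, the bookmarkable deep link for the query above simply encodes the
job name as a URL query parameter (this is the same URL the link above points
to):

```text
https://go.k8s.io/triage?pr=1&job=ci-kubernetes-node-kubelet-features
```

Pasting a link like this into a GitHub issue gives test maintainers a one-click
view of the recent failures for that job.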
At the time of this writing, we saw the following:

<img src="./testgrid-images/triage.png" height="50%" width="100%">

**Note**: You can also refine your query by filtering or excluding results
based on test name or failure text.

Sometimes, Triage will help you find patterns to figure out the root cause of
the problem. In this instance, we can also see that this job has been failing
about twice per hour.

### Iterate

Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you saw an email from TestGrid mentioning a job or test failure)
and share it with the appropriate SIG in the Kubernetes Slack.

Don't worry if you are not sure how to debug further or how to resolve the
issue! All issues are unique and require a bit of experience to figure out how
to work on them. In the meantime, reach out to people in Slack or on the
mailing list.

[TestGrid repo]: https://github.com/GoogleCloudPlatform/testgrid
[TestGrid instance]: https://testgrid.k8s.io/

[SIG]: https://github.com/kubernetes/community/blob/master/sig-list.md

[Kubectl client component tests]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L229
[Kubectl logs]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1389
[should be able to retrieve and filter logs]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1412

[Triage Flaky Tests]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake
[Fix Flaky Tests]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+

[Flaky Tests]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#flaky-tests

[sig-release-master-blocking]: https://testgrid.k8s.io/sig-release-master-blocking
[sig-release-master-informing]: https://testgrid.k8s.io/sig-release-master-informing

[past releases]: https://testgrid.k8s.io/sig-release

[create a new issue]: https://github.com/kubernetes/kubernetes/issues/new/choose
[create a new issue - Failing Test]: https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffailing-test&template=failing-test.md
[create a new issue - Flaking Test]: https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Fflake&template=flaking-test.md

[Issues logged as Flaky Tests - not triaged]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake
[Issues logged as Flaky Tests - triaged]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+

[CI Signal Board]: https://github.com/orgs/kubernetes/projects/11

[conformance-ga-only]: https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only
[skew-cluster-latest-kubectl-stable1-gce]: https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce
[gci-gce-ingress]: https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress
[kind-master-parallel]: https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel
[node-kubelet-master]: https://testgrid.k8s.io/sig-release-master-blocking#node-kubelet-master

[Triage]: https://go.k8s.io/triage
[ci-kubernetes-node-kubelet-features in the job field]: https://go.k8s.io/triage?pr=1&job=ci-kubernetes-node-kubelet-features