Commit cc0ddcf

Add Flake Finder Fridays episode 0 show notes
Adds show notes for the first episode of flake finder fridays. Signed-off-by: hasheddan <[email protected]>
1 parent 1a62c1b commit cc0ddcf

  • contributors/devel/sig-release/flake-finders/episodes/000

Lines changed: 141 additions & 0 deletions
# Flake Finder Fridays #0

February 5th 2021 ([Recording](https://youtu.be/Hqlm2h2AEvA))

## Introduction

This is the first episode of Flake Finder Fridays with Dan Mangum and Rob
Kielty.

On the first Friday of every month we will go through an issue that was logged
for a failing or flaking test on the Kubernetes project.

We will review the triage, root cause analysis, and problem resolution for a
test-related issue logged in the past four weeks.

We intend to demo how CI works on the Kubernetes project and also how we
collaborate across teams to resolve test maintenance issues.

## Issue
This is the issue that we are going to look at today:

[[Failing Test] ci-kubernetes-build-canary does not understand
"--platform"](https://github.com/kubernetes/kubernetes/issues/98646)

### Testgrid Dashboard
[build-master-canary](https://testgrid.k8s.io/sig-release-master-informing#build-master-canary)

### Breaking PRs
- [Use buildx in favor of `FROM --platform`
  syntax](https://github.com/kubernetes/kubernetes/pull/98529)
- [Switch to `docker buildx` for conformance
  image](https://github.com/kubernetes/kubernetes/pull/98569)

## Investigation

1. Desire to move from Google-owned infrastructure to Kubernetes community
   infrastructure. Thus the introduction of a **canary** build job to test
   building and pushing artifacts with new infrastructure.
1. Desire to move off of `bootstrap.py` job (currently being used for canary
   job) to `krel` tooling.
1. Separate job existed (`ci-kubernetes-build-no-bootstrap`) that was doing the
   same thing as the canary job, but with `krel` tooling.
1. The `no-bootstrap` job was running smoothly, so [updated to use it for the
   canary job](https://github.com/kubernetes/test-infra/pull/20663).
1. Right before the update, we [switched to using buildx for multi-arch
   images](https://github.com/kubernetes/kubernetes/pull/98529).
1. Job started failing, which showed up in [some interesting
   ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).
1. Triage begins! Issue
   [opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
   management team is pinged in Slack.
1. The `build-master`
   [job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
   still passing though... interesting.
1. Both are eventually calling `make release`, so environment must be different.
1. Let's look inside the images the two jobs run in!

```
docker run -it --entrypoint /bin/bash gcr.io/k8s-testimages/bootstrap:v20210130-12516b2
```

```
docker run -it gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default /bin/bash
```
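
Once a shell is open in either image, a handful of commands is enough to compare the pieces that matter here. The sketch below is an assumption about what is worth checking, not a transcript of the episode: it prints `DOCKER_CLI_EXPERIMENTAL`, the Docker client version, and the Buildx version, and falls back to a placeholder when something is missing. The function name `describe_build_env` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the parts of the build environment that may differ between images.
# describe_build_env is a hypothetical helper name, not part of any tooling.
describe_build_env() {
  echo "DOCKER_CLI_EXPERIMENTAL=${DOCKER_CLI_EXPERIMENTAL:-<unset>}"
  if command -v docker >/dev/null 2>&1; then
    # `docker --version` only queries the client, so no daemon is needed.
    echo "docker client: $(docker --version 2>/dev/null || echo unknown)"
    # Buildx reports its own version, independent of the Docker version.
    echo "buildx: $(docker buildx version 2>/dev/null || echo unavailable)"
  else
    echo "docker client: not installed"
  fi
}

describe_build_env
```

Running the same function in both containers and comparing the output side by side would surface a difference in any of the three values.
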

1. A few directions we could go here:
   1. Update the `k8s-ci-builder` image to use a newer version of Docker
   1. Update the `k8s-ci-builder` image to ensure that
      `DOCKER_CLI_EXPERIMENTAL=enabled` is set
   1. Update the `release.sh` script to set `DOCKER_CLI_EXPERIMENTAL=enabled`

1. Making the `release.sh` script more flexible serves the community better
   because it allows for building in more environments. Would also be good to
   update the `k8s-ci-builder` image for this specific case as well.
1. And we get a new
   [failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!
1. Let's see what is going on in those images again...
1. Why would this cause an error in one but not the other if we have
   `DOCKER_CLI_EXPERIMENTAL=enabled`?
   ([this](https://github.com/docker/buildx/pull/403) is why)
1. In the meantime we went ahead and [re-enabled the bootstrap
   job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of those
   images need them!)
1. Decided to [increase logging
   verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures
   to see if that would give us a clue into what was going wrong (and to remove
   those annoying `quiet currently not implemented` warnings).
1. Job turns green! But how?
1. [Buildx](https://github.com/docker/buildx) is versioned separately from
   Docker itself. Turns out that the `--quiet` flag warning was [actually an
   error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.
1. The `build-master` job was running with buildx `v0.5.1` while the `krel` job
   was running with `v0.4.2`. This meant the quiet flag was causing an error in
   the `krel` job, and removing it alleviated the error.
1. Finished up by once again [removing the `bootstrap`
   job](https://github.com/kubernetes/test-infra/pull/20731).
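
Because Buildx is versioned separately from Docker, a build script can check the Buildx version before relying on behavior that changed between releases. The following is a minimal sketch of such a check, assuming a `sort` that supports `-V` (version sort, as in GNU coreutils). The helper names `version_ge` and `buildx_tolerates_quiet` are hypothetical, and the `v0.5.1` threshold is the one quoted in the notes above.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A >= version B (a leading "v" is ignored).
# Relies on `sort -V`; hypothetical helper for illustration only.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # If B sorts first (or the two are equal), then A >= B.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
}

# Succeeds when the given Buildx version treats --quiet as a warning
# rather than an error (v0.5.1 or later, per the notes above).
buildx_tolerates_quiet() {
  version_ge "$1" "v0.5.1"
}

for v in v0.4.2 v0.5.1; do
  if buildx_tolerates_quiet "$v"; then
    echo "$v: --quiet only warns"
  else
    echo "$v: --quiet is an error"
  fi
done
```

The two versions in the loop are the ones the `krel` and `build-master` jobs were running, so the check separates them the same way the failing and passing jobs did.
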

### Fixes

- [Set DOCKER_CLI_EXPERIMENTAL=enabled for images using
  buildx](https://github.com/kubernetes/kubernetes/pull/98672)
- [Make image build logs verbose if
  necessary](https://github.com/kubernetes/kubernetes/pull/98568)
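
The first fix amounts to making sure the variable is set before any `docker buildx` call. One way to do that in a shell script, without overriding a value the caller already chose, is sketched below. This is an illustration under that assumption, not the contents of the linked PR, and `enable_buildx_cli` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Default DOCKER_CLI_EXPERIMENTAL to "enabled" unless the caller already set it.
# Older Docker CLIs (19.03, for example) only expose `docker buildx` when
# experimental CLI features are enabled.
enable_buildx_cli() {
  export DOCKER_CLI_EXPERIMENTAL="${DOCKER_CLI_EXPERIMENTAL:-enabled}"
}

enable_buildx_cli
echo "DOCKER_CLI_EXPERIMENTAL=${DOCKER_CLI_EXPERIMENTAL}"
```
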

### Test Infra

- [ci-kubernetes-build-canary: Migrate from bootstrap to
  krel](https://github.com/kubernetes/test-infra/pull/20663)
- [releng: Re-enable a bootstrap build job for K8s
  Infra](https://github.com/kubernetes/test-infra/pull/20712)
- [Revert "releng: Re-enable a bootstrap build job for K8s
  Infra"](https://github.com/kubernetes/test-infra/pull/20731)

### Slack Threads

- [kubeadm failing with
  ci/latest](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700)

### Helpful Links

- [Docker Buildx
  Documentation](https://docs.docker.com/buildx/working-with-buildx/)
- [What is Docker Buildkit and What can I use it
  for?](https://brianchristner.io/what-is-docker-buildkit/)
- [Buildx --quiet error](https://github.com/docker/buildx/pull/403)

## Kubernetes Project Resources

Brand new to the project?
- Start here: https://www.kubernetes.dev/

Already set up and interested in maintaining tests?
- Check out [this video](https://www.youtube.com/watch?v=Ewp8LNY_qTg) from
  Jordan Liggitt, who describes strategies and tactics to deflake flaking tests
  ([Jordan's show notes for that
  talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0))

Here's how the CI Signal Team actively monitors CI during a release cycle:
- [A Tour of CI on the Kubernetes
  Project](https://www.youtube.com/watch?v=bttEcArAjUw)
- [Show notes](https://bit.ly/k8s-ci)
