Commit 1bbde30

Merge pull request #5474 from hasheddan/fff
Add Flake Finder Fridays assets
2 parents be1b03d + 9321dd5 commit 1bbde30

2 files changed, +147 -0 lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Flake Finder Fridays

Flake Finder Fridays is a monthly livestream show where we explore recent test
failures and flakes, walk through how they were resolved, and share tips and
tricks for troubleshooting Kubernetes CI issues.
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# Flake Finder Fridays #0

February 5th 2021 ([Recording](https://youtu.be/Hqlm2h2AEvA))

Hosts: [Dan Mangum](https://github.com/hasheddan), [Rob
Kielty](https://github.com/RobertKielty)

## Introduction

This is the first episode of Flake Finder Fridays with Dan Mangum and Rob
Kielty.

On the first Friday of every month we will go through an issue that was logged
for a failing or flaking test on the Kubernetes project.

We will review the triage, root cause analysis, and problem resolution for a
test-related issue logged in the past four weeks.

We intend to demo how CI works on the Kubernetes project and also how we
collaborate across teams to resolve test maintenance issues.

## Issue

This is the issue that we are going to look at today:

[[Failing Test] ci-kubernetes-build-canary does not understand
"--platform"](https://github.com/kubernetes/kubernetes/issues/98646)

### Testgrid Dashboard

[build-master-canary](https://testgrid.k8s.io/sig-release-master-informing#build-master-canary)

### Breaking PRs

- [Use buildx in favor of `FROM --platform` syntax](https://github.com/kubernetes/kubernetes/pull/98529)
- [Switch to `docker buildx` for conformance
image](https://github.com/kubernetes/kubernetes/pull/98569)

## Investigation

1. Desire to move from Google-owned infrastructure to Kubernetes community
infrastructure. Thus the introduction of a **canary** build job to test
building and pushing artifacts with new infrastructure.
1. Desire to move off of `bootstrap.py` job (currently being used for canary
job) to `krel` tooling.
1. Separate job existed (`ci-kubernetes-build-no-bootstrap`) that was doing the
same thing as the canary job, but with `krel` tooling.
1. The `no-bootstrap` job was running smoothly, so [updated to use it for the
canary job](https://github.com/kubernetes/test-infra/pull/20663).
1. Right before the update, we [switched to using buildx for multi-arch
images](https://github.com/kubernetes/kubernetes/pull/98529).
1. Job started failing, which showed up in [some interesting
ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).
1. Triage begins! Issue
[opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
management team is pinged in Slack.
1. The `build-master`
[job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
still passing though... interesting.
1. Both are eventually calling `make release`, so environment must be different.
1. Let's look inside!

```
docker run -it --entrypoint /bin/bash gcr.io/k8s-testimages/bootstrap:v20210130-12516b2
```

```
docker run -it gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default /bin/bash
```
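
Once inside each container, one way to compare the two environments is to print the Docker-related settings side by side. The following is a minimal sketch, not something shown in the episode: `report_docker_env` is a hypothetical helper name, and it assumes a `bash` shell is available in the image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the episode): summarize the Docker-related
# settings that can differ between two build images.
report_docker_env() {
  # Report an unset variable explicitly so it is not mistaken for an empty one.
  echo "DOCKER_CLI_EXPERIMENTAL=${DOCKER_CLI_EXPERIMENTAL:-<unset>}"

  if command -v docker >/dev/null 2>&1; then
    # `docker --version` only inspects the client, so no daemon is needed.
    docker --version
    # Expected to fail when the buildx plugin is missing or not enabled.
    docker buildx version 2>&1 || echo "buildx: not available"
  else
    echo "docker: not installed"
  fi
}

report_docker_env
```

Running the same function in both images and diffing the output makes version or environment mismatches easier to spot than reading each container by eye.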

1. A few directions we could go here:
   1. Update the `k8s-ci-builder` image to use a newer version of Docker
   1. Update the `k8s-ci-builder` image to ensure that
      `DOCKER_CLI_EXPERIMENTAL=enabled` is set
   1. Update the `release.sh` script to set `DOCKER_CLI_EXPERIMENTAL=enabled`

1. Making the `release.sh` script more flexible serves the community better
because it allows for building with more environments. It would also be good to
update the `k8s-ci-builder` image for this specific case.
1. And we get a new
[failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!
1. Let's see what is going on in those images again...
1. Why would this cause an error in one but not the other if we have
`DOCKER_CLI_EXPERIMENTAL=enabled`?
([this](https://github.com/docker/buildx/pull/403) is why)
1. In the meantime we went ahead and [re-enabled the bootstrap
job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of those
images need them!)
1. Decided to [increase logging
verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures to
see if that would give us a clue about what was going wrong (and to remove those
annoying `quiet currently not implemented` warnings).
1. Job turns green! But how?
1. [Buildx](https://github.com/docker/buildx) is versioned separately from
Docker itself. Turns out that the `--quiet` flag warning was [actually an
error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.
1. The `build-master` job was running with buildx `v0.5.1` while the `krel` job
was running with `v0.4.2`. This meant the quiet flag was causing an error in the
`krel` job, and removing it resolved the error.
1. Finished up by once again [removing the `bootstrap`
job](https://github.com/kubernetes/test-infra/pull/20731).
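
The version difference at the heart of this investigation can be expressed as a small check. This is a sketch under stated assumptions, not tooling used by either job: `version_at_least` and `quiet_flag_behavior` are hypothetical names, the `v0.5.1` threshold is taken from the notes above, and the comparison relies on `sort -V` (version sort), which is assumed to be available as it is in GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of krel, release.sh, or Buildx).

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Prints how `--quiet` is treated for a given Buildx version, based on the
# notes above: an error before v0.5.1, a warning from v0.5.1 onward.
quiet_flag_behavior() {
  local version="${1#v}" # accept both "v0.4.2" and "0.4.2"
  if version_at_least "$version" "0.5.1"; then
    echo "warning"
  else
    echo "error"
  fi
}

quiet_flag_behavior v0.4.2 # version reported for the krel job: prints "error"
quiet_flag_behavior v0.5.1 # version reported for build-master: prints "warning"
```

A check like this could guard the use of `--quiet` in a build script, though simply dropping the flag, as the fix did, avoids depending on the Buildx version at all.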

### Fixes

- [Set DOCKER_CLI_EXPERIMENTAL=enabled for images using
buildx](https://github.com/kubernetes/kubernetes/pull/98672)
- [Make image build logs verbose if
necessary](https://github.com/kubernetes/kubernetes/pull/98568)
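
The idea behind the first fix is that a build script should not rely on the surrounding image having `DOCKER_CLI_EXPERIMENTAL` set. A minimal sketch of that idea follows; it is not the change made in the linked PR, and `with_buildx_enabled` is a hypothetical name introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not the code from the linked PR): run one command with
# DOCKER_CLI_EXPERIMENTAL=enabled, without changing the caller's environment.
with_buildx_enabled() {
  DOCKER_CLI_EXPERIMENTAL=enabled "$@"
}

# Demonstration with a harmless command; a build script would instead wrap
# its `docker buildx build` invocation.
with_buildx_enabled printenv DOCKER_CLI_EXPERIMENTAL
```

Scoping the variable to a single invocation keeps the script working in images that never set it, while leaving any value the caller exported untouched for other commands.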

### Test Infra

- [ci-kubernetes-build-canary: Migrate from bootstrap to
krel](https://github.com/kubernetes/test-infra/pull/20663)
- [releng: Re-enable a bootstrap build job for K8s
Infra](https://github.com/kubernetes/test-infra/pull/20712)
- [Revert "releng: Re-enable a bootstrap build job for K8s
Infra"](https://github.com/kubernetes/test-infra/pull/20731)

### Slack Threads

- [kubeadm failing with
ci/latest](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700)

### Helpful Links

- [Docker Buildx
Documentation](https://docs.docker.com/buildx/working-with-buildx/)
- [What is Docker Buildkit and What can I use it
for?](https://brianchristner.io/what-is-docker-buildkit/)
- [Buildx --quiet error](https://github.com/docker/buildx/pull/403)

## Kubernetes Project Resources

Brand new to the project?
- Start here: https://www.kubernetes.dev/

Set up already and interested in maintaining tests?
- Check out [this video](https://www.youtube.com/watch?v=Ewp8LNY_qTg) from
Jordan Liggitt, who describes strategies and tactics to deflake flaking tests
([Jordan's show notes for that
talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0))

Here's how the CI Signal Team actively monitors CI during a release cycle:
- [A Tour of CI on the Kubernetes
Project](https://www.youtube.com/watch?v=bttEcArAjUw)
- [Show notes](https://bit.ly/k8s-ci)
