February 5th 2021 ([Recording](https://youtu.be/Hqlm2h2AEvA))

Hosts: [Dan Mangum](https://github.com/hasheddan), [Rob
Kielty](https://github.com/RobertKielty)

## Introduction

This is the first episode of Flake Finder Fridays with Dan Mangum and Rob
Kielty, where we take a look at a test related issue logged in the past four
weeks.

We intend to demo how CI works on the Kubernetes project and also how we
collaborate across teams to resolve test maintenance issues.

## Issue

This is the issue that we are going to look at today ...

[[Failing Test] ci-kubernetes-build-canary does not understand
"--platform"](https://github.com/kubernetes/kubernetes/issues/98646)

### Testgrid Dashboard

[build-master-canary](https://testgrid.k8s.io/sig-release-master-informing#build-master-canary)

### Breaking PRs
## Investigation

1. Desire to move from Google-owned infrastructure to Kubernetes community
   infrastructure. Thus the introduction of a **canary** build job to test
   building and pushing artifacts with new infrastructure.

1. Desire to move off of the `bootstrap.py` job (currently being used for the
   canary job) to `krel` tooling.

1. A separate job existed (`ci-kubernetes-build-no-bootstrap`) that was doing
   the same thing as the canary job, but with `krel` tooling.

1. The `no-bootstrap` job was running smoothly, so we [updated to use it for
   the canary job](https://github.com/kubernetes/test-infra/pull/20663).

1. Right before the update, we [switched to using buildx for multi-arch
   images](https://github.com/kubernetes/kubernetes/pull/98529).

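For context, buildx is what introduces the `--platform` flag from the issue title; plain `docker build` does not understand it. A rough sketch of the invocation shape follows — the image name and platform list are illustrative, not the job's exact flags, and we only assemble and print the command here since actually running it needs a Docker daemon:

```shell
# Illustrative shape of a multi-arch buildx invocation (image name and
# platforms are examples, not the CI job's exact flags). We assemble the
# command as a string instead of executing it, since running it would
# need a Docker daemon with buildx available.
PLATFORMS="linux/amd64,linux/arm64"
IMAGE="example.com/kube-build:latest"
BUILDX_CMD="docker buildx build --platform ${PLATFORMS} --tag ${IMAGE} --push ."
echo "${BUILDX_CMD}"
```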
1. Job started failing, which showed up in [some interesting
   ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).

1. Triage begins! Issue
   [opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
   management team is pinged in Slack.

1. The `build-master`
   [job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
   still passing though... interesting.

1. Both are eventually calling `make release`, so the environment must be
   different.

1. Let's look inside!

   ```
   docker run -it --entrypoint /bin/bash gcr.io/k8s-testimages/bootstrap:v20210130-12516b2
   ```

   ```
   docker run -it gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default /bin/bash
   ```

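One way to compare the two containers is to diff their environments. In the sketch below, the commented-out `docker run` lines use the real images from above but need a Docker daemon, so small sample listings stand in for their output to show the technique itself:

```shell
# Sketch: diff the environments of the two build images. The real commands
# (commented out) use the images from the notes above; the sample files
# stand in for their output so the comparison can run anywhere.
# docker run --rm gcr.io/k8s-testimages/bootstrap:v20210130-12516b2 env | sort > bootstrap.env
# docker run --rm gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default env | sort > krel.env
printf 'DOCKER_CLI_EXPERIMENTAL=enabled\nPATH=/usr/bin\n' > bootstrap.env  # sample values
printf 'PATH=/usr/bin\n' > krel.env                                        # sample values
comm -3 bootstrap.env krel.env   # prints lines present in only one file
```

`comm -3` on the sorted listings surfaces exactly the variables that differ between the two environments.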
1. A few directions we could go here:

   1. Update the `k8s-ci-builder` image to use a newer version of Docker

   1. Update the `k8s-ci-builder` image to ensure that
      `DOCKER_CLI_EXPERIMENTAL=enabled` is set

   1. Update the `release.sh` script to set `DOCKER_CLI_EXPERIMENTAL=enabled`

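The third option amounts to something like the following. `DOCKER_CLI_EXPERIMENTAL` is the real Docker CLI switch that enables `docker buildx` on older clients; the guard itself is our sketch, not the actual patch:

```shell
# Sketch of the release.sh approach: the build script guarantees buildx is
# available regardless of what the image set. DOCKER_CLI_EXPERIMENTAL is the
# real Docker CLI switch; this guard is illustrative, not the actual change.
if [ -z "${DOCKER_CLI_EXPERIMENTAL:-}" ]; then
  export DOCKER_CLI_EXPERIMENTAL=enabled
fi
echo "DOCKER_CLI_EXPERIMENTAL=${DOCKER_CLI_EXPERIMENTAL}"
```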
1. Making the `release.sh` script more flexible serves the community better
   because it allows for building with more environments. Would also be good
   to update the `k8s-ci-builder` image for this specific case as well.

1. And we get a new
   [failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!

1. Let's see what is going on in those images again...

1. Why would this cause an error in one but not the other if we have
   `DOCKER_CLI_EXPERIMENTAL=enabled`?
   ([this](https://github.com/docker/buildx/pull/403) is why)

1. In the meantime we went ahead and [re-enabled the bootstrap
   job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of
   those images need them!)

1. Decided to [increase logging
   verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures
   to see if that would give us a clue into what was going wrong (and to
   remove those annoying `quiet currently not implemented` warnings).

1. Job turns green! But how?

1. [Buildx](https://github.com/docker/buildx) is versioned separately from
   Docker itself. Turns out that the `--quiet` flag warning was [actually an
   error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.

1. The `build-master` job was running with buildx `v0.5.1` while the `krel`
   job was running with `v0.4.2`. This meant the quiet flag was causing an
   error in the `krel` job, and removing it alleviated the error.

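That version boundary can be checked mechanically. A sketch — the function name is ours, and it assumes `sort -V` (GNU/BusyBox coreutils) for version ordering:

```shell
# Sketch: is a given buildx version older than v0.5.1, i.e. does it still
# treat --quiet as a hard error (per docker/buildx#403)? The function name
# is ours; sort -V (GNU coreutils) does the version ordering.
quiet_flag_is_fatal() {
  version="$1"     # e.g. "v0.4.2"
  fixed="v0.5.1"   # release that downgraded the error to a warning
  lowest=$(printf '%s\n%s\n' "$version" "$fixed" | sort -V | head -n 1)
  # Fatal iff the version sorts strictly before the fixed release.
  [ "$lowest" = "$version" ] && [ "$version" != "$fixed" ]
}

quiet_flag_is_fatal "v0.4.2" && echo "v0.4.2 fails on --quiet"
quiet_flag_is_fatal "v0.5.1" || echo "v0.5.1 only warns"
```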
1. Finished up by once again [removing the `bootstrap`
   job](https://github.com/kubernetes/test-infra/pull/20731).

### Fixes
Brand new to the project?

Setup already and interested in maintaining tests?

Check out [this video](https://www.youtube.com/watch?v=Ewp8LNY_qTg) from
Jordan Liggitt, who describes strategies and tactics to deflake flaky tests
([Jordan's show notes for that
talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0)).

Here's how the CI Signal Team actively monitors CI during a release cycle:
- [A Tour of CI on the Kubernetes