# 2020 WG K8s Infra Annual Report

## You and Your Role

**When did you become a chair and do you enjoy the role?**

- **bartsmykla**: February 2020, and I enjoy the role
- **dims**: along with spiffxp, been there right from the beginning. Lately
  having some conflicts on the meeting time, but definitely enjoying the process
- **spiffxp**: Was a chair (organizer?) at the group’s formation. I enjoy the
  role when I have time to dedicate to it.

**What do you find challenging?**

- **bartsmykla**: As our working group’s efforts span multiple SIGs, and there
  are multiple places, tools, and repositories involved in moving things
  forward, I sometimes feel overwhelmed and anxious about not understanding
  some of the tools (Prow, for example). What is also hard is that I don’t feel
  I have enough access and knowledge to speed things up in relation to the Prow
  migration.
- **dims**: takes too long :) finding/building a coalition is hard. Trying hard
  to avoid everything being done by a small set of folks, but not doing too
  well on that front.
- **spiffxp**: Prioritizing this group’s work, and the work necessary to tend to
  this group’s garden (by that I mean weeding/planting workstreams,
  building/smoothing onramps). Work that usually takes precedence is related to
  company-internal priorities, SIG Testing, and kubernetes/kubernetes fires. I
  often find myself unprepared for meetings unless I have been actively pushing
  a specific item in the interim. Very rarely am I sufficiently aware of the
  group’s activity as a whole to drive effectively.

**Do you have goals for the group?**

- **bartsmykla**: My goal would be to improve documentation so it is easier for
  people to understand what is happening where, and which tools and resources
  are being used for which efforts
- **dims**: breaking things up into small chunks that can be easily farmed out
- **spiffxp**: The thing I care most about is community ownership of
  prow.k8s.io, including on-call. If possible, I would like to see the group’s
  mission through to completion, ensuring that all project infrastructure is
  community-owned and maintained.

**Do you want to continue or find a replacement? If you feel that you aren’t
ready to pass the baton, what would you like to accomplish before you do?**

- **bartsmykla**: I would like to continue
- **dims**: Happy to if folks show up who can take over. Always on the lookout.
- **spiffxp**: I personally want to continue, but sometimes wonder if my
  best-effort availability is doing the group a disservice. If there’s a
  replacement and I’m the impediment, I’m happy to step down. The ideal
  replacement or a dedicated TL would have:
  - ability to craft and build consensus on operational policies, and lead
    their implementation
  - ability to identify cost hotspots, and lead or implement cost-reduction
    solutions
  - ability to identify security vulnerabilities or operational sharp edges
    (e.g. no backups, easy accidents), and lead or implement mitigations
  - familiarity with GCP and Kubernetes
  - ability to document/understand how existing project infra is wired together
    (e.g. could fix https://github.com/kubernetes/test-infra/issues/13063)

**Is there something we can provide that would better support you?**

- **bartsmykla**: I can’t think of anything right now
- **dims**: what spiffxp said!
- **spiffxp**: TBH I feel like a lot of what we need to make this group as
  active/healthy as I would like needs to come from us. For example, I don’t
  think a dedicated PM would help without a dedicated TL. I’m not sure how to
  more effectively motivate our contributing companies to prioritize this work.
  I have pined in the past for dedicated contractors paid by the CNCF for this,
  but I think that could just as easily be fulfilled by contributing companies
  agreeing to staff this.

**Do you have feedback for Steering? Suggestions for what we should work on?**

- **bartsmykla**: I can’t think of anything right now
- **dims**: yep, talking to CNCF proactively and formulating a plan.
- **spiffxp**: I think there are three things Steering could help with:
  - Policy guidance from Steering on what is in-scope / out-of-scope for
    Kubernetes’ project-infrastructure budget (e.g. mirroring
    dependency/ecosystem projects like cert-manager [1], CI jobs). It might
    better drive billing requirements, and make it easier/quicker to decide what
    is appropriate to pursue. At the moment we’re using our best judgement, and
    I trust it, but I sometimes feel like we’re flying blind or making stuff up.
    As far as existing spend and budgeting, we don’t have
    quotas/forecasts/alerts; we’re mostly hoping everyone is on their best
    behavior until something seems outsized, at which point it’s case-by-case on
    what to do.
  - I think it would be helpful to get spend on platforms other than Google
    above-the-table, and driven through this group. I know how much money Google
    has provided, and I know where it’s being spent (though not to the
    granularity of per-SIG). I lack the equivalent for other companies “helping
    out” (e.g. AWS, Microsoft, DigitalOcean).
  - This is not a concrete request that can be acted upon now, but I anticipate
    we will want to reduce costs by ensuring that other clouds or large entities
    participate in mirroring Kubernetes artifacts.

## Working Group

**What was the initial mission of the group and if it's changed, how?**

The initial mission was to migrate Kubernetes project infrastructure to the
CNCF, and to create teams and processes to support ongoing maintenance.

There has been a slight growth in scope in that new infrastructure that
previously didn't exist is proposed and managed under this group. Examples
include:
- binary-artifact promotion (the project only had image promotion internally;
  it now exists externally, and we are attempting to expand it to binary
  artifacts)
- [running triage-party for SIG Release](https://github.com/kubernetes/k8s.io/issues/906)
  (didn't exist until this year)
- [build infrastructure for windows-based images](https://docs.google.com/document/d/16VBfsFMynA7tObzuZGPpw-sKDKfFc_T5W_E4IeEIaOQ/edit#bookmark=id.3w0g7fo9cp7m)
- [image vulnerability dashboard](https://docs.google.com/document/d/16VBfsFMynA7tObzuZGPpw-sKDKfFc_T5W_E4IeEIaOQ/edit#bookmark=id.s3by3vki8jer)
  (it's not clear to me whether even Google had this internally before)
- [sharding out / scaling up gitops-based Google Group management](https://docs.google.com/document/d/16VBfsFMynA7tObzuZGPpw-sKDKfFc_T5W_E4IeEIaOQ/edit#bookmark=id.ou5hk544r70m)

**What’s the current roadmap until completion?**

What has been migrated:
- DNS for kubernetes.io, k8s.io
- container images hosted on k8s.gcr.io
- node-perf-dash.k8s.io
- perf-dash.k8s.io
- publishing-bot
- slack-infra
- 288 / 1780 prow jobs
- GCB projects used to create kubernetes/kubernetes releases
  (except .deb/.rpm packages)

What remains (TODO: we need to update our issues to reflect this):
- migrate .deb/.rpm package building/hosting to community
  (this would be owned by SIG Release)
  - stop using google-internal tool "rapture"
  - come up with signing keys the community agrees to host/trust
  - migrate apt.kubernetes.io to community
- stop using google-containers GCP project (this would be owned by SIG Release)
  - gs://kubernetes-release, dl.k8s.io
  - [gs://kubernetes-release-dev](https://github.com/kubernetes/k8s.io/issues/846)
- stop using k8s-prow GCP project (this would be owned by SIG Testing)
  - prow.k8s.io
  - ensure community-staffed on-call can support it
- stop using k8s-prow-build GCP project (this would be owned by SIG Testing)
  - 288/1780 jobs migrated out thus far
  - ensure community-staffed on-call can support it
- [stop using k8s-gubernator GCP project](https://github.com/kubernetes/k8s.io/issues/1308)
  (this would be owned by SIG Testing)
  - migrate/replace gubernator.k8s.io/pr (triage-party?), drop gubernator.k8s.io
  - [migrate kettle](https://github.com/kubernetes/k8s.io/issues/787)
  - [migrate k8s-gubernator:builds dataset](https://github.com/kubernetes/k8s.io/issues/1307)
  - [migrate triage.k8s.io](https://github.com/kubernetes/k8s.io/issues/1305)
  - [migrate gs://k8s-metrics](https://github.com/kubernetes/k8s.io/issues/1306)
- stop using kubernetes-jenkins GCP project (this would be owned by SIG Testing)
  - gs://kubernetes-jenkins (all CI artifacts/logs for prow.k8s.io jobs)
  - sundry other GCS buckets (gs://k8s-kops-gce, gs://kubernetes-staging*)
- [stop using k8s-federated-conformance GCP project](https://github.com/kubernetes/k8s.io/issues/1311)
  (this would be owned by SIG Testing)
  - migrate to CNCF-owned k8s-conform (rename/copy sundry GCS buckets,
    distribute new service account keys)
- [stop using k8s-testimages GCP project](https://github.com/kubernetes/k8s.io/issues/1312)
  (this could be owned either by SIG Testing or SIG Release)
  - migrate images used by CI jobs (kubekins, bazel-krte, gcloud, etc.)
  - migrate test-infra components (kettle, greenhouse, etc.)
  - (this may push us toward [limited/lifecycle-based retention of images, which
    GCR does not natively have](https://github.com/kubernetes/k8s.io/issues/525))
- stop using kubernetes-site GCP project (unsure, maybe SIG ContribEx or SIG
  Docs depending)
  - ???
- ensure SIG ownership of all infra and services
  - must be supportable by non-Google community members
  - ensure critical contributor user journeys are well documented for each service

**Have you produced any artifacts, reports, white papers to date?**

We provide a [publicly viewable billing report](https://datastudio.google.com/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e)
accessible to members of [email protected].
The project was given $3M/yr for 3 years, and our third year started ~August 2020.
Our spend over the past 28 days has been ~$109k, which works out to ~$1.42M/yr.
A very rough breakdown of the $109k:
- $74k - k8s-artifacts-prod* (~ k8s.gcr.io)
- $34k - k8s-infra-prow*, k8s-infra-e2e*, k8s-staging* (~ project CI thus far, follows kubernetes/kubernetes traffic)
- $0.7k - kubernetes-public (~ everything else)
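
The run-rate figure above is a simple extrapolation; the sketch below shows the
arithmetic. The $109k/28-day and $3M/yr figures come from this report, the
variable names are mine, and the flat-spend assumption is naive (CI spend
follows kubernetes/kubernetes traffic):

```python
# Back-of-envelope annualization of the reported 28-day spend.
# Assumes spend stays flat, which it will not in practice.

spend_28_days = 109_000       # USD, trailing 28 days (from billing report)
budget_per_year = 3_000_000   # USD/yr, the CNCF grant

annualized = spend_28_days / 28 * 365

print(f"annualized run rate: ${annualized:,.0f}")            # ~ $1.42M/yr
print(f"fraction of budget: {annualized / budget_per_year:.0%}")
```

At this rate we are spending roughly half of the available budget, which is
consistent with the "hoping everyone is on their best behavior" approach above.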

**Is everything in your readme accurate? Posting meetings on YouTube?**

Our community
[readme](https://github.com/kubernetes/community/tree/master/wg-k8s-infra) is
accurate if sparse. The
[readme](https://github.com/kubernetes/k8s.io/blob/master/README.md) in k8s.io,
which houses most of the actual infrastructure, is terse and slightly out of
date (missing triage-party).

[We are having problems with our zoom automation](https://github.com/kubernetes/community/issues/5199),
causing [our youtube playlist](https://www.youtube.com/playlist?list=PL69nYSiGNLP2Ghq7VW8rFbMFoHwvORuDL)
to fall out of date; I noticed while writing this report and have gotten help
backfilling. We're currently missing 2020-10-14.

**Do you have regular check-ins with your sponsoring SIGs?**

No formal reporting in either direction. Meetings/slack/issues see active
participation from @spiffxp (SIG Testing chair), and occasional participation
from @justaugustus (SIG Release) and @nikhita (SIG Contributor Experience). We
also see participation on slack/issues/PRs from @dims (SIG Architecture), who
has a schedule conflict.