
Container Image Promoter: The State of Things Today


UPDATE

Please refer to https://github.com/kubernetes/k8s.io/blob/master/k8s.gcr.io/Vanity-Domain-Flip.md for the latest updates.

Overview

There are 2 promoter systems today:

  1. An internal promotion process promotes (copies) images from gcr.io/k8s-image-staging to k8s.gcr.io (which redirects to gcr.io/google-containers)
  2. An OSS promotion process promotes images from various staging repos to gcr.io/k8s-artifacts-prod

Both processes work in a declarative manner: the promoter reads a manifest file (YAML) describing which images should be copied over, and performs the copy. Modifications to the promoter manifests are made internally for the internal promoter, and publicly (via pull requests on GitHub) for the OSS one.
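
To illustrate the declarative flow, here is a minimal Go sketch that parses a promoter-style manifest. The schema is a simplified approximation of the promoter's manifest format, and the registry/image names and digest are made up:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v2"
)

// Simplified approximation of a promoter manifest schema; the real
// promoter defines its own, richer types.
type Manifest struct {
	Registries []Registry `yaml:"registries"`
	Images     []Image    `yaml:"images"`
}

type Registry struct {
	Name           string `yaml:"name"`
	Src            bool   `yaml:"src,omitempty"`
	ServiceAccount string `yaml:"service-account,omitempty"`
}

type Image struct {
	Name string `yaml:"name"`
	// dmap: digest -> tags that should point at that digest.
	DMap map[string][]string `yaml:"dmap"`
}

func main() {
	// Hypothetical manifest: promote one image from a staging repo to prod.
	data := []byte(`
registries:
- name: gcr.io/k8s-staging-example
  src: true
- name: gcr.io/k8s-artifacts-prod/example
  service-account: [email protected]
images:
- name: example-image
  dmap:
    "sha256:0000000000000000000000000000000000000000000000000000000000000000": ["v1.0.0"]
`)
	var m Manifest
	if err := yaml.Unmarshal(data, &m); err != nil {
		log.Fatalf("parsing manifest: %v", err)
	}
	for _, img := range m.Images {
		for digest, tags := range img.DMap {
			fmt.Printf("would promote %s@%s with tags %v\n", img.Name, digest, tags)
		}
	}
}
```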

The goal is to shut down the internal promotion process which requires Googler approvals, and replace it with the OSS promotion process. As part of this switch, all of the images in gcr.io/google-containers would have to be migrated over to gcr.io/k8s-artifacts-prod.

Life of a Promoter Manifest PR

  1. When a PR against the OSS promoter manifests is created, a pull-k8sio-cip Prow job runs and performs a dry run to detect what new images will be promoted.
  2. When the PR is merged, the post-k8sio-cip job runs to perform the actual promotion (not dry run).

As a sanity check, a ci-k8sio-cip job also runs daily. It is like post-k8sio-cip, except that it runs every day regardless of whether a PR was merged.

Remaining Tasks

There are 2 areas that need to be addressed before the switch from the internal to the OSS promoter can be done:

  1. backups
  2. auditing

These are addressed below.

Backups

Design doc: https://docs.google.com/document/d/1od5y-Z2xP9mVmg2Yztnv-GQ7D-orj9HsTmeVvNHkzzA/edit#heading=h.fvtprlh40nl2 (Must be a member of [email protected] to see/comment).

The key idea is to store backups prefixed by timestamp in a single backup GCR (named gcr.io/k8s-artifacts-prod-bak), and to let GCR de-duplicate the underlying image data (which keeps backup operations fast and reduces storage costs).
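
A minimal Go sketch of that idea, assuming a gcr.io/k8s-artifacts-prod-bak/&lt;timestamp&gt;/&lt;image&gt; layout and using go-containerregistry's crane package (this is not the actual backup implementation; the image list and timestamp format are assumptions):

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Hypothetical list of production images to back up.
	images := []string{
		"gcr.io/k8s-artifacts-prod/example/example-image:v1.0.0",
	}

	// One timestamp per backup run; every image in this run is copied
	// under it. GCR stores layers content-addressed, so layers shared
	// with earlier backups are de-duplicated automatically.
	ts := time.Now().UTC().Format("2006-01-02T15-04-05Z")

	for _, src := range images {
		suffix := strings.TrimPrefix(src, "gcr.io/k8s-artifacts-prod/")
		dst := fmt.Sprintf("gcr.io/k8s-artifacts-prod-bak/%s/%s", ts, suffix)
		// crane.Copy pulls the image from src and pushes it to dst,
		// preserving the digest.
		if err := crane.Copy(src, dst); err != nil {
			log.Fatalf("backing up %s: %v", src, err)
		}
		log.Printf("backed up %s -> %s", src, dst)
	}
}
```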

It was decided that the backups should be very simple. To this end, a PR was created (and merged!); the resulting backup job currently runs every hour.

The remaining items to do are:

  1. get https://github.com/kubernetes/test-infra/pull/15398 merged
  2. add a basic test to ensure that backups are actually being created (we want to avoid a situation where the hourly backups exit with status 0, but no backups actually get created)

For the 2nd item, we can probably use either cip's snapshotting or gcrane's ls subcommand to get a listing of all items under a particular backup timestamp, and then check that the list is non-empty, or at least as large as the previous snapshot's. Something along these lines might work:
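
For example, here is a minimal Go sketch using go-containerregistry's GCR listing support (the timestamp path and the non-empty threshold are assumptions; a real check would compare the count against the previous snapshot):

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/google"
)

func main() {
	// Hypothetical: the timestamp prefix of the backup run to verify.
	root, err := name.NewRepository("gcr.io/k8s-artifacts-prod-bak/2020-05-07T00-00-00Z")
	if err != nil {
		log.Fatal(err)
	}

	// Walk every child repository under the timestamp prefix,
	// counting the image manifests found (similar in spirit to
	// gcrane's ls subcommand).
	count := 0
	err = google.Walk(root, func(repo name.Repository, tags *google.Tags, err error) error {
		if err != nil {
			return err
		}
		count += len(tags.Manifests)
		return nil
	}, google.WithAuthFromKeychain(google.Keychain))
	if err != nil {
		log.Fatalf("listing backup: %v", err)
	}

	// A real check would compare against the previous snapshot's
	// count; here we only require a non-empty backup.
	if count == 0 {
		log.Fatal("backup verification failed: no images found")
	}
	log.Printf("backup contains %d images", count)
}
```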

Auditing

Design doc: https://docs.google.com/document/d/1LElgTfYB8ZdTsbmESk7Q1lhLaiHVXENwoEL-MHaFwNQ/edit?usp=sharing (Must be a member of [email protected] to see/comment).

The key idea is to listen to all changes to GCR as they occur. At a high level, this auditing process would:

  1. listen to Cloud Pub/Sub events emitted by gcr.io/k8s-artifacts-prod for state changes (new images, modified images, etc.)
  2. audit the state change (Pub/Sub message) against the promoter manifests
  3. alert humans (Slack or some other paging mechanism) if there is a problem

It was decided that this auditing mechanism would only alert humans and not try to resolve issues on its own (i.e., it is read-only as far as GCR is concerned).
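
To make the shape of this concrete, here is a minimal Go sketch of such a read-only auditor as a Cloud Run-style HTTP service. The payload fields follow GCR's documented Pub/Sub notification format, but the manifest lookup and the alerting are stubbed out for illustration:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Pub/Sub push envelope delivered to the service over HTTP.
type pushEnvelope struct {
	Message struct {
		Data []byte `json:"data"` // base64; decoded by encoding/json
	} `json:"message"`
}

// GCR notification payload carried in the Pub/Sub message.
type gcrEvent struct {
	Action string `json:"action"` // e.g. INSERT, DELETE
	Digest string `json:"digest"` // e.g. gcr.io/repo/image@sha256:...
	Tag    string `json:"tag,omitempty"`
}

// knownDigests would be built from the promoter manifests; stubbed here.
var knownDigests = map[string]bool{}

func handle(w http.ResponseWriter, r *http.Request) {
	var env pushEnvelope
	if err := json.NewDecoder(r.Body).Decode(&env); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var ev gcrEvent
	if err := json.Unmarshal(env.Message.Data, &ev); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Read-only audit: compare the state change against the manifests
	// and alert a human if it was not an expected promotion.
	if ev.Action == "INSERT" && !knownDigests[ev.Digest] {
		log.Printf("ALERT: unexpected image change: %+v", ev) // would page Slack etc.
	}
	w.WriteHeader(http.StatusOK) // 2xx acks the Pub/Sub push message
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```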

There is a PR open with an initial implementation. After it is merged, 2 more things need to happen:

  1. create Prow presubmit jobs that actually test the auditing code using the full suite of GCP resources on a fake production project (GCR, Pub/Sub, Cloud Run deployment of the auditing image)
  2. deploy the auditing image to production

Other

For additional context, see the README.md: https://github.com/kubernetes-sigs/k8s-container-image-promoter#container-image-promoter

The e2e tests for the promoter itself run in the following projects:

  • k8s-cip-test-prod
  • k8s-staging-cip-test