[CI] Add design document for post submit testing #512

114 changes: 114 additions & 0 deletions premerge/post-submit-testing.md
# Post Submit Testing

## Introduction

While this infrastructure is focused on premerge testing, it is also important
to make sure that the specific configuration we test premerge is tested post
commit as well. This document outlines the motivation for testing this
configuration post commit, why we chose this design over the alternatives, and
how we plan to implement it so that we get fast feedback at scale.

## Background/Motivation

It is important that we test the premerge configuration postcommit as well.
This enables easily checking the state of `main` at any given time and
pinpointing exactly where `main` broke, which in turn makes it possible to
decide which commit to revert or fix forward to get everything back to green.
Having information on the state of `main` is also important for certain kinds
of automation, like the planned premerge testing advisor that will let
contributors know when tests failing in their PR are already failing at `main`,
in which case it should be safe to merge despite those failures.

Originally, we were looking at running postcommit testing through Github
Actions as well. This was primarily because it would be easy (a single-line
change in the Github Actions workflow config) and would integrate easily
with the Github API for implementing the premerge testing advisor.
More detailed motivation for doing postcommit testing directly through
Github is available in the [discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124)
where we proposed doing this. We eventually decided against this approach
for a few reasons:

1. Nonstandard - The standard postcommit testing infrastructure for LLVM is
Buildbot. Doing postcommit testing for the premerge configuration
through Github would represent a significant departure from this. It would
mean leaving behind common infrastructure and forcing a new, unfamiliar
postcommit interface on LLVM contributors.
2. Notifications - This is the biggest issue. Github currently gives very
little control over the notifications that are sent out when a build
fails or gets cancelled. In particular, Github sends out notifications for
build failures even if the previous build already failed. This can easily
create warning fatigue, which we are putting a lot of effort into avoiding
so that the premerge system is perceived as reliable, people trust its
results, and, most importantly, people pay attention to failures when they
do occur because those failures are caused by the patch whose author gets
the notification and are actionable.
3. Customization - Buildbot can be customized around issues like notifications,
whereas Github cannot. Github is not particularly responsive to feature
requests and its notification story has been poor for a while, so its
lack of customization is a strategic risk.

## Design

The overall design involves using an annotated builder that will be deployed
on both the central and west clusters for an HA (high availability)
configuration. The annotated builder will consist of a script that runs builds
inside Kubernetes pods to enable autoscaling.

In terms of the full flow, a commit pushed to the LLVM monorepo will be
detected by the Buildbot master. The Buildbot master will invoke Buildbot
workers running on our clusters. These Buildbot workers will use annotated
builders to launch a build wrapped in a Kubernetes pod and report the results
back to the Buildbot master. When the job is finished, the pod will complete
and its capacity will become available for another build. If there is nothing
left to test, GKE will see that nothing is running on one of the nodes and
downscale the node pool.
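The scale-down step at the end of this flow can be sketched as a small model.
This is not GKE's actual autoscaler logic; it is an illustrative simplification
of the idea that a node with no build pods left on it becomes a candidate for
removal.

```python
# Illustrative model of the downscaling decision described above: once all
# build pods on a node have completed, that node is eligible for removal.
# This is a sketch, not GKE's real cluster-autoscaler algorithm.

def downscale_candidates(pods_by_node: dict[str, list[str]]) -> list[str]:
    """Return the nodes that have no build pods scheduled on them."""
    return [node for node, pods in pods_by_node.items() if not pods]

# Example: the build on node-b finished, so only node-a still has work.
cluster = {
    "node-a": ["build-clang-1234"],  # still running a build
    "node-b": [],                    # idle: eligible for downscaling
}
print(downscale_candidates(cluster))  # ['node-b']
```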

### Annotated Builder

llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder.
We need the flexibility of the AnnotatedBuilder (essentially a custom python
file that runs the build) to deploy jobs on the cluster. AnnotatedBuilder-based
builders also enable deploying changes without needing to restart the buildbot
master. Without this, we would have to wait for an administrator of the LLVM
buildbot master to restart it before our changes get deployed. That could
significantly delay updates or responses to incidents, especially early on,
before the system is fully stable and while we still need to modify it
relatively frequently.

### Build Distribution

We want to be able to take advantage of the autoscaling functionality of the
new cluster to efficiently utilize resources. To do this, we plan on having the
AnnotatedBuilder script launch builds as Kubernetes pods. This lets
Kubernetes assign the builds to nodes and enables autoscaling through
the same mechanism that Github ARC (the software running our premerge jobs
on the cluster) uses to autoscale. As a result, we can quickly
process builds at peak times and not pay for extra capacity when commit
traffic is quiet. This is essential for ensuring our resource use is
efficient while still providing fast feedback.

Using the Kubernetes API inside a python script to launch builds does add
some complexity. However, we do not believe we need much added complexity
to achieve our goal here, and it allows for vastly more efficient resource
usage.
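The pod launch could look roughly like the sketch below. All names (image,
namespace, labels, script path) are illustrative assumptions, not the real
configuration.

```python
# Hypothetical sketch of how the AnnotatedBuilder script might describe a
# build pod. The image name, namespace, and command are placeholders.

def make_build_pod_manifest(commit_sha: str) -> dict:
    """Build a pod manifest that runs the premerge scripts for one commit."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"postcommit-build-{commit_sha[:12]}",
            "labels": {"app": "postcommit-build"},
        },
        "spec": {
            # "Never" restart: a completed pod frees capacity so GKE can
            # downscale the node pool once nothing is left to test.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "build",
                    "image": "llvm-premerge-linux:latest",  # assumed name
                    "command": ["bash", "monolithic-linux.sh"],
                    "env": [{"name": "COMMIT_SHA", "value": commit_sha}],
                }
            ],
        },
    }

# With the official `kubernetes` Python client, the manifest could then be
# submitted roughly like this (requires cluster credentials, so not run here):
#   kubernetes.config.load_incluster_config()
#   kubernetes.client.CoreV1Api().create_namespaced_pod(
#       namespace="postcommit", body=make_build_pod_manifest(sha))
```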

### Testing Configuration

The testing configuration, meaning both the environment and the set of
projects under test, will be as close to the premerge configuration as
possible. We will be running all tests inside the same container with the
same scripts (the `monolithic-linux.sh` and `monolithic-windows.sh` scripts).
However, there will be one main difference between the premerge and postcommit
configurations. In the postcommit configuration we propose testing all projects
on every commit rather than only testing the projects that themselves changed
or had dependencies that changed. We propose this for two main reasons.

Firstly, Buildbot does not have support for heterogeneous build
configurations. Buildbot sends a notification when a build transitions from
passing to failing, so if a single build configuration tested a different set
of projects depending upon what files changed, an unrelated passing build
could reset the state to green and trigger repeat notifications for a failure
that is already known. For example, suppose an MLIR change lands and its
postcommit build fails, an unrelated clang-tidy change that passes all of its
tests lands afterwards, and then another MLIR change lands. A notification
would be sent out for the second MLIR change even though the failure is
pre-existing, because the clang-tidy change turned the build back to green in
between. This defeats the point of using Buildbot in the first place. Running
all projects on every commit keeps the build red until the underlying failure
is actually fixed.

Secondly, in premerge we explicitly do not test certain projects even when
their dependencies change. We do this because we suspect interactions
resulting in test failures would be quite rare. However, such failures are
possible, and having a postcommit configuration that catches these rare
failures would be useful.
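The notification behavior above can be sketched with a small simulation. This
is a simplified model of notify-on-transition behavior, not Buildbot's actual
implementation, and the build names and results are illustrative.

```python
# Simplified model: a notification is sent only when the build transitions
# from passing to failing. Not Buildbot's real logic; results are made up.

def notifications(results: list[tuple[str, bool]]) -> list[str]:
    """Return the builds that trigger a notification (green -> red)."""
    notified = []
    previous_passed = True  # assume the build starts out green
    for name, passed in results:
        if not passed and previous_passed:
            notified.append(name)
        previous_passed = passed
    return notified

# Heterogeneous configurations: the passing clang-tidy build (which did not
# run the broken MLIR tests) resets the state to green, so the second MLIR
# failure re-notifies even though it is the same pre-existing breakage.
heterogeneous = [("mlir-1", False), ("clang-tidy-1", True), ("mlir-2", False)]
print(notifications(heterogeneous))  # ['mlir-1', 'mlir-2']

# Testing all projects on every commit: the clang-tidy commit's build also
# runs the broken MLIR tests and stays red, so only one notification is sent.
all_projects = [("mlir-1", False), ("clang-tidy-1", False), ("mlir-2", False)]
print(notifications(all_projects))  # ['mlir-1']
```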