-
Notifications
You must be signed in to change notification settings - Fork 113
[CI] Add design document for post submit testing #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,169 @@ | ||||||
# Post Submit Testing | ||||||
|
||||||
## Introduction | ||||||
|
||||||
While this infrastructure is focused on premerge testing, it is also important | ||||||
to make sure that the specific configuration we are testing is tested post | ||||||
commit as well. This document outlines the motivation for the need to test this | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
configuration post commit, why we are utilizing this design over others, and | ||||||
how we plan on implementing this to ensure we get fast feedback scalably. | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Background/Motivation | ||||||
|
||||||
LLVM has two types of testing upstream: premerge and postcommit. The premerge | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The default assumption for this document is going to be that "LLVM" refers to upstream llvm, so you can just say "testing" I think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a bit redundant, but I'd prefer to keep it in there. There will probably be internal (to Google) audiences reading this who might not be as familiar with everything upstream and we have a lot of downstream CI, so keeping it explicit might make it more clear for someone and only be redundant for someone else. |
||||||
testing is performed using Github Actions every time a PR is updated before it | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
is merged. Premerge testing is performed using this infrastructure (specifically | ||||||
the `./premerge` folder in llvm-zorg). Landing a PR consists of squashing the | ||||||
changes into a single commit and adding that commit to the `main` branch in the | ||||||
monorepo. We care specifically about the state of the `main` branch because it | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
is what the community considers the canonical tree. Currently, commits can also | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
be added to the `main` branch by directly pushing to the main branch. After a | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
new commit lands in the `main` branch, postcommit testing is performed. Most | ||||||
postcommit testing is performed through the Buildbot infrastructure. The main | ||||||
Buildbot instance for LLVM has a web instance hosted at | ||||||
[lab.llvm.org](https:/lab.llvm.org). When a new commit lands in `main` the | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
Buildbot instance (sometimes referred to as the Buildbot master) will trigger | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should also mention that the current postcommit buildbots don't run separate instances for each commit -- they bundle multiple commits together. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This depends upon the configuration and bundling commits is explicitly discouraged in https://llvm.org/docs/HowToAddABuilder.html. I've added some text to describe this. |
||||||
many different build configurations. These configurations are defined in the | ||||||
llvm-zorg repository under the `buildbot/` folder. These configurations are run | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
on Buildbot workers that are hosted by the community. | ||||||
|
||||||
It is important that we test the premerge configuration postcommit as well. We | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Re-work this paragraph. Motivations should start by stating what the problem is you're trying to solve (sending false positive notifications to authors about builds/tests already failing at head). Once the problem is explained (including WHY it's important/impactful to solve this problem), then you explain how you plan to solve it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, the phrase "for certain types of automation" is so vague as to be virtually useless. I would either add more details or omit it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reworked the paragraph to talk about what we want to do in the end at the top and then go into the how. Removed the vague wording. |
||||||
need to be able to determine the state of `main` (in terms of whether the build | ||||||
passed/failed and what tests failed, if any) for certain types of automation. | ||||||
Postcommit testing enables easily checking the state of `main` at any given point | ||||||
in time. This data is crucial for figuring out which commit to revert/fix | ||||||
forward to get everything back to green. Having information on the state of | ||||||
`main` is also important for certain kinds of automation, like the planned | ||||||
premerge testing advisor that will let contributors know if tests failing in | ||||||
their PR are already failing at `main` and that it should be safe to merge | ||||||
despite the given failures. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are some very large paragraphs. The content is fine but a line break between each "point" made would make it easier to read through. |
||||||
|
||||||
## Design | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
The LLVM Premerge system has two clusters, namely the central cluster in the GCP | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
zone `us-central1-a` and the west cluster in the GCP zone `us-west1`. We run | ||||||
two clusters in different zones for redundancy so that if one fails, we can | ||||||
still run jobs. For postcommit testing, we plan on setting up builders attached | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
to the Buildbot master described above. We will run one builder on the central | ||||||
cluster and one in the west cluster. This ensures the configuration is highly | ||||||
available (able to tolerate an entire cluster going down), similar to the | ||||||
premerge testing. The builders will be configured to use a script that will | ||||||
launch builds on each commit to `main` in a similar configuration to the one run | ||||||
premerge. The test configuration is intended to be close to the premerge | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "a similar configuration to the one run premerge" => "the same configuration used for premerge testing (for PR commits); for direct Replace the sentence "The test configuration is intended to be close to the premerge configuration but will be different in some key ways." with: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think the distinction between directly pushed commits/commits from PRs is useful here. I've changed up the wording to say the commit will be tested as if it is going through the premerge pipeline, with some minor differences. Ack on the second point, changed up the wording. |
||||||
configuration but will be different in some key ways. The differences and | ||||||
motivation for them is described more thoroughly in the | ||||||
[testing configuration](#testing-configuration) section. These builds will be | ||||||
run inside containers that are distributed onto the cluster inside kubernetes | ||||||
pods (the fundamental schedulable unit inside kubernetes). This allows for | ||||||
kubernetes to handle details like what machine a build should run on. Allowing | ||||||
kubernetes to handle these details also enables GKE to autoscale the node pools | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
so we are not paying for uneeded capacity. Launching builds inside pods | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
also allows for each builder to handle multiple builds at the same time. | ||||||
|
||||||
In terms of the full flow, any commit (which can be from direct pushes or | ||||||
merging pull requests) pushed to the LLVM monorepo will get detected by the | ||||||
buildbot master. The Buildbot master will invoke Buildbot workers running on our | ||||||
clusters. These Buildbot workers will use custom builders to launch a build | ||||||
wrapped in a kubernetes pod and report the results back to the buildbot master. | ||||||
When the job is finished, the pod will complete and capacity will be available | ||||||
for another build, or if there is nothing left to test GKE will see that there | ||||||
is nothing running on one of the nodes and downscale the node pool. | ||||||
|
||||||
### Annotated Builder | ||||||
|
||||||
llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder. | ||||||
AnnotatedBuilders allow for the build to be driven using a custom python script | ||||||
rather than directly dictating the shell commands that should be run to perform | ||||||
the build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the | ||||||
cluster. AnnotatedBuilder based builders also enable deploying changes without | ||||||
needing to restart the buildbot master. Without this, we have to wait for an | ||||||
administrator of the LLVM buildbot master to restart it before our changes get | ||||||
deployed. This could significantly delay updates or responses to incidents, | ||||||
especially before the system is fully stable and we need to modify it relatively | ||||||
frequently. | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
### Build Distribution | ||||||
|
||||||
We want to be able to take advantage of the autoscaling functionality of the | ||||||
new cluster to efficiently utilize resources. To do this, we plan on having the | ||||||
AnnotatedBuilder script launch builds as kubernetes pods. This allows for | ||||||
kubernetes to assign the builds to nodes and also allows autoscaling through | ||||||
the same mechanism that Github autoscales. This allows for us to quickly | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "Github autoscales" => "Github uses to autoscale". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: "This allows...also allows...This allows..." Might read better if you didn't use 'this allows" so frequently in such a short space. (Some alternatives: "enables", "lets us") There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also I think you mean the same mechanism that LLVM testing on GitHub uses, not GitHub the company specifically uses it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarified to Github ARC so that it's clear it's the software running on our cluster. |
||||||
process builds at peak times and not pay for extra capacity when commit | ||||||
traffic is quiet. This is essentially for ensuring our resource use is | ||||||
efficient while still providing fast feedback. | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Using the kubernetes API inside of a python script to launch builds does add | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
some complexity. However, we do not believe we need too much added | ||||||
complexity to achieve our goal here and this allows for vastly more efficient | ||||||
resource usage. | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
### Testing Configuration | ||||||
|
||||||
The testing configuration will be as close to the premerge configuration as | ||||||
possible. We will be running all tests inside the same container with the | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. First sentence. "The actual testing configuration will be the same as that used by the premerge testing." There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm using testing configuration to include the set of projects that we're testing (which will be different) and the environment. The environment should be as close as possible, but I don't think there's any easy way to guarantee that it's the same. |
||||||
same scripts (the `monolithic-linux.sh` and `monolithic-windows.sh` scripts). | ||||||
However, there will be one main difference between the premerge and postcommit | ||||||
configuration. In the postcommit configuration we propose testing all projects | ||||||
on every commit rather than only testing the projects that themselves changed | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. line 107 should become: "testing. In the postcommit testing, we propose testing all projects" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've updated the text slightly to make it a bit more clear. I believe you were trying to get at replacing configuration with testing? I think configuration is more accurate here and more consistent with the rest of the paragraph (especially after adding the definition at the beginning). I can change it up if it is inhibiting clarity though. |
||||||
or had dependencies that changed. We propose this for two main reasons. | ||||||
Firstly, Buildbot does not have support for heterogenous build configurations. | ||||||
This means that testing a different set of projects in a single build | ||||||
configuration depending upon what files changed could easily produce many | ||||||
more notifications if certain configurations were failing and some were | ||||||
passing which defeats the point of using Buildbot in the first place. For | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
example, if there is a MLIR change that fails, an unrelated clang-tidy change | ||||||
that passes all tests that lands afterwards, and then another MLIR change, a | ||||||
notification will also be sent out on the second MLIR change because the | ||||||
clang-tidy change turned the build back to green. We also explicitly do not | ||||||
test certain projects even though their dependencies change, and while we do | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. THe MLIR example needs clarification: 1). You need to make it clear that you're talking about postsubmit testing only. 2). You need to explain more about turning red/green -- that reach postsubmit run will mark all tests/builds as either red or green, and THAT is why you need to run all the tests every time: to avoid passing in one part from incorrectly marking failures in a different part as passing when they're not. At least that's what I think your example is trying to explain, but it's too terse to make that clear to someone who's not already familiar with this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The sentence "We also explicitly do not test certain projects..." I think that sentence is referring to premerge testing rather than postsubmit? But you need` to make that clear. Also this becomes as run-on sentence. I'd take the last phrase and make it a separate sentence. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Rewrote the example to better explain the exact sequence of events and the logic controlling them. I can add some background on buildbot notifications in the beginning if that's helpful, but I hope the new example should make things clear enough. |
||||||
this because we suspect interactions resulting in test failures would be quite | ||||||
rare, it is possible, and having a postcommit configuration catch these rare | ||||||
failures would be useful. | ||||||
|
||||||
### Data Storage | ||||||
|
||||||
The hosted Buildbot master instance at [lab.llvm.org](https://lab.llvm.org) | ||||||
contains results for all recent postcommit runs. We plan on quetying the results | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
from the buildbot master because they are already available and that is where | ||||||
they will natively be reported after the infrastructure is set up. Buildbot | ||||||
supports a [REST API](https://docs.buildbot.net/latest/developer/rest.html) that | ||||||
would allow for easily querying the state of a commit in `main`. | ||||||
|
||||||
For the proposed premerge advisor that tells the user what tests/build failures | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: this refers to a "proposed premerge advisor" but that's the first time this is mentioned in the doc, and the casual reader having no context (me) will be lost. Is there a link to some proposal that should be added here? This section feel like a bit out-of-place for a document describing the "post-merge testing" flow IMO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (this feature looks really nice otherwise) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's mentioned briefly in the last paragraph of the background/motivation section. There is no proposal yet, but I'm hoping to write an RFC/document soon to share with the community once I've fleshed out the details. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems to me it would fit better the flow here to phrase it this way I think:
Suggested change
|
||||||
they can safely ignore, we need to know what is currently failing on `main`. | ||||||
Each pull request is tested as if it was merged into main, which means the | ||||||
commit underneath the PR is very recent. If a premerge run fails, the premerge | ||||||
advisor will find the commit from `main` the PR is being tested on. It will then | ||||||
query the Buildbot master using the REST API for the status of that commit. | ||||||
It can then report the appropriate status to the user. | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Alternatives Considered | ||||||
|
||||||
Originally, we were looking at running postcommit testing through Github | ||||||
Actions as well. This is primarily due to it being easy (a single line | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
change in the Github Actions workflow config) and also easy integration | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
with the Github API for implementation of the premerge testing advisor. | ||||||
More detailed motivation for the doing postcommit testing directly through | ||||||
Github is available in the [discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124) | ||||||
where we proposed doing this. We eventually decided against implementation in | ||||||
this way for a couple of reasons: | ||||||
|
||||||
1. Nonstandard - The standard postcommit testing infrastructure for LLVM is | ||||||
through Buildbot. Doing postcommit testing for the premerge configuration | ||||||
through Github would represent a significant deparature from this. This means | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
we are leaving behind some common infrastructure and are also forcing a new | ||||||
unfamiliar postcommit interface on LLVM contributors. | ||||||
2. Notifications - This is the biggest issue. Github currently gives very | ||||||
little control over the notifications that are sent out when the build | ||||||
fails or gets cancelled. This is specifically a problem with Github sending | ||||||
out notifications for build failures even if the previous build has failed. | ||||||
This can easily create a lot of warning fatigue which is something we are | ||||||
putting a lot of effort in to avoid so that the premerge system is | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
percieved as reliable, people trust its results, and most importantly, | ||||||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
people pay attention to failures when they do occur because they are | ||||||
caused by the patch getting the notification and are actionable. | ||||||
3. Customization - Buildbot can be customized around issues like notifications | ||||||
whereas Github cannot. Github is not particularly responsive on feature | ||||||
requests and their notification story has been poor for a while, so their | ||||||
lack of customization is a strategic risk. |
Uh oh!
There was an error while loading. Please reload this page.