-
Notifications
You must be signed in to change notification settings - Fork 121
[CI] Write Routine Maintenance Document #578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
boomanaiden154
wants to merge
3
commits into
llvm:main
Choose a base branch
from
boomanaiden154:routine-maintenance-doc
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+136
−0
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| # Routine Maintenance | ||
|
|
||
| The cluster requires routine maintenance to ensure it stays functioning. | ||
| Ideally, this maintenance is proactive, being performed before any issues arise | ||
| from neglect. This document aims to describe the routine maintenance needed | ||
| and how to perform it. | ||
|
|
||
| ## Version Updates | ||
|
|
||
| The only routine maintenance that we currently do on the premerge | ||
| infrastructure are version updates. The infrastructure utilizes a lot of | ||
| different software, and all of it needs to be kept reasonably up to date | ||
| to ensure things keep working smoothly and that we are not vulernable to | ||
| security issues. | ||
|
|
||
| ### Getting Notified of Version Updates | ||
|
|
||
| There are several pieces of software that we want to upgrade relatively | ||
| quickly (like the Github Actions Runner binary). Because of that, knowing | ||
| when a new version is released is important. The easiest way to do this is | ||
| to subscribe to new release notifications on Github. If you go to a | ||
| repository, you can click on the watch button, select custom, and then | ||
| select releases. Any new releases for that repository will show up in | ||
| your Github notifications. | ||
|
|
||
| Releases from the following repositories generally require an update on the | ||
| premerge infrastructure side: | ||
|
|
||
| 1. https://github.com/actions/actions-runner-controller | ||
| 2. https://github.com/actions/runner | ||
| 3. https://github.com/llvm/llvm-project | ||
|
|
||
| ### Github Actions Runner Binary | ||
|
|
||
| The runner binary is what runs inside the containers on the cluster to execute | ||
| jobs and report status results back to Github. The runner binary has a | ||
| relatively short time horizon (about six months) before it becomes unsupported | ||
| by Github and it will no longer work. | ||
|
|
||
| When a new runner binary is released, there are three places that need to | ||
| be updated in a PR against the LLVM monorepo: | ||
|
|
||
| 1. The Linux CI container - The `Dockerfile` at | ||
| `.github/workflows/containers/github-action-ci/Dockerfile` has an environment | ||
| variable towards the bottom of the file called `GITHUB_RUNNER_VERSION` that | ||
| needs to be updated to the new version. | ||
| 2. The Windows CI container - The `Dockerfile` at | ||
| `.github/workflows/containers/github-action-ci-windows/Dockerfile` has an | ||
| argument called `RUNNER_VERSION` near the bottom of the file that needs to | ||
| be updated to the new version. | ||
| 3. The libc++ CI container - The `docker-compose` manifest at | ||
| `libcxx/utils/ci/docker-compose.yml` needs to be updated by updating the | ||
| `GITHUB_RUNNER_VERSION` variable to pull in the latest | ||
| runner binary using the [libc++ instructions](https://libcxx.llvm.org/Contributing.html#updating-the-ci-testing-container-images) | ||
|
|
||
| ### Other Container Image Software | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| The container images also contain many other pieces of software critical | ||
| for building LLVM, like CMake, ninja, and the toolchain itself. Keeping | ||
| most of these up to date is ideal. | ||
|
|
||
| A large amount of the software comes from the Linux distribution and thus does | ||
| not need to be explicitly updated. We prefer to install software from the | ||
| distribution when possible. However, this does mean that distribution updates | ||
| are quite important. To update the distribution for the Linux container, perform | ||
| the following steps: | ||
|
|
||
| 1. Modify `.github/workflows/containers/github-action-ci/Dockerfile` locally | ||
| to use the latest `ubuntu:xx.04` image. | ||
| 2. Ensure that `monolithic-linux.sh` with all options enabled runs successfully. | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| Push any needed changes. You can see the canonical invocation with all options | ||
| in `zorg/buildbot/builders/annotated/premerge/dispatch_job.py` within the | ||
| `llvm-zorg` repository. | ||
| 3. Push the updated `Dockerfile` and update the workflow in | ||
| `.github/workflows/build-ci-container.yml` to use the correct image name, specifically | ||
| by updating the `container-name` value to use the correct ubuntu version. | ||
| 4. Update the runner configuration in zorg (`premerge/linux_runners_values.yaml`) | ||
| to point to the new image. | ||
|
|
||
| Updating explicitly versioned software (just LLVM in the Linux container, but | ||
| most software in the Windows container) just requires bumping the version number | ||
| and pushing the new image to the monorepo. In the Linux container, the LLVM | ||
| version can be bumped by changing the `LLVM_VERSION` environment variable. In | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| the Windows container, versions of many of the packages we install are | ||
| controlled by the `--version` flag passed to `choco install` commands. | ||
|
|
||
| ### Actions Runner Controller | ||
|
|
||
| The actions runner controller orchestrates all of the jobs on the cluster. | ||
| Given its key role, upgrading it needs to be done carefully using the | ||
| [described steps](cluster-management.md#upgradingresetting-github-arc) to | ||
| avoid any downtime. | ||
|
|
||
| It is advised to do this during a portion of the day with light traffic as | ||
| any upgrade will involve having an entire cluster down at times, which reduces | ||
| capacity in half. | ||
|
|
||
| ### Windows Edition | ||
|
|
||
| Whenever a new Windows Server datacenter edition (eg 2025) is supported by GKE, | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| all of the windows infrastructure needs to be updated. This involves several | ||
| steps that will take significant time: | ||
|
|
||
| 1. Modify `.github/workflows/containers/github-action-ci-windows/Dockerfile` | ||
| locally to get it building on the new Windows Server version. This requires | ||
| having a host at that version as Windows Server containers can only run on a | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| host with the same edition. Obtaining a host is most easily done by creating | ||
| a GCP virtual machine in the premerge project, for this with access. | ||
| Then, changing the `FROM` line at the top of the | ||
| `Dockerfile` has previously been enough. | ||
| 2. When that is ready, push your changes in a PR along with changes to | ||
| `.github/workflows/build-ci-container-windows.yml` so that it uses the new | ||
| Windows Server edition. The only changes needed in the workflow file assuming | ||
| things go smoothly are the `runs-on` lines to reference the new Windows version. | ||
| 3. Test locally that the pushed container image can run the | ||
| `monolithic-windows.sh` script with all projects enabled successfully. Make | ||
| and submit any changes that are needed. To test the script, | ||
| use `docker run` to launch the new container from the LLVM packages list | ||
| (https://github.com/orgs/llvm/packages). Then, clone the repository into | ||
| the container with | ||
| `git clone --config core.autocrlf=false https://github.com/llvm/llvm-project`. | ||
| Move into that directory, and then run the `monolithic-windows.sh` script. | ||
| An invocation should look like | ||
| `.ci/monolithic-windows.sh "clang;clang-tools-extra;libclc;lld;llvm;mlir;polly" "check-clang check-clang-tools check-lld check-llvm check-mlir check-polly"`. | ||
| The most recent version of the command should be in | ||
| `buildbot/builders/annotated/premerge/dispatch_job.py` within the `llvm-zorg` | ||
| repository. | ||
| 4. Duplicate all of the versioned Windows resources in the terraform | ||
boomanaiden154 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| configuration and apply it. This involves duplicating everything referencing | ||
| the old windows version in the `premerge/` folder in the llvm-zorg repo and | ||
| updating it to be named for/reference the new windows version. | ||
| 5. Switch the workflow in `.github/workflows/premerge.yaml` to the new | ||
| runner set. | ||
| 6. Sunset the existing runnerset in 1-2 weeks by deleting it and any other | ||
| resources specific to the old Windows version to give time for people utilizing | ||
| stacked PRs to rebase. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.