[libc++] Switch over to the LLVM-wide premerge test runners. #147794
Conversation
Update the premerge testing system to use the LLVM-wide premerge infrastructure. Also remove libcxx-restart-preempted-jobs.yaml, as this should no longer be needed.
Just testing this for now. Don't review yet.
There seems to be a problem: things are being queued rather than running. I need to figure this out.
You should rebase onto
It looks like the runner version in your container image is too old for the runner to connect to GitHub. We are working on a solution to update the runner version in the container image (it's a work in progress). Ideally we will update the runner version without having to bump any of the tool or library versions.
To put a name to it, this is the issue we discussed during the initial libc++ monthly meeting. I struggled to install the runner binaries into the container image manually because, at the time, the manual installation didn't support auto-scaling ephemeral runners. I'm hopeful the additional maturity of GitHub Actions has resolved this.
It seems like that has changed. We've been manually installing the runner binaries for the monorepo-wide premerge since we built out the infrastructure.
I would take a look at the documentation regarding the minimal runner container image. I think we should try to mirror that setup as closely as we can. I worry about the possible subtle effects of omitting something like that. But otherwise I think the LLVM image approach would suit libc++ as well.
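As a purely illustrative sketch (not part of this change): if something does turn out to be missing from the new runner image, a job could keep the new runner group while temporarily re-pinning the old container. The runner label, image tag, and checkout pin below are all taken from the diff in this PR; the fragment would sit under the existing `jobs:` key.

```yaml
# Hypothetical fallback sketch: stay on the LLVM-wide runner group but
# re-pin the previous libc++ builder container if a needed tool is missing
# from the runner image.
stage1:
  if: github.repository_owner == 'llvm'
  runs-on: llvm-premerge-libcxx-runners
  container: ghcr.io/llvm/libcxx-linux-builder:2b57ebb50b6d418e70382e655feaa619b558e254
  steps:
    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
```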
This is looking awesome!
This looks ready to me. Once it's converted from a draft, I'll add my stamp.
Done. :-)
@llvm/pr-subscribers-github-workflow

Author: None (cmtice)

Changes

Update the premerge testing system to use the LLVM-wide premerge infrastructure. Also remove libcxx-restart-preempted-jobs.yaml, as it should no longer be needed.

Full diff: https://github.com/llvm/llvm-project/pull/147794.diff

2 Files Affected:
- (modified) .github/workflows/libcxx-build-and-test.yaml
- (removed) .github/workflows/libcxx-restart-preempted-jobs.yaml
diff --git a/.github/workflows/libcxx-build-and-test.yaml b/.github/workflows/libcxx-build-and-test.yaml
index f0bdf6c0b5899..ec937de02ca1a 100644
--- a/.github/workflows/libcxx-build-and-test.yaml
+++ b/.github/workflows/libcxx-build-and-test.yaml
@@ -36,8 +36,7 @@ concurrency:
jobs:
stage1:
if: github.repository_owner == 'llvm'
- runs-on: libcxx-self-hosted-linux
- container: ghcr.io/llvm/libcxx-linux-builder:b060022103f551d8ca1dad84122ef73927c86512
+ runs-on: llvm-premerge-libcxx-runners
continue-on-error: false
strategy:
fail-fast: false
@@ -74,8 +73,7 @@ jobs:
**/crash_diagnostics/*
stage2:
if: github.repository_owner == 'llvm'
- runs-on: libcxx-self-hosted-linux
- container: ghcr.io/llvm/libcxx-linux-builder:2b57ebb50b6d418e70382e655feaa619b558e254
+ runs-on: llvm-premerge-libcxx-runners
needs: [ stage1 ]
continue-on-error: false
strategy:
@@ -149,21 +147,20 @@ jobs:
'generic-static',
'bootstrapping-build'
]
- machine: [ 'libcxx-self-hosted-linux' ]
+ machine: [ 'llvm-premerge-libcxx-runners' ]
include:
- config: 'generic-cxx26'
- machine: libcxx-self-hosted-linux
+ machine: llvm-premerge-libcxx-runners
- config: 'generic-asan'
- machine: libcxx-self-hosted-linux
+ machine: llvm-premerge-libcxx-runners
- config: 'generic-tsan'
- machine: libcxx-self-hosted-linux
+ machine: llvm-premerge-libcxx-runners
- config: 'generic-ubsan'
- machine: libcxx-self-hosted-linux
+ machine: llvm-premerge-libcxx-runners
# Use a larger machine for MSAN to avoid timeout and memory allocation issues.
- config: 'generic-msan'
- machine: libcxx-self-hosted-linux
+ machine: llvm-premerge-libcxx-runners
runs-on: ${{ matrix.machine }}
- container: ghcr.io/llvm/libcxx-linux-builder:2b57ebb50b6d418e70382e655feaa619b558e254
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: ${{ matrix.config }}
diff --git a/.github/workflows/libcxx-restart-preempted-jobs.yaml b/.github/workflows/libcxx-restart-preempted-jobs.yaml
deleted file mode 100644
index accb84efb5c90..0000000000000
--- a/.github/workflows/libcxx-restart-preempted-jobs.yaml
+++ /dev/null
@@ -1,158 +0,0 @@
-name: Restart Preempted Libc++ Workflow
-
-# The libc++ builders run on preemptable VMs, which can be shutdown at any time.
-# This workflow identifies when a workflow run was canceled due to the VM being preempted,
-# and restarts the workflow run.
-
-# We identify a canceled workflow run by checking the annotations of the check runs in the check suite,
-# which should contain the message "The runner has received a shutdown signal."
-
-# Note: If a job is both preempted and also contains a non-preemption failure, we do not restart the workflow.
-
-on:
- workflow_run:
- workflows: [Build and Test libc\+\+]
- types:
- - completed
-
-permissions:
- contents: read
-
-jobs:
- restart:
- if: github.repository_owner == 'llvm' && (github.event.workflow_run.conclusion == 'failure')
- name: "Restart Job"
- permissions:
- statuses: read
- checks: write
- actions: write
- runs-on: ubuntu-24.04
- steps:
- - name: "Restart Job"
- uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea #v7.0.1
- with:
- script: |
- // The "The run was canceled by" message comes from a user manually canceling a workflow
- // the "higher priority" message comes from github canceling a workflow because the user updated the change.
- // And the "exit code 1" message indicates a genuine failure.
- const failure_regex = /(Process completed with exit code 1.)/
- const preemption_regex = /(The runner has received a shutdown signal)|(The operation was canceled)/
-
- const wf_run = context.payload.workflow_run
- core.notice(`Running on "${wf_run.display_title}" by @${wf_run.actor.login} (event: ${wf_run.event})\nWorkflow run URL: ${wf_run.html_url}`)
-
-
- async function create_check_run(conclusion, message) {
- // Create a check run on the given workflow run to indicate if
- // we are restarting the workflow or not.
- if (conclusion != 'success' && conclusion != 'skipped' && conclusion != 'neutral') {
- core.setFailed('Invalid conclusion: ' + conclusion)
- }
- await github.rest.checks.create({
- owner: context.repo.owner,
- repo: context.repo.repo,
- name: 'Restart Preempted Job',
- head_sha: wf_run.head_sha,
- status: 'completed',
- conclusion: conclusion,
- output: {
- title: 'Restarted Preempted Job',
- summary: message
- }
- })
- }
-
- console.log('Listing check runs for suite')
- const check_suites = await github.rest.checks.listForSuite({
- owner: context.repo.owner,
- repo: context.repo.repo,
- check_suite_id: context.payload.workflow_run.check_suite_id,
- per_page: 100 // FIXME: We don't have 100 check runs yet, but we should handle this better.
- })
-
- check_run_ids = [];
- for (check_run of check_suites.data.check_runs) {
- console.log('Checking check run: ' + check_run.id);
- if (check_run.status != 'completed') {
- console.log('Check run was not completed. Skipping.');
- continue;
- }
- if (check_run.conclusion != 'failure') {
- console.log('Check run had conclusion: ' + check_run.conclusion + '. Skipping.');
- continue;
- }
- check_run_ids.push(check_run.id);
- }
-
- has_preempted_job = false;
-
- for (check_run_id of check_run_ids) {
- console.log('Listing annotations for check run: ' + check_run_id);
-
- annotations = await github.rest.checks.listAnnotations({
- owner: context.repo.owner,
- repo: context.repo.repo,
- check_run_id: check_run_id
- })
-
- // For temporary debugging purposes to see the structure of the annotations.
- console.log(annotations);
-
- has_failed_job = false;
- saved_failure_message = null;
-
- for (annotation of annotations.data) {
- if (annotation.annotation_level != 'failure') {
- continue;
- }
-
- const preemption_match = annotation.message.match(preemption_regex);
-
- if (preemption_match != null) {
- console.log('Found preemption message: ' + annotation.message);
- has_preempted_job = true;
- }
-
- const failure_match = annotation.message.match(failure_regex);
- if (failure_match != null) {
- has_failed_job = true;
- saved_failure_message = annotation.message;
- }
- }
- if (has_failed_job && (! has_preempted_job)) {
- // We only want to restart the workflow if all of the failures were due to preemption.
- // We don't want to restart the workflow if there were other failures.
- //
- // However, libcxx runners running inside docker containers produce both a preemption message and failure message.
- //
- // The desired approach is to ignore failure messages which appear on the same job as a preemption message
- // (An job is a single run with a specific configuration, ex generic-gcc, gcc-14).
- //
- // However, it's unclear that this code achieves the desired approach, and it may ignore all failures
- // if a preemption message is found at all on any run.
- //
- // For now, it's more important to restart preempted workflows than to avoid restarting workflows with
- // non-preemption failures.
- //
- // TODO Figure this out.
- core.notice('Choosing not to rerun workflow because we found a non-preemption failure' +
- 'Failure message: "' + saved_failure_message + '"');
- await create_check_run('skipped', 'Choosing not to rerun workflow because we found a non-preemption failure\n'
- + 'Failure message: ' + saved_failure_message)
- return;
- }
- }
-
- if (!has_preempted_job) {
- core.notice('No preempted jobs found. Not restarting workflow.');
- await create_check_run('neutral', 'No preempted jobs found. Not restarting workflow.')
- return;
- }
-
- core.notice("Restarted workflow: " + context.payload.workflow_run.id);
- await github.rest.actions.reRunWorkflowFailedJobs({
- owner: context.repo.owner,
- repo: context.repo.repo,
- run_id: context.payload.workflow_run.id
- })
- await create_check_run('success', 'Restarted workflow run due to preempted job')
Done -- it's ready for official review now. :-)
Just FYI: the container image we're using here is a bit of a hack, since we needed to bump the runner binary version. I took the existing container, deleted the runner binary, reinstalled it, and uploaded the result to the registry so that we could use it:
This shouldn't be an issue when building the next container, as #147831 has already landed.
LGTM.
Maybe not worth reiterating at this point, since I think everyone is on the same page, but let's make sure to wait until the libc++ maintainers have taken a look before landing.
I'm comfortable approving this for the project since rolling it back is trivial.
We should monitor the change to ensure it can handle the load, but otherwise I think this is likely to stick.
There are still a few things we should figure out after this lands. In particular: how will the container images get updated, who's responsible, and what can we do to make this process as automated as possible? As it stands now, upgrading the container is still a manual process and a bit of a pain. With some work we can clean up our Dockerfile to better support rebuilding without affecting the compiler/tool versions. Until then, the process remains manual, and we should make sure it happens on a regular cadence.
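To make that concrete, here is a rough sketch of what an automated rebuild could look like: a scheduled workflow that rebuilds and pushes the builder image with the runner version pinned as a build argument, so compiler and tool versions stay untouched. The workflow itself, the Dockerfile path, and the RUNNER_VERSION build argument are assumptions for illustration only, not existing pieces of the repository.

```yaml
# Hypothetical sketch only: none of this exists in the repository today.
# It assumes the builder Dockerfile accepts a RUNNER_VERSION build argument.
name: Rebuild libc++ builder image (sketch)
on:
  schedule:
    - cron: '0 6 * * 1'  # weekly rebuild, Monday 06:00 UTC
  workflow_dispatch:
permissions:
  packages: write
jobs:
  rebuild-image:
    if: github.repository_owner == 'llvm'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Build and push builder image
        run: |
          # Log in to the GitHub container registry with the workflow token.
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          # Rebuild with only the runner binary version bumped; the Dockerfile
          # path and build-argument name are assumptions.
          docker build \
            --build-arg RUNNER_VERSION="<pinned runner release>" \
            -t "ghcr.io/llvm/libcxx-linux-builder:${GITHUB_SHA}" \
            -f libcxx/utils/ci/Dockerfile .
          docker push "ghcr.io/llvm/libcxx-linux-builder:${GITHUB_SHA}"
```

Where such a workflow should live, and who owns it, is exactly the ownership question raised above.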
Do we have documentation explaining the names and purposes of the three runner groups, and how libc++ developers should interact with or update them?
Not yet -- where should this documentation go?
I was hoping to get to this refactoring by the end of the week.
Probably under libcxx/docs. I don't have a strong preference. I'm not sure if it fits well into existing files/sections, but in that case you can add a new rst file.
Amazing! It's been a long-standing wart.
And now we just wait for users to rebase their pull requests (I forgot about that part). I'm tempted to turn down the existing runners (but not dismantle them) to "encourage" users to rebase. @ldionne, would you be OK with this approach?
They shouldn't need to. GitHub PRs get tested as if they were merged into
This is certainly true of some workflows, but I don't know if it's true for this one. When we were testing this change, it used the workflow from the PR. I also tried re-running the nightly build, and it used the old workflow. Here's a test PR I created: #148029
EDIT: So you're totally right. It's weird, though: if you view the "workflow" file on the GitHub Actions page for the PR, you'll see the old runner names, but if you look at the runners actually being assigned, they're from the new runner groups.
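One low-tech way to confirm which machine actually picked up a job, regardless of what the rendered workflow file shows, is to print the built-in runner context from a step. This is just a debugging sketch, not something the PR adds:

```yaml
# Hypothetical debugging step: echo the runner that was actually assigned.
- name: Show assigned runner
  run: echo "Assigned runner: ${{ runner.name }} (${{ runner.os }}/${{ runner.arch }})"
```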
Since @boomanaiden154 corrected my mistake about which runners are used for old changes, I'll propose a new plan. I'm going to wait 24 hours, then disable the old runners entirely. I'll watch the CI over the weekend, and if the old runners become needed, I'll be ready to scale up. Then we can discuss when to remove the old runner setup for good.
@EricWF I won't oppose your plan; however, I would suggest being less aggressive and waiting maybe a week before we touch anything. I don't think we lose anything by waiting a bit more, but we potentially make it easier to Ctrl-Z in case things go wrong. Your decision, though!
Thanks a lot to both @boomanaiden154 and @cmtice for getting this ready and merged; I'm excited to see the new system in action!
Update the premerge testing system to use the LLVM-wide premerge infrastructure. Also remove libcxx-restart-preempted-jobs.yaml, as this should no longer be needed.