Skip to content

Commit 3dab86d

Browse files
committed
Check build completion solely on workflow complete webhook events
1 parent 80b2a1f commit 3dab86d

File tree

7 files changed

+117
-105
lines changed

7 files changed

+117
-105
lines changed

docs/design.md

Lines changed: 14 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -49,10 +49,13 @@ repository:
4949
(based on the `timeout` configured for the repository), it will cancel them.
5050
- Reload user permissions from the Team API.
5151
- Reload `rust-bors.toml` config for the repository from its main branch.
52+
- Reload the mergeability status of open PRs from GitHub.
53+
- Sync the status of PRs between the DB and GitHub.
54+
- Run the merge queue.
5255

5356
## Concurrency
5457
The bot is currently listening for GitHub webhooks concurrently, however it handles all commands serially, to avoid
55-
race conditions. This limitation is expected to be lifted in the future.
58+
race conditions. This limitation might be lifted in the future.
5659

5760
## Try builds
5861
A try build means that you execute a specific CI job on a PR (without merging the PR), to test if the job passes C
@@ -202,6 +205,13 @@ With [homu](https://github.com/rust-lang/homu) (the old bors implementation), Gi
202205
to use a "fake" job that marked the whole CI workflow as succeeded or failed, to signal to bors if it should consider
203206
the workflow to be OK or not.
204207

205-
The new bors uses a different approach. It asks GitHub which [check suites](https://docs.github.com/en/rest/checks/suites)
206-
are attached to a given commit, and then it waits until all of these check suites complete (or until a timeout is reached).
207-
Thanks to this approach, there is no need to introduce fake CI jobs.
208+
The new bors uses a different approach. In an earlier implementation, it used to ask GitHub which [check suites](https://docs.github.com/en/rest/checks/suites) were attached to a given commit, and then it waited until all of these check suites were completed (or until a timeout was reached). However, this was too "powerful" for rust-lang/rust, where we currently have only a single CI workflow per commit, and it was producing certain race conditions, where GitHub would send us a webhook that a check suite was completed, but when we then asked the GitHub API about the status of the check suite, it was still marked as pending.
209+
210+
To make the implementation more robust, it now behaves as follow:
211+
- We listen for the workflow completed webhook.
212+
- When it is received, we ask GitHub what are all the workflows attached to the workflow's check suite.
213+
- If all of them are completed, we mark the build as finished.
214+
215+
Note that we explicitly do not read the "Check suite was completed" webhook, because it can actually be received *before* a webhook that tells us that the last workflow of that check suite was completed. If that happens, we could mark a build as completed without knowing the final conclusion of its workflows. That is not a big problem, but it would mean that we sometimes cannot post the real status of a workflow in the "build completed" bors comment on GitHub. So instead we just listen for the completed workflows.
216+
217+
In any case, with new bors there is no need to introduce fake conclusion CI jobs.

src/bors/event.rs

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -31,9 +31,9 @@ pub enum BorsRepositoryEvent {
3131
/// when a branch is deleted or when a tag is deleted.
3232
PushToBranch(PushToBranch),
3333
/// A workflow run on Github Actions or a check run from external CI system has been started.
34-
WorkflowStarted(WorkflowStarted),
34+
WorkflowStarted(WorkflowRunStarted),
3535
/// A workflow run on Github Actions or a check run from external CI system has been completed.
36-
WorkflowCompleted(WorkflowCompleted),
36+
WorkflowCompleted(WorkflowRunCompleted),
3737
}
3838

3939
impl BorsRepositoryEvent {
@@ -162,7 +162,7 @@ pub struct PushToBranch {
162162
}
163163

164164
#[derive(Debug)]
165-
pub struct WorkflowStarted {
165+
pub struct WorkflowRunStarted {
166166
pub repository: GithubRepoName,
167167
pub name: String,
168168
pub branch: String,
@@ -173,7 +173,7 @@ pub struct WorkflowStarted {
173173
}
174174

175175
#[derive(Debug)]
176-
pub struct WorkflowCompleted {
176+
pub struct WorkflowRunCompleted {
177177
pub repository: GithubRepoName,
178178
pub branch: String,
179179
pub commit_sha: CommitSha,

src/bors/handlers/workflow.rs

Lines changed: 49 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -5,20 +5,19 @@ use octocrab::params::checks::CheckRunConclusion;
55
use octocrab::params::checks::CheckRunStatus;
66

77
use crate::PgDbClient;
8-
use crate::bors::CheckSuiteStatus;
9-
use crate::bors::RepositoryState;
108
use crate::bors::comment::{try_build_succeeded_comment, workflow_failed_comment};
11-
use crate::bors::event::{WorkflowCompleted, WorkflowStarted};
9+
use crate::bors::event::{WorkflowRunCompleted, WorkflowRunStarted};
1210
use crate::bors::handlers::is_bors_observed_branch;
1311
use crate::bors::handlers::labels::handle_label_trigger;
1412
use crate::bors::merge_queue::AUTO_BRANCH_NAME;
1513
use crate::bors::merge_queue::MergeQueueSender;
14+
use crate::bors::{RepositoryState, WorkflowRun};
1615
use crate::database::{BuildStatus, WorkflowStatus};
1716
use crate::github::LabelTrigger;
1817

1918
pub(super) async fn handle_workflow_started(
2019
db: Arc<PgDbClient>,
21-
payload: WorkflowStarted,
20+
payload: WorkflowRunStarted,
2221
) -> anyhow::Result<()> {
2322
if !is_bors_observed_branch(&payload.branch) {
2423
return Ok(());
@@ -67,7 +66,7 @@ pub(super) async fn handle_workflow_started(
6766
pub(super) async fn handle_workflow_completed(
6867
repo: Arc<RepositoryState>,
6968
db: Arc<PgDbClient>,
70-
mut payload: WorkflowCompleted,
69+
mut payload: WorkflowRunCompleted,
7170
merge_queue_tx: &MergeQueueSender,
7271
) -> anyhow::Result<()> {
7372
if !is_bors_observed_branch(&payload.branch) {
@@ -95,14 +94,17 @@ pub(super) async fn handle_workflow_completed(
9594
db.update_workflow_status(*payload.run_id, payload.status)
9695
.await?;
9796

98-
try_complete_build(repo.as_ref(), db.as_ref(), payload, merge_queue_tx).await
97+
maybe_complete_build(repo.as_ref(), db.as_ref(), payload, merge_queue_tx).await
9998
}
10099

101-
/// Try to complete a pending build.
102-
async fn try_complete_build(
100+
/// Attempt to complete a pending build after a workflow run has been completed.
101+
/// We assume that the status of the completed workflow run has already been updated in the
102+
/// database.
103+
/// We also assume that there is only a single check suite attached to a single build of a commit.
104+
async fn maybe_complete_build(
103105
repo: &RepositoryState,
104106
db: &PgDbClient,
105-
payload: WorkflowCompleted,
107+
payload: WorkflowRunCompleted,
106108
merge_queue_tx: &MergeQueueSender,
107109
) -> anyhow::Result<()> {
108110
if !is_bors_observed_branch(&payload.branch) {
@@ -134,41 +136,50 @@ async fn try_complete_build(
134136
return Ok(());
135137
};
136138

137-
// Ask GitHub what are all the check suites attached to the given commit.
138-
// This tells us for how many workflows we should wait.
139-
let checks = repo
140-
.client
141-
.get_check_suites_for_commit(&payload.branch, &payload.commit_sha)
142-
.await?;
139+
// Load the workflow runs that we know about from the DB. We know about workflow runs for
140+
// which we have received a started or a completed event.
141+
let mut db_workflow_runs = db.get_workflows_for_build(&build).await?;
142+
tracing::debug!("Workflow runs from DB: {db_workflow_runs:?}");
143143

144-
// Some checks are still running, let's wait for the next event
145-
if checks
146-
.iter()
147-
.any(|check| matches!(check.status, CheckSuiteStatus::Pending))
148144
{
149-
tracing::debug!("Some check suites are still pending: {checks:?}");
150-
return Ok(());
145+
// Ask GitHub about all workflow runs attached to the check suite of the completed workflow run.
146+
// This tells us for how many workflow runs we should wait.
147+
// We assume that this number is final, and after a single workflow run has been completed, no
148+
// other workflow runs attached to the check suite can appear out of nowhere.
149+
// Note: we actually only need the number of workflow runs in this function, but we download
150+
// some data about them to have better logging.
151+
let gh_workflow_runs: Vec<WorkflowRun> = repo
152+
.client
153+
.get_workflow_runs_for_check_suite(payload.check_suite_id)
154+
.await?;
155+
tracing::debug!("Workflow runs from GitHub: {gh_workflow_runs:?}");
156+
157+
// This could happen if a workflow run webhook is lost, or if one workflow run manages to finish
158+
// before another workflow run even manages to start. It should be rare.
159+
// We will wait for the next workflow run completed webhook.
160+
if db_workflow_runs.len() < gh_workflow_runs.len() {
161+
tracing::warn!("Workflow count mismatch, waiting for the next webhook");
162+
return Ok(());
163+
}
151164
}
152165

153-
let has_failure = checks
166+
// We have all expected workflow runs in the DB, but some of them are still pending.
167+
// Wait for the next workflow run to be finished.
168+
if db_workflow_runs
154169
.iter()
155-
.any(|check| matches!(check.status, CheckSuiteStatus::Failure));
156-
157-
let mut workflows = db.get_workflows_for_build(&build).await?;
158-
workflows.sort_by(|a, b| a.name.cmp(&b.name));
159-
160-
// If this happens, there is a race condition in GH webhooks and we haven't received a workflow
161-
// finished/failed event for some workflow yet. In this case, wait for that event before
162-
// posting the PR comment.
163-
if workflows.len() < checks.len()
164-
|| workflows
165-
.iter()
166-
.any(|w| w.status == WorkflowStatus::Pending)
170+
.any(|w| w.status == WorkflowStatus::Pending)
167171
{
168-
tracing::warn!("All checks are finished, but some workflows are still pending");
172+
tracing::info!("Some workflows are not finished yet, waiting for the next webhook.");
169173
return Ok(());
170174
}
171175

176+
// Below this point, we assume that the build has completed, because all workflow runs attached
177+
// to the corresponding check suite are completed.
178+
179+
let has_failure = db_workflow_runs
180+
.iter()
181+
.any(|check| matches!(check.status, WorkflowStatus::Failure));
182+
172183
let (status, trigger) = if has_failure {
173184
(BuildStatus::Failure, LabelTrigger::TryBuildFailed)
174185
} else {
@@ -194,12 +205,13 @@ async fn try_complete_build(
194205
}
195206
}
196207

208+
db_workflow_runs.sort_by(|a, b| a.name.cmp(&b.name));
197209
let message = if !has_failure {
198210
tracing::info!("Workflow succeeded");
199-
try_build_succeeded_comment(&workflows, payload.commit_sha, &build)
211+
try_build_succeeded_comment(&db_workflow_runs, payload.commit_sha, &build)
200212
} else {
201213
tracing::info!("Workflow failed");
202-
workflow_failed_comment(&workflows)
214+
workflow_failed_comment(&db_workflow_runs)
203215
};
204216
repo.client.post_comment(pr.number, message).await?;
205217

src/bors/mod.rs

Lines changed: 6 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ pub use command::RollupMode;
77
pub use comment::Comment;
88
pub use context::BorsContext;
99
pub use handlers::{handle_bors_global_event, handle_bors_repository_event};
10-
use octocrab::models::CheckSuiteId;
10+
use octocrab::models::RunId;
1111
use serde::Serialize;
1212

1313
use crate::config::RepositoryConfig;
@@ -25,6 +25,7 @@ mod handlers;
2525
pub mod merge_queue;
2626
pub mod mergeable_queue;
2727

28+
use crate::database::WorkflowStatus;
2829
pub use command::CommandPrefix;
2930

3031
#[cfg(test)]
@@ -39,19 +40,11 @@ pub static WAIT_FOR_PR_STATUS_REFRESH: TestSyncMarker = TestSyncMarker::new();
3940
#[cfg(test)]
4041
pub static WAIT_FOR_WORKFLOW_STARTED: TestSyncMarker = TestSyncMarker::new();
4142

42-
#[derive(Clone, Debug, PartialEq, Eq)]
43-
pub enum CheckSuiteStatus {
44-
Pending,
45-
Failure,
46-
Success,
47-
}
48-
49-
/// A GitHub check suite.
50-
/// Corresponds to a single GitHub actions workflow run, or to a single external CI check run.
43+
/// Corresponds to a single execution of a workflow.
5144
#[derive(Clone, Debug)]
52-
pub struct CheckSuite {
53-
pub id: CheckSuiteId,
54-
pub status: CheckSuiteStatus,
45+
pub struct WorkflowRun {
46+
pub id: RunId,
47+
pub status: WorkflowStatus,
5548
}
5649

5750
/// An access point to a single repository.

src/database/mod.rs

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -387,7 +387,7 @@ pub enum WorkflowType {
387387
}
388388

389389
/// Status of a workflow.
390-
#[derive(Copy, Clone, Debug, PartialEq, sqlx::Type)]
390+
#[derive(Copy, Clone, Debug, PartialEq, Eq, sqlx::Type)]
391391
#[sqlx(type_name = "TEXT")]
392392
#[sqlx(rename_all = "lowercase")]
393393
pub enum WorkflowStatus {
@@ -400,6 +400,7 @@ pub enum WorkflowStatus {
400400
}
401401

402402
/// Represents a workflow run, coming either from Github Actions or from some external CI.
403+
#[derive(Debug)]
403404
pub struct WorkflowModel {
404405
pub id: PrimaryKey,
405406
/// The build this workflow is associated with.

src/github/api/client.rs

Lines changed: 39 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -7,16 +7,17 @@ use std::fmt::Debug;
77
use tracing::log;
88

99
use crate::bors::event::PullRequestComment;
10-
use crate::bors::{CheckSuite, CheckSuiteStatus, Comment};
10+
use crate::bors::{Comment, WorkflowRun};
1111
use crate::config::{CONFIG_FILE_PATH, RepositoryConfig};
12-
use crate::database::RunId;
12+
use crate::database::{RunId, WorkflowStatus};
1313
use crate::github::api::base_github_html_url;
1414
use crate::github::api::operations::{
1515
ForcePush, MergeError, create_check_run, merge_branches, set_branch_to_commit, update_check_run,
1616
};
1717
use crate::github::{CommitSha, GithubRepoName, PullRequest, PullRequestNumber};
1818
use crate::utils::timing::{measure_network_request, perform_network_request_with_retry};
1919
use futures::TryStreamExt;
20+
use octocrab::models::workflows::Run;
2021
use serde::de::DeserializeOwned;
2122

2223
/// Provides access to a single app installation (repository) using the GitHub API.
@@ -200,62 +201,57 @@ impl GithubRepositoryClient {
200201
.await
201202
}
202203

203-
/// Find all check suites attached to the given commit and branch.
204-
pub async fn get_check_suites_for_commit(
204+
/// Find all workflows attached to a specific check suite.
205+
pub async fn get_workflow_runs_for_check_suite(
205206
&self,
206-
branch: &str,
207-
sha: &CommitSha,
208-
) -> anyhow::Result<Vec<CheckSuite>> {
207+
check_suite_id: CheckSuiteId,
208+
) -> anyhow::Result<Vec<WorkflowRun>> {
209209
#[derive(serde::Deserialize, Debug)]
210-
struct CheckSuitePayload {
211-
id: CheckSuiteId,
210+
struct WorkflowRunResponse {
211+
id: octocrab::models::RunId,
212+
status: String,
212213
conclusion: Option<String>,
213-
head_branch: String,
214214
}
215215

216216
#[derive(serde::Deserialize, Debug)]
217-
struct CheckSuiteResponse {
218-
check_suites: Vec<CheckSuitePayload>,
217+
struct WorkflowRunsResponse {
218+
workflow_runs: Vec<WorkflowRunResponse>,
219219
}
220220

221-
perform_network_request_with_retry("get_check_suites_for_commit", || async {
222-
let response: CheckSuiteResponse = self
223-
.get_request(&format!("commits/{}/check-suites", sha.0))
221+
perform_network_request_with_retry("get_workflows_for_check_suite", || async {
222+
// We use a manual query, because octocrab currently doesn't allow filtering by
223+
// check_suite_id when listing workflow runs.
224+
// Note: we don't handle paging here, as we don't expect to get more than 30 workflows
225+
// per check suite.
226+
let response: WorkflowRunsResponse = self
227+
.get_request(&format!("actions/runs?check_suite_id={check_suite_id}"))
224228
.await
225-
.context("Cannot fetch CheckSuiteResponse")?;
229+
.context("Cannot fetch workflow runs for a check suite")?;
226230

227-
let suites = response
228-
.check_suites
229-
.into_iter()
230-
.filter(|suite| suite.head_branch == branch)
231-
.map(|suite| CheckSuite {
232-
id: suite.id,
233-
status: match suite.conclusion {
234-
Some(status) => match status.as_str() {
235-
"success" => CheckSuiteStatus::Success,
236-
"failure" | "neutral" | "cancelled" | "skipped" | "timed_out"
237-
| "action_required" | "startup_failure" | "stale" => {
238-
CheckSuiteStatus::Failure
239-
}
240-
_ => {
241-
tracing::warn!(
242-
"Received unknown check suite conclusion for {}@{sha}: {status}",
243-
self.repo_name
244-
);
245-
CheckSuiteStatus::Pending
246-
}
247-
},
231+
fn get_status(run: &WorkflowRunResponse) -> WorkflowStatus {
232+
match run.status.as_str() {
233+
"completed" => match run.conclusion.as_deref() {
234+
Some("success") => WorkflowStatus::Success,
235+
Some(_) => WorkflowStatus::Failure,
248236
None => {
249-
tracing::warn!(
250-
"Received empty check suite conclusion for {}@{sha}",
251-
self.repo_name,
252-
);
253-
CheckSuiteStatus::Pending
237+
tracing::warn!("Received completed status with empty conclusion for workflow run {}", run.id);
238+
WorkflowStatus::Failure
254239
}
255240
},
241+
"failure" | "startup_failure" => WorkflowStatus::Failure,
242+
_ => WorkflowStatus::Pending
243+
}
244+
}
245+
246+
let runs = response
247+
.workflow_runs
248+
.into_iter()
249+
.map(|run| WorkflowRun {
250+
id: run.id,
251+
status: get_status(&run),
256252
})
257253
.collect();
258-
Ok(suites)
254+
Ok(runs)
259255
})
260256
.await?
261257
}

0 commit comments

Comments
 (0)