Commit 0c43cd6

jobs/build: wait until build-arch takes lock after starting
This solves a potential race between triggering the build-arch and release jobs, where the latter may take the release locks before the build-arch job does. This wasn't likely before (though still theoretically possible) since we did many things between triggering the two jobs. But now, in the new "complete the previous build" path, they're triggered one after the other, so the risk is much higher.

The technique here isn't foolproof. If the job fails early on (e.g. a `git clone` failure), we'll sit there waiting for something that will never happen. To counter this, add a timeout. We don't make it fatal because we still want the semantics of a "best-effort release" to apply.
1 parent 243b659 commit 0c43cd6

1 file changed, 30 additions and 0 deletions
jobs/build.Jenkinsfile

Lines changed: 30 additions & 0 deletions
@@ -1,4 +1,5 @@
 import org.yaml.snakeyaml.Yaml;
+import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
 
 node {
     checkout scm
@@ -518,6 +519,14 @@ def run_multiarch_jobs(arches, src_commit, version, cosa_img) {
                 string(name: 'PIPECFG_HOTFIX_REPO', value: params.PIPECFG_HOTFIX_REPO),
                 string(name: 'PIPECFG_HOTFIX_REF', value: params.PIPECFG_HOTFIX_REF)
             ]
+            // Wait until the locks taken by the `build-arch` jobs are taken
+            // before continuing. This closes a potential race in which once we
+            // trigger the `release` job afterwards, it could end up taking the
+            // locks before the multi-arch jobs.
+            // This really should never take more than 5 minutes. Having a
+            // timeout ensures we don't wait for a long time if we somehow
+            // missed the transition.
+            wait_until_locked_or_continue("release-${version}-${arch}", 5)
         }]}
     }
 }
@@ -538,3 +547,24 @@ def run_release_job(buildID) {
         ]
     }
 }
+
+// XXX: generalize and put in coreos-ci-lib eventually
+def wait_until_locked_or_continue(resource, timeout_mins) {
+    try {
+        timeout(time: timeout_mins, unit: 'MINUTES') {
+            waitUntil {
+                lock(resource: resource, skipIfLocked: true) {
+                    return false
+                }
+                return true
+            }
+        }
+    } catch (FlowInterruptedException e) {
+        // If the lock was still not taken, then something went wrong. For
+        // example, the job might've failed during the initial `git clone`. The
+        // timeout is to ensure we don't wait forever and here we continue to
+        // try to at least release for the arches that did succeed. We may be
+        // able to salvage the failed arch in the next run.
+        echo "Timed out waiting for lock ${resource} to be taken. Continuing..."
+    }
+}
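
The "wait until someone else holds the lock, but give up after a timeout" idea described in the commit message can be sketched outside Jenkins in plain Groovy. This is an illustration only and not the pipeline's code: the `Semaphore` standing in for a lockable resource, the `waitUntilLockedOrContinue` name, the boolean return value, and the polling interval are all assumptions made for the example.

```groovy
import java.util.concurrent.Semaphore
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for a lockable resource: a single-permit semaphore.
// Returns true once another party holds the resource, false on timeout.
boolean waitUntilLockedOrContinue(Semaphore resource, long timeoutMillis, long pollMillis = 50) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis)
    while (System.nanoTime() < deadline) {
        if (resource.tryAcquire()) {
            // We obtained the permit, so nobody else holds the lock yet.
            // Give it back immediately and poll again after a short pause.
            resource.release()
            Thread.sleep(pollMillis)
        } else {
            // Someone else holds the lock: this is the state we were waiting for.
            return true
        }
    }
    // Non-fatal timeout: report it and let the caller carry on (best effort).
    println "Timed out waiting for the resource to be locked. Continuing..."
    return false
}

// Case 1: a worker (playing the role of a build-arch job) takes the lock
// shortly after being started and holds it for a while.
def resource = new Semaphore(1)
def worker = Thread.start {
    Thread.sleep(100)
    resource.acquire()
    Thread.sleep(500)   // hold the lock while "building"
    resource.release()
}
boolean sawLock = waitUntilLockedOrContinue(resource, 2000)
worker.join()

// Case 2: nobody ever takes the lock (e.g. the worker failed early),
// so the wait ends through the timeout instead.
boolean sawLockNever = waitUntilLockedOrContinue(new Semaphore(1), 200)
```

As in the commit, the timeout path is deliberately not an error: the caller learns that the lock was never observed and can still proceed with whatever did succeed.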
