Read exit code from file for successful Google Batch jobs to avoid intermediate states #6848
Open
thalassemia wants to merge 1 commit into nextflow-io:master from
…ates Signed-off-by: Sean Cheah <cheah_sean@yahoo.com>
When `google.batch.maxSpotAttempts` is set to a value greater than 0, Google Batch handles retrying jobs on VMs that fail with exit code 50001 (spot preemption). While retrying, the job stays in a `RUNNING` state. Once the job finishes, Batch marks the job as `SUCCEEDED`, which triggers the block of code I modified in this PR.

Even though the `getExitCode` function is supposed to read all task exit codes and pick the most recent one, I frequently found that it picks up the 50001 exit code instead of the final exit code for jobs that Batch retried internally due to preemption. This causes the workflow to fail unless the 50001 exit code is also handled in Nextflow, which defeats the purpose of letting Batch handle it. In all these cases, the `.exitcode` file does have the correct final exit code of 0 (the job was successful after all). Thus, to handle this case, I propose always reading from `.exitcode` for successful jobs, as it appears to be more reliable than the Batch API when there is a preemption event.

Example messages in `.nextflow.log`