Commit 546a07b

feat: automatically retry translate tasks when they can't find a GPU (#1058)
I had a translate task fail recently with:
```
[task 2025-03-08T05:47:06.300Z] + python3 /builds/worker/checkouts/vcs/pipeline/translate/translate.py --input /builds/worker/fetches/file.5.zst --models_glob '/builds/worker/fetches/*.npz' '/builds/worker/fetches/model*/*.npz' --artifacts /builds/worker/artifacts --vocab /builds/worker/fetches/vocab.spm --marian_dir '$MOZ_FETCHES_DIR' --gpus '0 1 2 3' --workspace 12000 --decoder ctranslate2 -- --maxi-batch 10000 --mini-batch-words 5000
[task 2025-03-08T05:47:06.633Z] [translate] Input file: /builds/worker/fetches/file.5.zst
[task 2025-03-08T05:47:06.633Z] [translate] Output file: /builds/worker/artifacts/file.5.out.zst
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] Converting the Marian model to Ctranslate2:
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] /builds/worker/fetches/model1/final.model.npz.best-chrf.npz
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] Outputing model to:
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] /builds/worker/fetches/model1/final.model.npz.best-chrf
[task 2025-03-08T05:47:06.758Z] [translate_ctranslate2] Loading vocab:
[task 2025-03-08T05:47:06.758Z] [translate_ctranslate2] /builds/worker/fetches/vocab.spm
[task 2025-03-08T05:47:12.792Z] Traceback (most recent call last):
[task 2025-03-08T05:47:12.792Z]   File "/builds/worker/checkouts/vcs/pipeline/translate/translate.py", line 256, in <module>
[task 2025-03-08T05:47:12.792Z]     main()
[task 2025-03-08T05:47:12.792Z]   File "/builds/worker/checkouts/vcs/pipeline/translate/translate.py", line 187, in main
[task 2025-03-08T05:47:12.792Z]     translate_with_ctranslate2(
[task 2025-03-08T05:47:12.792Z]   File "/builds/worker/checkouts/vcs/pipeline/translate/translate_ctranslate2.py", line 156, in translate_with_ctranslate2
[task 2025-03-08T05:47:12.792Z]     translator = ctranslate2.Translator(
[task 2025-03-08T05:47:12.792Z] RuntimeError: CUDA failed with error no CUDA-capable device is detected
```
We can and should automatically retry for these cases. This is similar to the work done in #722 for bicleaner.
1 parent f79917d commit 546a07b

4 files changed: +16 -4 lines changed

pipeline/translate/translate.py

Lines changed: 10 additions & 1 deletion
@@ -7,6 +7,7 @@
 from glob import glob
 import os
 from pathlib import Path
+import sys
 import tempfile
 
 from pipeline.common.command_runner import apply_command_args, run_command
@@ -253,4 +254,12 @@ def main() -> None:
 
 
 if __name__ == "__main__":
-    main()
+    try:
+        main()
+    except RuntimeError as e:
+        # On GCP instances, we occasionally find that a GPU is not found even
+        # when it has been requested. Exiting with a unique error code in these
+        # cases allows us to automatically retry such tasks in Taskcluster.
+        if len(e.args) > 0 and "no CUDA-capable device is detected" in e.args[0]:
+            logger.exception("couldn't find GPU, exiting with 9002")
+            sys.exit(9002)
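
The pattern in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: `NO_GPU_EXIT_CODE`, `NO_GPU_MESSAGE`, `is_missing_gpu_error`, and `run_with_gpu_check` are hypothetical names introduced for illustration, and `print` stands in for the module's `logger`. Unlike the hunk as shown, the sketch re-raises any `RuntimeError` that does not match the CUDA message, on the assumption that unrelated failures should still fail the task rather than fall through the `except` block.

```python
import sys

# Exit code and message text are taken from the commit; the names are hypothetical.
NO_GPU_EXIT_CODE = 9002
NO_GPU_MESSAGE = "no CUDA-capable device is detected"


def is_missing_gpu_error(error: RuntimeError) -> bool:
    """Return True when the error message matches the CUDA 'no device' failure."""
    return len(error.args) > 0 and NO_GPU_MESSAGE in str(error.args[0])


def run_with_gpu_check(entry_point) -> None:
    """Run entry_point, turning a missing-GPU failure into a dedicated exit code."""
    try:
        entry_point()
    except RuntimeError as error:
        if is_missing_gpu_error(error):
            print(
                f"couldn't find GPU, exiting with {NO_GPU_EXIT_CODE}",
                file=sys.stderr,
            )
            sys.exit(NO_GPU_EXIT_CODE)
        # Assumption: any other RuntimeError should propagate and fail the task.
        raise
```

Calling `run_with_gpu_check(main)` under the `if __name__ == "__main__":` guard would mirror the structure of the hunk above.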

taskcluster/kinds/translate-corpus/kind.yml

Lines changed: 2 additions & 1 deletion
@@ -103,7 +103,8 @@ tasks:
 CUDNN_DIR: fetches/cuda-toolkit
 MARIAN: $MOZ_FETCHES_DIR
 # 128 happens when cloning this repository fails
-retry-exit-status: [128]
+# 9002 happens if no GPU is attached
+retry-exit-status: [128, 9002]
 
 run:
 using: run-task

taskcluster/kinds/translate-mono-src/kind.yml

Lines changed: 2 additions & 1 deletion
@@ -103,7 +103,8 @@ tasks:
 CUDNN_DIR: fetches/cuda-toolkit
 MARIAN: $MOZ_FETCHES_DIR
 # 128 happens when cloning this repository fails
-retry-exit-status: [128]
+# 9002 happens if no GPU is attached
+retry-exit-status: [128, 9002]
 
 marian-args:
 from-parameters: training_config.marian-args.decoding-teacher

taskcluster/kinds/translate-mono-trg/kind.yml

Lines changed: 2 additions & 1 deletion
@@ -104,7 +104,8 @@ tasks:
 CUDNN_DIR: fetches/cuda-toolkit
 MARIAN: $MOZ_FETCHES_DIR
 # 128 happens when cloning this repository fails
-retry-exit-status: [128]
+# 9002 happens if no GPU is attached
+retry-exit-status: [128, 9002]
 
 # Don't run unless explicitly scheduled
 run-on-tasks-for: []
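
Since `retry-exit-status` matches on the status the worker observes, it may be worth checking what a parent process actually sees for a given `sys.exit` value. The sketch below is an illustration only; `observed_exit_status` is a hypothetical helper, not part of this repository. On POSIX systems only the low 8 bits of an exit status reach the parent, so a requested code of 9002 would be observed as `9002 & 0xFF`, which is 42. How the Taskcluster worker compares the observed status against the configured list is not shown in this commit.

```python
import subprocess
import sys


def observed_exit_status(requested_code: int) -> int:
    """Run a child Python process that exits with requested_code.

    Returns the exit status as seen by the parent process.
    """
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({requested_code})"],
        check=False,
    )
    return child.returncode


if __name__ == "__main__":
    for code in (128, 9002):
        print(f"requested {code}, observed {observed_exit_status(code)}")
```

Running this on the worker image would show which value should be listed in `retry-exit-status`.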
