Commit 546a07b
feat: automatically retry translate tasks when they can't find a GPU (#1058)
I had a translate task fail recently with:
```
[task 2025-03-08T05:47:06.300Z] + python3 /builds/worker/checkouts/vcs/pipeline/translate/translate.py --input /builds/worker/fetches/file.5.zst --models_glob '/builds/worker/fetches/*.npz' '/builds/worker/fetches/model*/*.npz' --artifacts /builds/worker/artifacts --vocab /builds/worker/fetches/vocab.spm --marian_dir '$MOZ_FETCHES_DIR' --gpus '0 1 2 3' --workspace 12000 --decoder ctranslate2 -- --maxi-batch 10000 --mini-batch-words 5000
[task 2025-03-08T05:47:06.633Z] [translate] Input file: /builds/worker/fetches/file.5.zst
[task 2025-03-08T05:47:06.633Z] [translate] Output file: /builds/worker/artifacts/file.5.out.zst
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] Converting the Marian model to Ctranslate2:
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] /builds/worker/fetches/model1/final.model.npz.best-chrf.npz
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] Outputing model to:
[task 2025-03-08T05:47:06.750Z] [translate_ctranslate2] /builds/worker/fetches/model1/final.model.npz.best-chrf
[task 2025-03-08T05:47:06.758Z] [translate_ctranslate2] Loading vocab:
[task 2025-03-08T05:47:06.758Z] [translate_ctranslate2] /builds/worker/fetches/vocab.spm
[task 2025-03-08T05:47:12.792Z] Traceback (most recent call last):
[task 2025-03-08T05:47:12.792Z] File "/builds/worker/checkouts/vcs/pipeline/translate/translate.py", line 256, in <module>
[task 2025-03-08T05:47:12.792Z] main()
[task 2025-03-08T05:47:12.792Z] File "/builds/worker/checkouts/vcs/pipeline/translate/translate.py", line 187, in main
[task 2025-03-08T05:47:12.792Z] translate_with_ctranslate2(
[task 2025-03-08T05:47:12.792Z] File "/builds/worker/checkouts/vcs/pipeline/translate/translate_ctranslate2.py", line 156, in translate_with_ctranslate2
[task 2025-03-08T05:47:12.792Z] translator = ctranslate2.Translator(
[task 2025-03-08T05:47:12.792Z] RuntimeError: CUDA failed with error no CUDA-capable device is detected
```
We can and should automatically retry for these cases. This is similar to the work done in #722 for bicleaner.

1 parent f79917d commit 546a07b
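The retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the exit code `17`, the constant names, and the helper functions `exit_code_for` and `run_with_gpu_retry_signal` are all hypothetical. The only detail taken from the log is the error text `no CUDA-capable device is detected`, which arrives as a `RuntimeError`.

```python
import sys
from typing import Callable, Optional

# Hypothetical exit code reserved for "no GPU found"; the value used by the
# actual commit is not visible here.
NO_GPU_EXIT_CODE = 17

# Error text taken from the failing task log above.
CUDA_MISSING_MESSAGE = "no CUDA-capable device is detected"


def exit_code_for(error: BaseException) -> Optional[int]:
    """Return the retry exit code for the missing-GPU failure, else None."""
    if isinstance(error, RuntimeError) and CUDA_MISSING_MESSAGE in str(error):
        return NO_GPU_EXIT_CODE
    return None


def run_with_gpu_retry_signal(main: Callable[[], None]) -> None:
    """Run main() and exit with NO_GPU_EXIT_CODE if no GPU was visible."""
    try:
        main()
    except RuntimeError as error:
        code = exit_code_for(error)
        if code is None:
            # Unrelated failure: keep the normal traceback and exit status.
            raise
        print(
            f"No GPU detected, exiting with {code} so the task can be retried",
            file=sys.stderr,
        )
        sys.exit(code)
```

A script would call `run_with_gpu_retry_signal(main)` in place of a bare `main()`. The task definition then has to treat that exit code as retryable, for example through Taskgraph's `retry-exit-status` worker setting. Whether this commit uses exactly that mechanism is an assumption, since the diff content is not shown.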
File tree
4 files changed, +16 -4 lines changed
- pipeline/translate
- taskcluster/kinds
  - translate-corpus
  - translate-mono-src
  - translate-mono-trg
Diff (the changed line content was not captured; only the line positions remain, listed in page order):
- File 1: one line added at new line 10; old line 256 replaced by new lines 257-265 (+10 -1).
- File 2: old line 106 replaced by new lines 106-107 (+2 -1).
- File 3: old line 106 replaced by new lines 106-107 (+2 -1).
- File 4: old line 107 replaced by new lines 107-108 (+2 -1).