feat: upgrade cpu workers to ubuntu 24.04 generic-worker image #949

bhearsum · 2024-11-27T18:56:53Z

This will remove docker-worker from translations, and use the most recent generic-worker version instead. This halfway unblocks #375 and #466 (the other half is upgrading the GPU workers to this image, which is still in progress).

The other notable part of this is that it moves these workers to a common shared image that will be updated regularly. This means that we'll regularly be picking up things like Taskcluster version upgrades. There are risks to this of course, in that any change to the image can theoretically cause issues with tasks that use it. Over in mozilla-releng/fxci-config#192, I'm working on adding the ability for us to run the training pipeline to help validate new images before they land to help minimize this.

We may also want to consider pinning to a specific image the next time we cut a release branch to minimize risk/change during long training sessions.

(If someone wants to make the case that we should always pin to a specific image rather than a rolling one, we can consider that.)

bhearsum · 2024-12-02T17:29:02Z

This change originally included a forced cache bust to make sure the entire pipeline ran, which I've since removed to avoid taking it on main. That graph is https://firefox-ci-tc.services.mozilla.com/tasks/EyBbWlbmTpCQOT0Lwtp7pw, and confirms that the pipeline runs fine on the new image.

gregtatum

Looks reasonable to me, but I'd like @eu9ene's review as well.

taskcluster/config.yml

This will remove docker-worker from translations, and use the most recent generic-worker version instead.

We switched these workers to new worker pools in mozilla/translations#949. Unless there's some older branches still used for training that haven't picked up this patch, these are now unused.

* Remove num_mismatch filter for Chinese It is removing a lot of sentences that have exactly the same numbers. Seems that is happenning when the Chinese side has the numbers glued to the Chinese characters. * Remove displaystyle in WikiMatrix * Normalize to full-width punctuation * Add ICU tokenizer * Use ICU tokenizer in alignments * Update to OpusTrainer with ICU detokenization support * Update docs * Add pyicu pypi package * Use ICU system package * Add zh-en config * Do not omit the character preceeding the period * Roll back target renaming * Fix config * Disable php * Disable teacher ensemble * Rename experiment * Rename kind * Update configs * Use OpusCleaner with ja support * Fix new lines * Disable hplt * Fix parser * Fix new lines * Do not output empty alignments * Fix parser * Start from teacher * Use gcp standard for teacher * Fix merging * Add teacher decoder setting * Disable hplt * Add start group * Update configs * Disable PHP * Fix config * Switch to student base * Add en-xx CJK configs * Add OpusFilter requirement * Remove hashes * Add autogenerated configs * Fix student id * Add Japanese and Korean tokenizers for Sacrebleu * Add previous groups * Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds (#945) * Fix inference in CODEOWNERS file This fixes an oversight from a previous PR when `inference-engine` was renamed to `inference`, however the path was not updated in `CODEOWNERS`. * Improve eslint string-formatting configuration This is a miscellaneous change to the eslint config that now allows different string types based on whether certain types of quotes need to be escaped within the string. * Add a --force-rebuild flag to WASM build commands This commit adds a --force-rebuild flag to the WASM build commands that will trigger a rebuild without having to fully clobber and start over. * Fix misc. formatting in build-bergamot.py This commit fixes miscellaneous formatting that I noticed looked misaligned in the terminal. For some reason, some emojis need two spaces after them, when other emojis only need one space to achieve the same alignment. * Rename `appendEndingWhitespace` This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because the whitespace logic will be made more complex by this PR, and whitspace is no longer guaranteed to be appended. * Add capability to register languages This commit adds the capability for several of the C++ classes to register either a source language tag or a target language tag (depending on their needs). I had experimented with changing the constructors themselves, but mtaintaining backward compatibility got messy very quickly with native builds continuing to use `ssplit` and WASM builds now using `Intl.Segmenter`. The least-invasive and cleanest-to-implement compromise that I came up with was to add WASM-specific functionality to register the language tags for classes after construction. * Implement WASM segmentation with `Intl.Segmenter` This is the largest commit of the stack, and likely the one to pay the most attention to. In addition to utilizing `Intl.Segmenter` instead of `ssplit` when segmenting text in WASM builds, this patch also necessarily modifies the logic of how whitespce is handled during translations. We now have to concern ourselves with whether the source language and/or target language utilize whitespace between sentences or omit whitespace between sentences. For example: * When translating from Chinese to English, then whitespace must be added between sentences. * When translating from English to Chinese, then whitespace must be removed between sentences. * When translating form Chinese to Japanese, then whitesapce must be inserted between sentences for the English pivot, and then removed for the final output. * Remove WASM dependency on ssplit This commit entirely removes the build dependency on `ssplit` when building the WASM target. This actually ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB. * Bump Bergamot Version 0.4.5 => 0.5.0 * Update WASM Bindings Part 1 of 2 This commit updates the WASM bindings to take the source language and target language tags in order to construct the TranslationModels that now utilize the locale-specific `Intl.Segmenter`. This effectively takes the `LanguageTranslationModelFiles` object and makes that a sub-object of `TranslationModelPayload`, which includes the language tags as well as the files. This hierarchical separation is ideal, because the `LanguageTranslationModelFiles` object is designed to be iterated over and chunked into aligned memory, where as the language tags are plain strings that are distinctly separate in the way that they are handled. * Rework TranslationsEngine to utilize new bindings Part 2 of 2 This commit reworks the TranslationsEngine worker code to utilize the new bindings implemented in the previous commit. * Insert whitespace between full-width punctuation and opening quotes This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context. * Add `zhen` test model files This commit adds our work-in-progress `zhen` model to the repository for use in testing. * Add test cases for testing `zhen` models. This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior. * Add temporary `enzh` models for testing Part 1 of 2 The final two commits of this stack may be slightly controversial. We do not currently have a viable `enzh` model, even for testing purposes, however, I need to test the functionality of removing whitespace between sentences for target languages that require it. This patch adds our `enes` models under the `enzh` directory, which will trick the implementation into translating into "Chinese" with a Spanish output. The key difference is that the Spanish output should not include spaces between sentences, which is, in my opinion, good enough for testing in the interim. * Add makeshift `enzh` tests Part 2 of 2 This patch adds test cases for translating into "Chinese", which at present, is actually a Spanish translation that omits spaces between sentences. * fix: use base repo name for project in pull requests (#954) The rename of this repository has exposed that we rely on fork names to set `project` in PRs. This causes issues in certain circumstances, eg: if we try to fire an action in a PR where the head repository is not named the same, we end up with [the wrong treeherder routes](https://github.com/mozilla/translations/blob/f9010478a45cd40bf7ad3d0aecdd62dd281ec5d6/.taskcluster.yml#L146), which causes scope issues. * fix: set permission for train action (#953) * fix: set permission for train action This used to default to the action name, but changed in taskgraph 11.0. Without this, we try to use the `generic` permission, which doesn't have the scopes needed. * chore: bump taskgraph in poetry requirements * Add pyright to check python types (#952) * Add pyright as a dependency * Add pyright to Taskfile and to CI * Add the utils folder to be type checked * Fix the dependencies for type checking * Fix space omission when translating into CJK languages (#956) This PR fixes an issue where we are checking for equality on the language tag for CJK languages, when we really need to be checking if the tag starts with the language tag. Before this PR `zh-Hans` would not match. After this PR it will match. * Add a warning when train- is the target stage rather than evalute- (#951) * Remove Korean from space-omission languages (#958) This commit removes Korean from the list of languages that will omit space between sentences when being translated into as a target language. * Rollback target renaming (#960) * Rename kind * Roll back target renaming * feat: upgrade cpu workers to ubuntu 24.04 generic-worker image (#949) This will remove docker-worker from translations, and use the most recent generic-worker version instead. * Remove expired-after (#968) * Rewrite the train scripts and add config support for ctranslate2 (#922) * WANDB Test failure * Rename DataDir.load to DataDir.read_text and allow for reading compressed files * Add compress and decompress common utilities * Use decompression utilities everywhere * Re-work the marian-decoder fixture to correctly output nbest * Rewrite translate.sh to python * Add a requirements file for ctranslate2 * Add support for ctranslate2 * Add gpustats to the train requirements * Add logging for translations * Remove old translate scripts * Handle review feedback * Do not output empty alignments (#963) * Switch to ICU tokenizer (#939) * Add ICU tokenizer * Use ICU tokenizer in alignments * Update to OpusTrainer with ICU detokenization support * Update docs * Add pyicu pypi package * Use ICU system package * Strip new lines * Refactor abstract classes * Fix typo * Add todo with issue link for OpusTrainer package * Add test cases from sacremoses * Update to the latest commit * Relock poetry * Rename "all" to "all-pipeline" (#962) * Add better support for reporting training continuation values (#917) * Add a util for local remote settings (#955) * feat: add cron task that runs the minimal training pipeline nightly (#966) * Contributing docs (#973) * Add contributing guide * Add a link from the readme * Fix link formatting * Clarify how to skip a dataset * Fix links (#975) * use a hopefully more reliable mtdata dataset in ci config (#980) The current mtdata dataset here has been unavailable for many days in a row, making it difficult to verify pipeline changes. * fix: ensure cuda toolkit toolchain only includes one version of the toolkit (#982) * Make Marian log parser more reliable (#989) * Update Python packages for CJK support (#990) * Update OpusCleaner * Update OpusTrainer * Update OpusCleaner * Add Japanese and Korean to sacrebleu * Add opusfilter package * Update development docs (#988) * Update development docs * Fix the note on Snakemake * Update model training guide (#984) * Update model training guide * Update docs/training-guide.md Co-authored-by: Greg Tatum <[email protected]> * Clarify language resource classification --------- Co-authored-by: Greg Tatum <[email protected]> * Disable num-mismatch for Japanes (same reason as Chinese) * Normalize Japanese punctuation Only for en->ja * Disable num_mismatch for Korean * Update HPLT importer to 2.0 * Update training configuration * Update docs * Replace scores to floats * Move packages required for PyICU to the base image * Change default cleaning thresholds * Update HPLT tests * Fix naming issue * Add more statistics and refactor * Update and extract tests * Create taskgraph directory if it doesn't exit * Update configs * Fix HPLT Korean locale * Update configs * Update configs * Update configs * Revert "Create taskgraph directory if it doesn't exit" This reverts commit 094cde0. * Reimplement closing * Fix Korean locale * Add CJK one stage config * Update configs * Update config * Fix train action * Experiment with 50 50 split * Rollback back-translated ratio * Fix extract best * Update config * Fix extract best * Update configs * Remove temporary directory * Update configs * Update config * Update config * Filter out Traditional when Chinese is a target language * Filter out Traditional when Chinese is a target language in mono corpus * Fix handling parallel corpus * Update configs * Split vocabs 1 * Update config * Split vocabs 2 * Update config * Fix export * Fix test * Temporarily disable split vocabs for ce-filter scoring * Attempt to fix train action * Split vocabs 1 * Split vocabs 2 * Fix export * Fix test * Add an option for split vocabs * Remove todo * Output a joint vocab * Output a joint vocab * Update docs * Handle tied embeddings * Update OpusTrainer * Use tuple * Fix formatting * Use vocabs content for comparison * Support single vocab for using pretrained models * Fix linter issue * Save shortlist artifacts * Debug marian evals * Update config * Remove uncompressed corpus * Add logging * Revert "Remove uncompressed corpus" This reverts commit c417884. * Revert "Save shortlist artifacts" This reverts commit c34ef8e. * Add extra export and var check * Use artifacts dir * Update config * en-zh-bible-uedin desegmentation fix * Aplly desegmentation in the opposite direction and fix filters.json format * Update configs * Remove shortlist arg * Disable installs * Fix configuration * Update configs * Update config * Update config * Revert changes * Add the student base-memory configuration (#1113) (cherry picked from commit 5d33b26) * Update configs for base memory --------- Co-authored-by: ZJaume <[email protected]> Co-authored-by: Erik Nordin <[email protected]> Co-authored-by: Ben Hearsum (he/him) <[email protected]> Co-authored-by: Greg Tatum <[email protected]>

bhearsum force-pushed the cpu-only-upgrade branch from 01dc536 to db2dc7c Compare December 2, 2024 17:24

bhearsum marked this pull request as ready for review December 2, 2024 17:29

bhearsum requested review from a team as code owners December 2, 2024 17:29

bhearsum requested review from gabrielbusta and removed request for a team and gabrielbusta December 2, 2024 17:29

This was referenced Dec 2, 2024

docker tasks on generic worker sometimes hit issues with caches #538

Closed

CI often fails with "Could not resolve host: github.com" #549

Closed

gregtatum approved these changes Dec 3, 2024

View reviewed changes

gregtatum requested a review from evgenyrp December 3, 2024 15:45

evgenyrp reviewed Dec 13, 2024

View reviewed changes

taskcluster/config.yml Outdated Show resolved Hide resolved

bhearsum force-pushed the cpu-only-upgrade branch from db2dc7c to 0d22108 Compare December 13, 2024 13:59

evgenyrp approved these changes Dec 16, 2024

View reviewed changes

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image

c6a36ed

This will remove docker-worker from translations, and use the most recent generic-worker version instead.

bhearsum force-pushed the cpu-only-upgrade branch from 0d22108 to c6a36ed Compare December 17, 2024 15:15

bhearsum merged commit 70307e3 into mozilla:main Dec 17, 2024
9 checks passed

ZJaume pushed a commit that referenced this pull request Jan 16, 2025

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image (#949)

e61e73c

This will remove docker-worker from translations, and use the most recent generic-worker version instead.

bhearsum mentioned this pull request Mar 25, 2025

chore: remove unused translations non-d2g cpu workers mozilla-releng/fxci-config#324

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image #949

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image #949

Uh oh!

bhearsum commented Nov 27, 2024 •

edited

Loading

Uh oh!

bhearsum commented Dec 2, 2024

Uh oh!

gregtatum left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image #949

feat: upgrade cpu workers to ubuntu 24.04 generic-worker image #949

Uh oh!

Conversation

bhearsum commented Nov 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bhearsum commented Dec 2, 2024

Uh oh!

gregtatum left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bhearsum commented Nov 27, 2024 •

edited

Loading