Commit 29389c6

Refresh docs (#1378)
* Add filtering notebook
* Update docs
* Minor fixes by Claude
* Clarify deduplication logic
* Clarify which task group ID to use
* Add a reference to the main guide
1 parent b104057 commit 29389c6

25 files changed (+2578 −252 lines)

README.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -26,16 +26,16 @@ An orchestrator is responsible for workflow management and parallelization.
 
 - [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
 It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
-[Usage instructions](docs/training/task-cluster.md).
+[Usage instructions](docs/infrastructure/task-cluster.md).
 - [Snakemake](https://snakemake.github.io/) - a file based orchestrator that allows to run the pipeline locally or on a Slurm cluster.
-[Usage instructions](docs/training/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
+[Usage instructions](docs/infrastructure/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
 
 ## Experiment tracking
 
 [Public training dashboard in Weights & Biases](https://wandb.ai/moz-translations/projects)
 
 Marian training metrics are parsed from logs and published using a custom module within the `tracking` directory.
-More information is available [here](docs/training/tracking.md).
+More information is available [here](docs/infrastructure/tracking.md).
 
 ## Contributing
```

docs/README.md

Lines changed: 47 additions & 11 deletions

```diff
@@ -1,32 +1,68 @@
 # Training Pipeline - mozilla/translations
 
-Training pipelines for Firefox Translations machine translation models.
+Training pipelines and the inference engine for Firefox Translations machine translation models.
 
-The trained models are hosted in [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
-compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
-power the Firefox web page translation starting with version 118.
+The trained models are hosted in a public Google Cloud Storage bucket (see Model Registry [UI](https://mozilla.github.io/translations/model-registry/) and [JSON](https://storage.googleapis.com/moz-fx-translations-data--303e-prod-translations-data/db/models.json)).
+The models are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
+power the Firefox web page translation starting with version 118.
 
 The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
 
-## Training pipeline
+## Pipeline
 
-The pipeline is capable of training a translation model for a language pair end to end.
-Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
+The pipeline is capable of training a translation model for a language pair end to end.
+Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
 Some settings, especially low resource languages might require extra tuning.
 
-We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine .
+We use fast translation engine [Marian](https://marian-nmt.github.io).
 
-## Learning resources
+See [more details about the pipeline steps](training/pipeline-steps.md).
+
+## Orchestrators
+
+An orchestrator is responsible for workflow management and parallelization.
+
+- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
+It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
+[Usage instructions](infrastructure/task-cluster.md).
+- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that allows to run the pipeline locally or on a Slurm cluster.
+[Usage instructions](infrastructure/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
+
+## Experiment tracking
+
+[Public training dashboard in Weights & Biases](https://wandb.ai/moz-translations/projects)
+
+Marian training metrics are parsed from logs and published using a custom module within the `tracking` directory.
+More information is available [here](infrastructure/tracking.md).
+
+## Contributing
+
+Contributions are welcome! See the [documentation on Contributing](contributing/index.md) for more details.
+
+Feel free to ask questions in our Matrix channel [#firefoxtranslations:mozilla.org](https://matrix.to/#/#firefoxtranslations:mozilla.org).
+
+## Useful Links
 
-- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
 - [Model training guide](training/README.md) - practical advice on how to use the pipeline
+- [High level overview post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
+- [The Training Pipeline DAG](https://docs.google.com/presentation/d/1HkypImI_hbA3n1ljU57ZPAzW8PuQqdv2wrXqj688KtQ/edit?slide=id.g3421e8f521e_1_419#slide=id.g3421e8f521e_1_419)
+- [Lightning Talk on the Training Pipeline Overview](https://www.youtube.com/watch?v=TfDEAYCeF6s)
+- [Model registry that shows all trained models](https://mozilla.github.io/translations/model-registry/)
+- [JSON with exported models](https://storage.googleapis.com/moz-fx-translations-data--303e-prod-translations-data/db/models.json)
+- [Models released to Firefox](https://mozilla.github.io/translations/firefox-models/)
+- [Final evaluation results](https://mozilla.github.io/translations/final-evals/)
+- [Running Experiments Dashboard](https://docs.google.com/spreadsheets/d/1Kiz9xUjo2jpeeVGtaL3jA_cLiCiiyz8GvIoQADMyYqo/edit?gid=0#gid=0)
+- Production bucket with models, training corpus, configs etc.: [moz-fx-translations-data--303e-prod-translations-data](https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data) - Uploaded models
+- [Documentation of the Firefox integration](https://firefox-source-docs.mozilla.org/toolkit/components/translations/index.html)
 
 ## Acknowledgements
+
 This project uses materials developed by:
+
 - Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
 - HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
 - OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
-- Many other open source projects and research papers (see [References](README.md#references))
+- Many other open source projects and research papers (see [References](docs/README.md#references))
 
 ## References
```

2 binary image files changed (90 KB and 214 KB), not shown.

docs/contributing/development.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -89,7 +89,7 @@ It allows writing the steps in any language (currently it's historically mostly
 represent the pipeline as a directed acyclic graph (DAG).
 
 The DAG of tasks can be launched using any workflow manager
-(currently we support only [Taskcluster](../training/task-cluster.md). [Snakemake](../training/snakemake.md) integration is unmaintained, but we accept contributions).
+(currently we support only [Taskcluster](../infrastructure/task-cluster.md). [Snakemake](../infrastructure/snakemake.md) integration is unmaintained, but we accept contributions).
 The workflow manager integration code should not include any training specific logic but rather implement it as a script
 in the `pipeline` directory.
 
```

docs/contributing/index.md

Lines changed: 5 additions & 4 deletions

```diff
@@ -44,7 +44,7 @@ One way to do this is by adding the dataset to the `skip_datasets` list, then it
 You can also use OpusCleaner to design custom cleaning rules for a dataset.
 See the examples of custom configs in the [/pipeline/clean/opuscleaner/configs](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs).
 
-See also [documentation about OpusCleaner](https://mozilla.github.io/translations/cleaning.html#opuscleaner)
+See also [documentation about OpusCleaner](https://mozilla.github.io/translations/docs/data-and-cleaning)
 
 The trick is to not filter too much. Unfortunately, it's hard to say how the filters will affect the translation quality without training the model.
 
@@ -58,10 +58,11 @@ Issues labelled as ["help wanted"](https://github.com/mozilla/translations/label
 See [Development docs](development.md) to start with configuring local development environment.
 
 Other ideas:
+
 - Adding support for a new data importer
 - Writing tests (we are far from the full code coverage)
 - Contributing to the tools we use ([OpusCleaner](https://github.com/hplt-project/OpusCleaner), [OpusTrainer](https://github.com/hplt-project/OpusTrainer))
-- Helping to figure out how to run the pipeline locally (Either with Taskcluster, see [this issue](https://github.com/mozilla/translations/issues/403) or with updating [Snakemake](../training/snakemake.md))
+- Helping to figure out how to run the pipeline locally (Either with Taskcluster, see [this issue](https://github.com/mozilla/translations/issues/403) or with updating [Snakemake](../infrastructure/snakemake.md))
 
 ## ML engineers and researchers
 
@@ -83,5 +84,5 @@ reach out to us on Matrix, and we'll consider your request.
 
 The starting point is looking at the [model training guide](../training/README.md).
 Then you can generate training configs locally with configs generator and look at the datasets (it's described in the "Inspecting datasets" section).
-When the config is ready and you have a Taskcluster account, follow the [Taskcluster docs](../training/task-cluster.md) to run training.
-You can monitor the training with the Tascluster UI and see ML charts on [Weights and Biases dashboards](https://wandb.ai/moz-translations/projects).
+When the config is ready and you have a Taskcluster account, follow the [Taskcluster docs](../infrastructure/task-cluster.md) to run training.
+You can monitor the training with the Taskcluster UI and see ML charts on [Weights and Biases dashboards](https://wandb.ai/moz-translations/projects).
```

docs/data-and-cleaning/bicleaner.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -11,6 +11,7 @@ The classifier scores parallel sentences from 0 to 1 where 0 means a very noisy
 If a specialized model for a language pair is not available it will fallback to downloading a multilingual en-xx model.
 
 For supported languages see:
+
 * [Bicleaner AI Releases][ai-releases]
 
 ## How to configure for training
```
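Conceptually, what these scores enable is simple thresholding over sentence pairs. A minimal Python sketch of the idea (the function name and threshold value are illustrative assumptions; the real pipeline invokes the Bicleaner tools on corpus files rather than in-memory lists):

```python
def filter_by_bicleaner_score(pairs, scores, threshold):
    """Keep only sentence pairs whose score meets the threshold.

    Bicleaner scores range from 0 (very noisy translation) to 1 (clean).
    Hypothetical helper for illustration, not the pipeline's actual code.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


kept = filter_by_bicleaner_score(
    pairs=[("Hello", "Hola"), ("asdf1234", "qwerty")],
    scores=[0.92, 0.1],
    threshold=0.5,
)
# kept contains only the first pair
```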

docs/data-and-cleaning/datasets.md

Lines changed: 20 additions & 14 deletions

````diff
@@ -1,6 +1,6 @@
 # Dataset importers
 
-Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/translations/tree/main/configs/config.test.yml).
+Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/translations/blob/main/taskcluster/configs/config.prod.yml).
 
 Example:
 ```
@@ -9,24 +9,30 @@ Example:
 - mtdata_newstest2014_ruen
 ```
 
-Data source | Prefix | Name examples | Type | Comments
---- |------------| --- | ---| ---
-[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
-[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
-[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
-[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
-Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
-[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
-[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS.
-[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v2.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
-Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
+Data source | Prefix | Name examples | Type | Comments
+--- |------------|-----------------------------------------------------------------------------------------------|----------| ---
+[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | parallel | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
+[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | parallel | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
+[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | parallel | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
+[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | parallel | Evaluation dataset from Facebook that supports 100 languages.
+Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | parallel | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
+[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
+[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS.
+[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v3.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
+Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
 
-You can also use [find-corpus](https://github.com/mozilla/translations/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
+You can also use [find-corpus](https://github.com/mozilla/translations/blob/main/utils/find_corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
 
 Set up a local [poetry](https://python-poetry.org/) environment.
-```
+```bash
 task find-corpus -- en ru
 ```
+
+The config generator uses `find-corpus` to generate a training config automatically and include all the available datasets:
+```bash
+task config-generator -- ru en --name test
+```
+
 Make sure to check licenses of the datasets before using them.
 
 ## Adding a new importer
````
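To tie the importer table in this diff together, here is a sketch of how several prefixes could combine in a `datasets` section. The `test` key and `mono-src` key appear elsewhere in these docs; the `train` and `devtest` keys and the specific dataset names are illustrative assumptions that must be checked against the real config and `find-corpus` output:

```yaml
# Hypothetical excerpt of a training config; verify every dataset name
# exists for your language pair (e.g. with `task find-corpus -- en ru`).
datasets:
  train:                      # assumed section name
    - opus_ParaCrawl/v7.1
    - mtdata_newstest2017_ruen
  devtest:                    # assumed section name
    - flores_dev
  test:
    - flores_devtest
    - sacrebleu_wmt20
  mono-src:
    - news-crawl_news.2019
```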

docs/data-and-cleaning/index.md

Lines changed: 58 additions & 3 deletions

```diff
@@ -28,9 +28,9 @@ They will correspond to `opus_...` training datasets in the training pipeline co
 
 Configure cleaning rules for the datasets in the UI.
 
-Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
-`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/` for langauge pair and dataset specific filters
-(such filters will also apply to the opposite langauge pair)
+Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
+`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/` for language pair and dataset specific filters
+(such filters will also apply to the opposite language pair)
 
 or to
 
@@ -51,6 +51,7 @@ The `<src>` and `<trg>` in the template will be automatically replaced with the
 The generated default config will be copied to the target dataset cleaning directory.
 
 The config is chosen based on this search order:
+
 1. Dataset and language specific: `configs/<language-pair>/<dataset>.filter.json`
 2. Language specific: `configs/<language-pair>/default.filter.json`
 3. Dataset specific: `configs/<dataset>.filter.json`
```
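The three-step search order above maps naturally to code. A minimal Python sketch, assuming `configs_dir` is rooted at the `pipeline/clean/opuscleaner/configs/` directory (the function name is hypothetical, and the final default-template fallback, which this hunk elides, is represented only by the `None` return):

```python
from pathlib import Path


def find_filter_config(configs_dir: Path, language_pair: str, dataset: str):
    """Return the first existing OpusCleaner filter config, following
    the documented search order. Illustrative sketch only."""
    candidates = [
        # 1. Dataset and language specific
        configs_dir / language_pair / f"{dataset}.filter.json",
        # 2. Language specific
        configs_dir / language_pair / "default.filter.json",
        # 3. Dataset specific
        configs_dir / f"{dataset}.filter.json",
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # fall back to the generated default template
```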
````diff
@@ -61,7 +62,61 @@ The first found config will be applied.
 If the desired behaviour is to apply only the default config template and skip all possible custom configs
 for the current language pair and/or datasets, set `opuscleaner-mode: defaults`.
 
+### Language codes
+
+OpusCleaner uses many external tools in its filters. It means support of language code schemes for specific tools can differ.
+For some languages it's required to replace `<src>`, `<trg>` in the `filter.json` to the tools specific language codes.
+
+For example, for Chinese Traditional we use:
+
+- Pipeline code: `zh_hant`
+- It maps to OpusCleaner code: `zh_Hant`
+- It is changed in the [OpusCleaner config](https://github.com/mozilla/translations/blob/main/pipeline/clean/opuscleaner/configs/en-zh_hant/default.filters.json) for fasttext filter with openlid-v2 model to `cmn`:
+```json
+{
+    "filter": "fasttext_filter",
+    "parameters": {
+        "FASTTEXT_MODEL_TYPE": "openlid-v2",
+        "LANG1": "eng",
+        "LANG2": "cmn"
+    }
+}
+```
+
+See more details about the supported languages and language code mappings [here](../training/languages.md).
+
 ## Bicleaner
 
 It is recommended to use Bicleaner ML models to filter noisy data.
 See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
+
+
+## Monolingual cleaning
+
+Currently, it does not run OpusCleaner as monolingual filters are not fully supported.
+It runs legacy Bergamot cleaning scripts that include alphabet ratios and fast text.
+Also, it runs [Monocleaner](https://github.com/bitextor/monocleaner) to filter based on fluency.
+
+Monocleaner thresholds can be adjusted in the training config:
+
+```yaml
+# Monocleaner filters sentences in monolingual corpus based on language fluency
+# Use sanitized dataset names for compatibility with Taskcluster (replace ".", "/", ":", "[", "]" to "_")
+monocleaner:
+  mono-src:
+    # News-crawl is typically clean, enable on dataset by dataset basis
+    default-threshold: 0.0
+    dataset-thresholds:
+      # We already filter it by document score, remove only the noisiest segments
+      hplt_mono_v2_0: 0.5
+      # Filter only garbage from NLLB
+      opus_NLLB_v1: 0.5
+  mono-trg:
+    # News-crawl is typically clean, enable on dataset by dataset basis
+    default-threshold: 0.0
+    dataset-thresholds:
+      # We already filter HPLT by document score, so it's relatively clean,
+      # but let's still apply the default threshold for monocleaner to get more fluent target texts for back-translations
+      hplt_mono_v2_0: 0.7
+      # Sentences for back-translations should be in fluent language, apply even more aggressive threshold for NLLB
+      opus_NLLB_v1: 0.8
+```
````
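The sanitization rule quoted in the YAML comment above (replace ".", "/", ":", "[", "]" with "_") is mechanical enough to sketch. The helper name here is hypothetical, not the pipeline's actual function:

```python
def sanitize_dataset_name(name: str) -> str:
    """Sanitize a dataset name for Taskcluster compatibility by
    replacing ".", "/", ":", "[" and "]" with "_" (per the config
    comment). Illustrative sketch only."""
    for char in './:[]':
        name = name.replace(char, '_')
    return name


# "opus_NLLB/v1" becomes the "opus_NLLB_v1" key used in the config:
print(sanitize_dataset_name("opus_NLLB/v1"))    # opus_NLLB_v1
print(sanitize_dataset_name("hplt_mono/v2.0"))  # hplt_mono_v2_0
```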
