Commit 29389c6

Refresh docs (#1378)
* Add filtering notebook
* Update docs
* Minor fixes by Claude
* Clarify deduplication logic
* Clarify which task group ID to use
* Add a reference to the main guide
1 parent b104057 commit 29389c6

25 files changed (+2578 −252 lines)

README.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -26,16 +26,16 @@ An orchestrator is responsible for workflow management and parallelization.
 
 - [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
 It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
-[Usage instructions](docs/training/task-cluster.md).
+[Usage instructions](docs/infrastructure/task-cluster.md).
 - [Snakemake](https://snakemake.github.io/) - a file based orchestrator that allows to run the pipeline locally or on a Slurm cluster.
-[Usage instructions](docs/training/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
+[Usage instructions](docs/infrastructure/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
 
 ## Experiment tracking
 
 [Public training dashboard in Weights & Biases](https://wandb.ai/moz-translations/projects)
 
 Marian training metrics are parsed from logs and published using a custom module within the `tracking` directory.
-More information is available [here](docs/training/tracking.md).
+More information is available [here](docs/infrastructure/tracking.md).
 
 ## Contributing
```

docs/README.md

Lines changed: 47 additions & 11 deletions

```diff
@@ -1,32 +1,68 @@
 # Training Pipeline - mozilla/translations
 
-Training pipelines for Firefox Translations machine translation models.
+Training pipelines and the inference engine for Firefox Translations machine translation models.
 
-The trained models are hosted in [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
-compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
-power the Firefox web page translation starting with version 118.
+The trained models are hosted in a public Google Cloud Storage bucket (see Model Registry [UI](https://mozilla.github.io/translations/model-registry/) and [JSON](https://storage.googleapis.com/moz-fx-translations-data--303e-prod-translations-data/db/models.json)).
+The models are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
+power the Firefox web page translation starting with version 118.
 
 The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
 
-## Training pipeline
+## Pipeline
 
-The pipeline is capable of training a translation model for a language pair end to end.
-Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
+The pipeline is capable of training a translation model for a language pair end to end.
+Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
 Some settings, especially low resource languages might require extra tuning.
 
-We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine .
+We use fast translation engine [Marian](https://marian-nmt.github.io).
 
-## Learning resources
+See [more details about the pipeline steps](training/pipeline-steps.md).
+
+## Orchestrators
+
+An orchestrator is responsible for workflow management and parallelization.
+
+- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
+It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
+[Usage instructions](infrastructure/task-cluster.md).
+- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that allows to run the pipeline locally or on a Slurm cluster.
+[Usage instructions](infrastructure/snakemake.md). (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)
+
+## Experiment tracking
+
+[Public training dashboard in Weights & Biases](https://wandb.ai/moz-translations/projects)
+
+Marian training metrics are parsed from logs and published using a custom module within the `tracking` directory.
+More information is available [here](infrastructure/tracking.md).
+
+## Contributing
+
+Contributions are welcome! See the [documentation on Contributing](contributing/index.md) for more details.
+
+Feel free to ask questions in our Matrix channel [#firefoxtranslations:mozilla.org](https://matrix.to/#/#firefoxtranslations:mozilla.org).
+
+## Useful Links
 
-- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
 - [Model training guide](training/README.md) - practical advice on how to use the pipeline
+- [High level overview post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
+- [The Training Pipeline DAG](https://docs.google.com/presentation/d/1HkypImI_hbA3n1ljU57ZPAzW8PuQqdv2wrXqj688KtQ/edit?slide=id.g3421e8f521e_1_419#slide=id.g3421e8f521e_1_419)
+- [Lightning Talk on the Training Pipeline Overview](https://www.youtube.com/watch?v=TfDEAYCeF6s)
+- [Model registry that shows all trained models](https://mozilla.github.io/translations/model-registry/)
+- [JSON with exported models](https://storage.googleapis.com/moz-fx-translations-data--303e-prod-translations-data/db/models.json)
+- [Models released to Firefox](https://mozilla.github.io/translations/firefox-models/)
+- [Final evaluation results](https://mozilla.github.io/translations/final-evals/)
+- [Running Experiments Dashboard](https://docs.google.com/spreadsheets/d/1Kiz9xUjo2jpeeVGtaL3jA_cLiCiiyz8GvIoQADMyYqo/edit?gid=0#gid=0)
+- Production bucket with models, training corpus, configs etc.: [moz-fx-translations-data--303e-prod-translations-data](https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data) - Uploaded models
+- [Documentation of the Firefox integration](https://firefox-source-docs.mozilla.org/toolkit/components/translations/index.html)
 
 ## Acknowledgements
+
 This project uses materials developed by:
+
 - Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
 - HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
 - OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
-- Many other open source projects and research papers (see [References](README.md#references))
+- Many other open source projects and research papers (see [References](docs/README.md#references))
 
 ## References
```

2 binary image files changed (90 KB and 214 KB), not shown.

docs/contributing/development.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -89,7 +89,7 @@ It allows writing the steps in any language (currently it's historically mostly
 represent the pipeline as a directed acyclic graph (DAG).
 
 The DAG of tasks can be launched using any workflow manager
-(currently we support only [Taskcluster](../training/task-cluster.md). [Snakemake](../training/snakemake.md) integration is unmaintained, but we accept contributions).
+(currently we support only [Taskcluster](../infrastructure/task-cluster.md). [Snakemake](../infrastructure/snakemake.md) integration is unmaintained, but we accept contributions).
 The workflow manager integration code should not include any training specific logic but rather implement it as a script
 in the `pipeline` directory.
 
```

docs/contributing/index.md

Lines changed: 5 additions & 4 deletions

```diff
@@ -44,7 +44,7 @@ One way to do this is by adding the dataset to the `skip_datasets` list, then it
 You can also use OpusCleaner to design custom cleaning rules for a dataset.
 See the examples of custom configs in the [/pipeline/clean/opuscleaner/configs](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs).
 
-See also [documentation about OpusCleaner](https://mozilla.github.io/translations/cleaning.html#opuscleaner)
+See also [documentation about OpusCleaner](https://mozilla.github.io/translations/docs/data-and-cleaning)
 
 The trick is to not filter too much. Unfortunately, it's hard to say how the filters will affect the translation quality without training the model.
 
@@ -58,10 +58,11 @@ Issues labelled as ["help wanted"](https://github.com/mozilla/translations/label
 See [Development docs](development.md) to start with configuring local development environment.
 
 Other ideas:
+
 - Adding support for a new data importer
 - Writing tests (we are far from the full code coverage)
 - Contributing to the tools we use ([OpusCleaner](https://github.com/hplt-project/OpusCleaner), [OpusTrainer](https://github.com/hplt-project/OpusTrainer))
-- Helping to figure out how to run the pipeline locally (Either with Taskcluster, see [this issue](https://github.com/mozilla/translations/issues/403) or with updating [Snakemake](../training/snakemake.md))
+- Helping to figure out how to run the pipeline locally (Either with Taskcluster, see [this issue](https://github.com/mozilla/translations/issues/403) or with updating [Snakemake](../infrastructure/snakemake.md))
 
 ## ML engineers and researchers
 
@@ -83,5 +84,5 @@ reach out to us on Matrix, and we'll consider your request.
 
 The starting point is looking at the [model training guide](../training/README.md).
 Then you can generate training configs locally with configs generator and look at the datasets (it's described in the "Inspecting datasets" section).
-When the config is ready and you have a Taskcluster account, follow the [Taskcluster docs](../training/task-cluster.md) to run training.
-You can monitor the training with the Tascluster UI and see ML charts on [Weights and Biases dashboards](https://wandb.ai/moz-translations/projects).
+When the config is ready and you have a Taskcluster account, follow the [Taskcluster docs](../infrastructure/task-cluster.md) to run training.
+You can monitor the training with the Taskcluster UI and see ML charts on [Weights and Biases dashboards](https://wandb.ai/moz-translations/projects).
```

docs/data-and-cleaning/bicleaner.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -11,6 +11,7 @@ The classifier scores parallel sentences from 0 to 1 where 0 means a very noisy
 If a specialized model for a language pair is not available it will fallback to downloading a multilingual en-xx model.
 
 For supported languages see:
+
 * [Bicleaner AI Releases][ai-releases]
 
 ## How to configure for training
```
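Conceptually, what these scores enable is simple thresholding over sentence pairs. A minimal Python sketch of the idea (the function name and threshold value are illustrative assumptions; the real pipeline invokes the Bicleaner tools on corpus files rather than in-memory lists):

```python
def filter_by_bicleaner_score(pairs, scores, threshold):
    """Keep only sentence pairs whose score meets the threshold.

    Bicleaner scores range from 0 (very noisy translation) to 1 (clean).
    Hypothetical helper for illustration, not the pipeline's actual code.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


kept = filter_by_bicleaner_score(
    pairs=[("Hello", "Hola"), ("asdf1234", "qwerty")],
    scores=[0.92, 0.1],
    threshold=0.5,
)
# kept contains only the first pair
```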

docs/data-and-cleaning/datasets.md

Lines changed: 20 additions & 14 deletions

````diff
@@ -1,6 +1,6 @@
 # Dataset importers
 
-Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/translations/tree/main/configs/config.test.yml).
+Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/translations/blob/main/taskcluster/configs/config.prod.yml).
 
 Example:
 ```
@@ -9,24 +9,30 @@ Example:
 - mtdata_newstest2014_ruen
 ```
 
-Data source | Prefix | Name examples | Type | Comments
---- |------------| --- | ---| ---
-[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
-[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
-[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
-[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
-Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
-[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
-[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS.
-[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v2.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
-Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
+Data source | Prefix | Name examples | Type | Comments
+--- |------------|-----------------------------------------------------------------------------------------------|----------| ---
+[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | parallel | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
+[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | parallel | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
+[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | parallel | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
+[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | parallel | Evaluation dataset from Facebook that supports 100 languages.
+Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | parallel | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
+[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
+[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS.
+[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v3.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
+Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
 
-You can also use [find-corpus](https://github.com/mozilla/translations/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
+You can also use [find-corpus](https://github.com/mozilla/translations/blob/main/utils/find_corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
 
 Set up a local [poetry](https://python-poetry.org/) environment.
-```
+```bash
 task find-corpus -- en ru
 ```
+
+The config generator uses `find-corpus` to generate a training config automatically and include all the available datasets:
+```bash
+task config-generator -- ru en --name test
+```
+
 Make sure to check licenses of the datasets before using them.
 
 ## Adding a new importer
````
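To tie the importer table in this diff together, here is a sketch of how several prefixes could combine in a `datasets` section. The `test` key and `mono-src` key appear elsewhere in these docs; the `train` and `devtest` keys and the specific dataset names are illustrative assumptions that must be checked against the real config and `find-corpus` output:

```yaml
# Hypothetical excerpt of a training config; verify every dataset name
# exists for your language pair (e.g. with `task find-corpus -- en ru`).
datasets:
  train:                      # assumed section name
    - opus_ParaCrawl/v7.1
    - mtdata_newstest2017_ruen
  devtest:                    # assumed section name
    - flores_dev
  test:
    - flores_devtest
    - sacrebleu_wmt20
  mono-src:
    - news-crawl_news.2019
```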

docs/data-and-cleaning/index.md

Lines changed: 58 additions & 3 deletions

```diff
@@ -28,9 +28,9 @@ They will correspond to `opus_...` training datasets in the training pipeline co
 
 Configure cleaning rules for the datasets in the UI.
 
-Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
-`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/` for langauge pair and dataset specific filters
-(such filters will also apply to the opposite langauge pair)
+Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
+`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/` for language pair and dataset specific filters
+(such filters will also apply to the opposite language pair)
 
 or to
 
@@ -51,6 +51,7 @@ The `<src>` and `<trg>` in the template will be automatically replaced with the
 The generated default config will be copied to the target dataset cleaning directory.
 
 The config is chosen based on this search order:
+
 1. Dataset and language specific: `configs/<language-pair>/<dataset>.filter.json`
 2. Language specific: `configs/<language-pair>/default.filter.json`
 3. Dataset specific: `configs/<dataset>.filter.json`
```
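The three-step search order above maps naturally to code. A minimal Python sketch, assuming `configs_dir` is rooted at the `pipeline/clean/opuscleaner/configs/` directory (the function name is hypothetical, and the final default-template fallback, which this hunk elides, is represented only by the `None` return):

```python
from pathlib import Path


def find_filter_config(configs_dir: Path, language_pair: str, dataset: str):
    """Return the first existing OpusCleaner filter config, following
    the documented search order. Illustrative sketch only."""
    candidates = [
        # 1. Dataset and language specific
        configs_dir / language_pair / f"{dataset}.filter.json",
        # 2. Language specific
        configs_dir / language_pair / "default.filter.json",
        # 3. Dataset specific
        configs_dir / f"{dataset}.filter.json",
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # fall back to the generated default template
```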
````diff
@@ -61,7 +62,61 @@ The first found config will be applied.
 If the desired behaviour is to apply only the default config template and skip all possible custom configs
 for the current language pair and/or datasets, set `opuscleaner-mode: defaults`.
 
+### Language codes
+
+OpusCleaner uses many external tools in its filters. It means support of language code schemes for specific tools can differ.
+For some languages it's required to replace `<src>`, `<trg>` in the `filter.json` to the tools specific language codes.
+
+For example, for Chinese Traditional we use:
+
+- Pipeline code: `zh_hant`
+- It maps to OpusCleaner code: `zh_Hant`
+- It is changed in the [OpusCleaner config](https://github.com/mozilla/translations/blob/main/pipeline/clean/opuscleaner/configs/en-zh_hant/default.filters.json) for fasttext filter with openlid-v2 model to `cmn`:
+```json
+{
+    "filter": "fasttext_filter",
+    "parameters": {
+        "FASTTEXT_MODEL_TYPE": "openlid-v2",
+        "LANG1": "eng",
+        "LANG2": "cmn"
+    }
+}
+```
+
+See more details about the supported languages and language code mappings [here](../training/languages.md).
+
 ## Bicleaner
 
 It is recommended to use Bicleaner ML models to filter noisy data.
 See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
+
+
+## Monolingual cleaning
+
+Currently, it does not run OpusCleaner as monolingual filters are not fully supported.
+It runs legacy Bergamot cleaning scripts that include alphabet ratios and fast text.
+Also, it runs [Monocleaner](https://github.com/bitextor/monocleaner) to filter based on fluency.
+
+Monocleaner thresholds can be adjusted in the training config:
+
+```yaml
+# Monocleaner filters sentences in monolingual corpus based on language fluency
+# Use sanitized dataset names for compatibility with Taskcluster (replace ".", "/", ":", "[", "]" to "_")
+monocleaner:
+  mono-src:
+    # News-crawl is typically clean, enable on dataset by dataset basis
+    default-threshold: 0.0
+    dataset-thresholds:
+      # We already filter it by document score, remove only the noisiest segments
+      hplt_mono_v2_0: 0.5
+      # Filter only garbage from NLLB
+      opus_NLLB_v1: 0.5
+  mono-trg:
+    # News-crawl is typically clean, enable on dataset by dataset basis
+    default-threshold: 0.0
+    dataset-thresholds:
+      # We already filter HPLT by document score, so it's relatively clean,
+      # but let's still apply the default threshold for monocleaner to get more fluent target texts for back-translations
+      hplt_mono_v2_0: 0.7
+      # Sentences for back-translations should be in fluent language, apply even more aggressive threshold for NLLB
+      opus_NLLB_v1: 0.8
+```
````
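The sanitization rule quoted in the YAML comment above (replace ".", "/", ":", "[", "]" with "_") is mechanical enough to sketch. The helper name here is hypothetical, not the pipeline's actual function:

```python
def sanitize_dataset_name(name: str) -> str:
    """Sanitize a dataset name for Taskcluster compatibility by
    replacing ".", "/", ":", "[" and "]" with "_" (per the config
    comment). Illustrative sketch only."""
    for char in './:[]':
        name = name.replace(char, '_')
    return name


# "opus_NLLB/v1" becomes the "opus_NLLB_v1" key used in the config:
print(sanitize_dataset_name("opus_NLLB/v1"))    # opus_NLLB_v1
print(sanitize_dataset_name("hplt_mono/v2.0"))  # hplt_mono_v2_0
```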
