Commit 0278b40

Change opuscleaner-mode to 'customs' when generating a config (#1370)
1 parent e48e528 commit 0278b40

2 files changed: +7 -2 lines changed

docs/data-and-cleaning/index.md

Lines changed: 4 additions & 1 deletion
@@ -41,7 +41,7 @@ See examples in the directory.
 
 ### Default configs
 
-Set `opuscleaner-mode: custom` in the training config to use custom per-dataset and per-language pair configs.
+Set `opuscleaner-mode: custom` (this is the default when generating a config) in the training config to use custom per-dataset and per-language pair configs.
 
 If no custom config was specified for the dataset,
 the [default config template](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.
@@ -58,6 +58,9 @@ The config is chosen based on this search order:
 
 The first found config will be applied.
 
+If the desired behaviour is to apply only the default config template and skip all possible custom configs
+for the current language pair and/or datasets, set `opuscleaner-mode: defaults`.
+
 ## Bicleaner
 
 It is recommended to use Bicleaner ML models to filter noisy data.
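
The documentation change above describes two values for `opuscleaner-mode`: `custom` applies the first custom config found and falls back to the default template, while `defaults` skips custom configs entirely. The following is only a minimal sketch of that selection rule, not the pipeline's actual implementation. The function name `choose_filter_config`, its parameters, and the candidate file names are hypothetical; the real search order is not shown in this diff, so the caller supplies it as an ordered list.

```python
from typing import Iterable, Set

# Default template named in the docs (path relative to pipeline/clean/opuscleaner).
DEFAULT_TEMPLATE = "configs/default.filters.json"


def choose_filter_config(
    mode: str,
    custom_candidates: Iterable[str],
    existing_configs: Set[str],
) -> str:
    """Pick the filter config for one dataset (hypothetical helper).

    custom_candidates: custom config paths in search order (assumed input).
    existing_configs: the set of config paths that actually exist.
    """
    if mode not in ("custom", "defaults"):
        raise ValueError(f"unsupported opuscleaner-mode: {mode!r}")

    if mode == "custom":
        # The first custom config found in the search order wins.
        for candidate in custom_candidates:
            if candidate in existing_configs:
                return candidate

    # "defaults" mode, or "custom" mode with no custom config available.
    return DEFAULT_TEMPLATE
```

For example, with the hypothetical candidates `["dataset.filters.json", "pair.filters.json"]` and only `pair.filters.json` present, `custom` mode would select `pair.filters.json`, while `defaults` mode would select the default template regardless.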

taskcluster/configs/config.prod.yml

Lines changed: 3 additions & 1 deletion
@@ -25,7 +25,9 @@ experiment:
   # Use the Opus Cleaner tool on the data cleaning step.
   # https://github.com/hplt-project/OpusCleaner
   # "custom" to use dataset specific configs, "defaults" to use the same default setting for all datasets
-  opuscleaner-mode: "defaults"
+  # been using "defaults" for some time, but now "custom" will be used
+  # as all the custom filters in the pipeline are made to be used in future retrainings
+  opuscleaner-mode: "custom"
 
   # Archive corpora from alignments tasks to GCS
   archive-corpora: true
