Hmm, as long as you're using UD French Sequoia v2.5 and the exact same config, that sounds unexpected. Our reported evaluation is on the dev set rather than the test set, so maybe that explains the difference? For that particular corpus I'd be surprised if the splits were so different, but for some UD corpora there are large differences/imbalances between test and the other splits. (We're concerned about repeatedly evaluating on the test sets in case we want to run a clean evaluation for a future publication, so we set the test sets aside and don't use them in our standard training setup.)
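If you want to compare against the reported numbers directly, a rough sketch of how you might score a trained pipeline on the dev split (or, separately, the test split) with spaCy v3's API is below. The paths and pipeline directory are placeholders, and it assumes you've already converted the relevant CoNLL-U file to spaCy's binary format with `spacy convert`.

```python
import spacy
from spacy.training import Corpus

# Placeholder paths: a trained pipeline directory and the UD French Sequoia
# v2.5 dev split converted to .spacy with `spacy convert`.
nlp = spacy.load("training/model-best")
corpus = Corpus("corpus/fr_sequoia-ud-dev.spacy")

# Corpus yields gold-standard Example objects; Language.evaluate scores
# the pipeline's predictions against them.
examples = list(corpus(nlp))
scores = nlp.evaluate(examples)

# Score keys depend on which components are in the pipeline.
print(scores.get("tag_acc"), scores.get("dep_uas"), scores.get("dep_las"))
```

Running the same snippet against the test split instead of the dev split should tell you whether the gap you're seeing comes from the choice of evaluation split rather than from the config or data version.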
