Ukrainian model proposal #10561

kurnosovv · 2022-03-28T10:31:15Z

@honnibal @adrianeboyd
I created spaCy configs for the Ukrainian language and would like to propose training models from them and adding those models to the official spaCy models registry.

The training configs are:

https://github.com/kurnosovv/ukr-spacy/tree/main/spacy_project (CPU model with pretrained word2vec vectors)
https://github.com/kurnosovv/ukr-spacy/tree/main/spacy_project_trf (transformer-based model).

All training sources are available under the MIT license. The training and evaluation data are silver-standard datasets for the Ukrainian language.

Could you train such models and add them to the official spaCy models registry? In my attempts, training stops much earlier than the specified num_epochs, which seems to lead to lower accuracy than could be achieved in the best case. Maybe you can modify the training procedure or change the training config so that it leads to higher-accuracy models. I can assist and answer any questions.
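One possible explanation for the early stop, not confirmed anywhere in this thread: spaCy's `[training]` config block has a `patience` setting, and training ends once the dev score has not improved for that many steps, regardless of `max_epochs` or `max_steps`. The sketch below is a simplified stand-in for that rule, not spaCy's own code; the function name `stopping_step` and the dev scores are hypothetical, and the numeric settings are illustrative only.

```python
from typing import Iterable, Optional, Tuple


def stopping_step(
    evals: Iterable[Tuple[int, float]],
    patience: int,
    max_steps: int,
) -> Optional[int]:
    """Return the step at which training would stop, or None if it never does.

    `evals` yields (step, dev_score) pairs, one per evaluation.
    A value of 0 for `patience` or `max_steps` disables that limit.
    """
    best_score = float("-inf")
    best_step = 0
    for step, score in evals:
        # Only a strict improvement moves the best step forward.
        if score > best_score:
            best_score, best_step = score, step
        # Early stopping: no improvement for `patience` steps.
        if patience and step - best_step >= patience:
            return step
        # Hard limit on the number of steps.
        if max_steps and step >= max_steps:
            return step
    return None


# Hypothetical dev scores, evaluated every 200 steps, with a plateau after step 600.
evals = [(200, 0.80), (400, 0.85), (600, 0.88)]
evals += [(step, 0.87) for step in range(800, 20001, 200)]

print(stopping_step(evals, patience=1600, max_steps=20000))  # stops on the plateau
print(stopping_step(evals, patience=0, max_steps=20000))     # runs to max_steps
```

If this is the cause, raising `patience` (or setting it to 0) in the config would let training run longer, though a longer run does not guarantee a better score.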

adrianeboyd · 2022-03-28T11:19:24Z

Thanks, it would be great to be able to add Ukrainian pipelines! Can you provide some more information about the sources/citations for the training data? If you'd prefer to discuss it over email, you can contact me at [email protected].

0 replies

kurnosovv · 2022-03-28T13:41:22Z

Author

At the moment, the gold-standard datasets for the Ukrainian language are:

NER: https://github.com/lang-uk/ner-uk
Morphology and syntax: https://universaldependencies.org/treebanks/uk_iu/index.html
Both of them are under CC BY-NC-SA 4.0 licence.

For creating silver-standard data, the following steps were performed:

I trained transformer-based models with quite high accuracy on these datasets.
I used the News subset of the Ukrainian UberText Corpus (described here: https://lang.org.ua/en/corpora/), which contains shuffled sentences extracted from news articles. I took subsamples of ~1M sentences for the training data and ~100K sentences for the validation data.
I applied the transformer-based models to make predictions for the sampled sentences and saved them in CoNLL-U format.
The word2vec vectors are taken from https://lang.org.ua/en/models/ (UberCorpus, 300d, as is).

The resulting dataset is synthetic data stored on my Google Drive (I have not published it anywhere yet).
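As a rough illustration of the prediction-to-CoNLL-U step described above, here is a minimal sketch. It is not the code used for this dataset: `predict` is a hypothetical stand-in that returns canned annotations instead of calling a transformer model, and `to_conllu` is a hypothetical helper. Only the ten-column, tab-separated CoNLL-U layout is taken from the format itself.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def predict(text: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for the trained tagger/parser (canned output)."""
    canned = {
        "Я читаю": [
            {"form": "Я", "lemma": "я", "upos": "PRON", "head": 2, "deprel": "nsubj"},
            {"form": "читаю", "lemma": "читати", "upos": "VERB", "head": 0, "deprel": "root"},
        ]
    }
    return canned[text]


def to_conllu(text: str, tokens: List[Dict[str, object]]) -> str:
    """Serialize one sentence; unspecified columns are written as '_'."""
    lines = [f"# text = {text}"]
    for i, tok in enumerate(tokens, start=1):
        row = {"id": i, **tok}
        lines.append("\t".join(str(row.get(col, "_")) for col in COLUMNS))
    # Sentences in a CoNLL-U file are separated by a blank line.
    return "\n".join(lines) + "\n\n"


sentence = "Я читаю"
print(to_conllu(sentence, predict(sentence)), end="")
```

Applied to each sampled sentence in turn, the concatenated output forms a silver-standard file in the same layout as the gold treebank.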

6 replies

kurnosovv Mar 29, 2022
Author

I will need some time to fix the dataset and publish it. May I use a web dataset (Common Crawl) instead of the news dataset? Does it make any difference?

adrianeboyd Mar 31, 2022

I would be worried that Common Crawl is too noisy to provide good texts for this kind of silver annotation since the models are trained on news texts? And the licensing for the texts is a problem if you want to publish it directly.

kurnosovv Apr 7, 2022
Author

@adrianeboyd I modified the dataset. Instead of the Ukrainian UberText Corpus, the News subset of the Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/en/download/Ukrainian) is now used as the corpus. The annotation process is the same.

The data is published as a Hugging Face dataset: https://huggingface.co/datasets/ukr-models/Ukr-Synth

adrianeboyd Apr 8, 2022

Thanks, can you add a license to the model card?

(As a note, the CoNLL-U data looks fine, but some of the columns in the dataset look like they might not have been configured correctly: upos and xpos appear to be the same data in a different format, and the type of heads and the type/values of deps look questionable.)
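To make the column-typing concern concrete, here is a small sketch of reading CoNLL-U rows into typed fields, with HEAD as an integer and DEPS as a list of head:relation pairs. It does not reflect the schema of the published dataset; `Token`, `parse_sentence`, and `head_problems` are hypothetical names, and the sample sentence is made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    id: int
    form: str
    upos: str
    xpos: str
    head: int                    # index of the governing token, 0 for the root
    deprel: str
    deps: List[Tuple[str, str]]  # enhanced deps as (head, relation) pairs


def parse_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence; assumes HEAD is filled in for every word."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        deps = []
        if cols[8] != "_":
            # Heads stay strings here because empty nodes use ids like "1.1".
            deps = [tuple(pair.split(":", 1)) for pair in cols[8].split("|")]
        tokens.append(Token(int(cols[0]), cols[1], cols[3], cols[4],
                            int(cols[6]), cols[7], deps))
    return tokens


def head_problems(tokens: List[Token]) -> List[str]:
    """Report heads that do not point at the root (0) or at a token in the sentence."""
    n = len(tokens)
    return [f"token {t.id}: head {t.head} outside 0..{n}"
            for t in tokens if not 0 <= t.head <= n]


sample = "\n".join([
    "# text = Я читаю",
    "\t".join(["1", "Я", "я", "PRON", "_", "_", "2", "nsubj", "2:nsubj", "_"]),
    "\t".join(["2", "читаю", "читати", "VERB", "_", "_", "0", "root", "0:root", "_"]),
])
tokens = parse_sentence(sample)
print(tokens[0].head, tokens[0].deps, head_problems(tokens))
```

A check like this only covers types and ranges; whether upos and xpos carry distinct tag sets still has to be verified against the source treebank.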

kurnosovv Apr 18, 2022
Author

License information added
