Bulgarian language pipeline #11957
Replies: 2 comments 1 reply
-
Hello, thanks for working on this pipeline, it'd be great to have support for Bulgarian! Looking over this, is it correct that it has no NER component? We usually don't release a new pipeline without an NER component included. Besides that, one important thing for new pipelines is that we don't just get the built pipeline; we also need the ability to train it ourselves. Do you have code for training the pipeline somewhere public?
-
Hello @polm, thanks for your message. There is still no NER component, as the Bulgarian UD data doesn't include NER tags. I have other datasets for NER, but I would say creating the NER component is a separate task for now. Is it essential for releasing the pipeline, or can it be left for a later stage? I don't have publicly available code yet. I train the pipeline via a Jupyter notebook, so it would be easy to share. I was wondering if you have any requirements regarding the code. All the best,
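For context, training code for spaCy v3 pipelines is usually shared as a `config.cfg` plus the standard CLI commands rather than a notebook. A minimal sketch of that workflow for this treebank; the file and directory names here are illustrative placeholders, and depending on the spaCy version some components (e.g. `trainable_lemmatizer`) may need to be added to the generated config by hand:

```shell
# Convert the UD_Bulgarian-BTB CoNLL-U files to spaCy's binary .spacy format
# (paths are placeholders, not the actual repo layout)
python -m spacy convert bg_btb-ud-train.conllu corpus/ --converter conllu
python -m spacy convert bg_btb-ud-dev.conllu corpus/ --converter conllu

# Generate a baseline config for the trainable components mentioned in this thread
python -m spacy init config config.cfg --lang bg \
    --pipeline tagger,morphologizer,parser,trainable_lemmatizer

# Train, pointing at the converted corpora
python -m spacy train config.cfg \
    --paths.train corpus/bg_btb-ud-train.spacy \
    --paths.dev corpus/bg_btb-ud-dev.spacy \
    --output training/
```

Publishing the config and these commands (for example as a spaCy project) would let others reproduce the training run end to end.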
-
Dear spacy team,
I am a PhD student in Computer Science at Sofia University, Bulgaria, and I'm working on a pretrained language pipeline for spaCy v3. It is based on data from the Bulgarian Universal Dependencies treebank (https://github.com/UniversalDependencies/UD_Bulgarian-BTB) and Bulgarian fastText vectors (https://fasttext.cc/docs/en/pretrained-vectors.html).
The pipeline is close to ready; it just needs some additional hyperparameter search and an update to v3.4.
A conference poster about the project is available here: https://www.researchgate.net/publication/362302569_Language_pipeline_for_Bulgarian_Language/
My aim is to publish a paper on the creation of the pipeline, which will be part of my PhD thesis.
Besides the model, I have updated the language data available for Bulgarian: the stop word list and the token exception lists.
Current results:

| Metric | Score |
| --- | --- |
| TOK | 99.97 |
| TAG | 94.75 |
| POS | 98.12 |
| MORPH | 95.79 |
| LEMMA | 93.88 |
| UAS | 89.95 |
| LAS | 84.77 |
| SENT P | 92.62 |
| SENT R | 96.77 |
| SENT F | 94.65 |
| SPEED | 926 |
Details about the models:
- Type: Core (vocabulary, syntax)
- Size: 40 MB (small), 443 MB (large)
- Genre: news, media
- Components: tok2vec, tagger, morphologizer, sentencizer, parser, trainable_lemmatizer
- Sources: UD Bulgarian BTB (https://github.com/UniversalDependencies/UD_Bulgarian-BTB); fastText vectors for Bulgarian (https://fasttext.cc/docs/en/pretrained-vectors.html, ~2.4 GB)
- Desired license: CC BY-SA 4.0
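The component list above would correspond to an `[nlp]` block along these lines in the pipeline's `config.cfg`. This is a sketch based only on the names listed here, not the actual trained config:

```ini
[nlp]
lang = "bg"
pipeline = ["tok2vec","tagger","morphologizer","sentencizer","parser","trainable_lemmatizer"]
```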
I would be happy to get in contact with you to discuss how to proceed.
All the best,
Melania