Bulgarian language pipeline #11957
Replies: 2 comments 1 reply
-
Hello, thanks for working on this pipeline, it'd be great to have support for Bulgarian! Looking over this, is it correct that it has no NER component? We usually don't release a new pipeline without an NER component included. Besides that, one important thing for new pipelines is that we don't just get the built pipeline; we also need the ability to train it ourselves. Do you have code for training the pipeline somewhere public?
-
Hello @polm, thanks for your message. There is still no NER component, as the Bulgarian UD data doesn't include NER tags. I have other datasets for NER, but I would say creating the NER component is a separate task for now. Is it essential for releasing the pipeline, or can it be left for a later stage? I don't have publicly available code yet. I train the pipeline via a Jupyter notebook, so it would be easy to share. I was wondering if you have any requirements regarding the code. All the best,
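For context, training code for spaCy v3 pipelines is usually shared as a `config.cfg` plus the standard CLI commands rather than a notebook. A minimal sketch of that workflow for this treebank; the file and directory names here are illustrative placeholders, and depending on the spaCy version some components (e.g. `trainable_lemmatizer`) may need to be added to the generated config by hand:

```shell
# Convert the UD_Bulgarian-BTB CoNLL-U files to spaCy's binary .spacy format
# (paths are placeholders, not the actual repo layout)
python -m spacy convert bg_btb-ud-train.conllu corpus/ --converter conllu
python -m spacy convert bg_btb-ud-dev.conllu corpus/ --converter conllu

# Generate a baseline config for the trainable components mentioned in this thread
python -m spacy init config config.cfg --lang bg \
    --pipeline tagger,morphologizer,parser,trainable_lemmatizer

# Train, pointing at the converted corpora
python -m spacy train config.cfg \
    --paths.train corpus/bg_btb-ud-train.spacy \
    --paths.dev corpus/bg_btb-ud-dev.spacy \
    --output training/
```

Publishing the config and these commands (for example as a spaCy project) would let others reproduce the training run end to end.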
-
Dear spacy team,
I am a PhD student in Computer Science at Sofia University, Bulgaria, and I'm working on a pretrained language pipeline for spaCy v3. It is based on data from the Bulgarian Universal Dependencies treebank (https://github.com/UniversalDependencies/UD_Bulgarian-BTB) and Bulgarian fastText vectors (https://fasttext.cc/docs/en/pretrained-vectors.html).
The pipeline is close to ready; it just needs some additional hyperparameter search and an update to v3.4.
A conference poster about the project is available here: https://www.researchgate.net/publication/362302569_Language_pipeline_for_Bulgarian_Language/
My aim is to publish a paper on the creation of the pipeline, which will be part of my PhD thesis.
Besides the model, I have updated the language data available for Bulgarian: the stop word list and the token exception lists.
Current results:

| Metric | Score |
| --- | --- |
| TOK | 99.97 |
| TAG | 94.75 |
| POS | 98.12 |
| MORPH | 95.79 |
| LEMMA | 93.88 |
| UAS | 89.95 |
| LAS | 84.77 |
| SENT P | 92.62 |
| SENT R | 96.77 |
| SENT F | 94.65 |
| SPEED | 926 |
Details about the models:
- Type: Core (vocabulary, syntax)
- Size: 40 MB (small), 443 MB (large)
- Genre: news, media
- Components: tok2vec, tagger, morphologizer, sentencizer, parser, trainable_lemmatizer
- Sources: UD Bulgarian BTB (https://github.com/UniversalDependencies/UD_Bulgarian-BTB); fastText vectors for Bulgarian (https://fasttext.cc/docs/en/pretrained-vectors.html, ~2.4 GB)
- Desired license: CC BY-SA 4.0
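The component list above would correspond to an `[nlp]` block along these lines in the pipeline's `config.cfg`. This is a sketch based only on the names listed here, not the actual trained config:

```ini
[nlp]
lang = "bg"
pipeline = ["tok2vec","tagger","morphologizer","sentencizer","parser","trainable_lemmatizer"]
```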
I would be happy to get in contact with you to discuss how to proceed.
All the best,
Melania