Is Bloom Embedding also used for Chinese? #11053
-
I am reading this tutorial on Bloom embeddings: https://explosion.ai/blog/bloom-embeddings. I am thinking about the possibility of using spaCy to train NER models for Chinese. Is the Bloom embedding also used for Chinese? Chinese has no concept of subwords for the most part, and I am just curious how effective the NER algorithm in spaCy could be for Chinese?
-
It depends a lot on the segmentation. If you segment on characters, there's probably no point in having floret vectors, since you'd just have vectors for single characters either way.

But if you segment into longer words, then there could be improvements from using short ngrams with floret. You could try 1-grams or 1-2-grams and see if it helps. I'm not sure how much it would help for syntax, but my initial guess would be that it would at least help with NER, in particular in relation to compounds?

You could adapt an existing project for your Chinese dataset to try it out:
https://github.com/explosion/projects/tree/v3/pipelines/floret_fi_core_demo

Edited to add: you'd probably need to edit the tokenization scripts and configs before training to set the right Chinese tokenizer, making sure that you're using the same one everywhere.
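As a rough illustration of that last point, here is a minimal sketch (not taken from the linked project) of switching the Chinese tokenizer to word segmentation; the "mixed" pkuseg model and the example sentence are just placeholders:

```python
from spacy.lang.zh import Chinese

# Minimal sketch: switch the Chinese tokenizer from the default per-character
# segmentation to word segmentation with pkuseg, so that short floret n-grams
# (1-grams or 1-2-grams) are built over multi-character words.
# Requires the spacy-pkuseg package to be installed.
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = Chinese.from_config(cfg)
nlp.tokenizer.initialize(pkuseg_model="mixed")  # "mixed" is one of the bundled pkuseg models

doc = nlp("我爱自然语言处理。")
print([token.text for token in doc])  # word-segmented tokens rather than single characters
```

The same segmenter setting would then need to appear in the training config and in whatever script tokenizes the corpus used to train the floret vectors, which is the "same one everywhere" point above.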
-
@adrianeboyd Is the current Chinese NER model in spaCy based on PKU segmentation, or no segmentation at all?
-
For the transformer-based models, if bert-base-chinese uses character segmentation, how does it work with an NER model trained on PKU segmentation, which uses word segmentation? Also, according to this tutorial, https://spacy.io/universe/project/video-spacys-ner-model, 'spaCy v2.0's Named Entity Recognition system features a sophisticated word embedding strategy using subword features and "Bloom" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing.' In spaCy v3, the NER system adds a transformer embedding option to replace the CNN, but still uses the transition-based approach to NER. Is that right?
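For what it's worth, one way to see how these pieces fit together is to inspect a packaged transformer pipeline. The sketch below assumes the zh_core_web_trf package (which wraps bert-base-chinese) is installed; the example sentence and printed output are only illustrative:

```python
import spacy

# Assumes: python -m spacy download zh_core_web_trf
nlp = spacy.load("zh_core_web_trf")

# The transformer component embeds wordpieces; downstream components,
# including the transition-based NER, listen to those embeddings via an
# alignment between wordpieces and the tokenizer's tokens.
print(nlp.pipe_names)

doc = nlp("马云在杭州创立了阿里巴巴。")
print([(ent.text, ent.label_) for ent in doc.ents])
print(doc._.trf_data)  # transformer output aligned to the doc's tokens (spacy-transformers)
```

Regardless of whether the tokenizer segments into characters or words, the transformer's wordpiece vectors are pooled back onto the spaCy tokens through that alignment, which is how a character-level BERT can feed a word-segmented, transition-based NER component.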
-
Also, I checked the spaCy model sizes, and all of the models are very small. For example, the NER model:
How does spaCy make them so small?
-
Here, https://spacy.io/models, for the CNN/CPU pipelines, how do they run so fast without a GPU even though a CNN is used? It seems only the transformer pipelines require a GPU.