Is Bloom Embedding also used for Chinese? #11053
-
I am reading this tutorial on Bloom embeddings: https://explosion.ai/blog/bloom-embeddings. I am thinking about the possibility of using spaCy to train NER models for Chinese. Is the Bloom embedding also used for Chinese? Chinese has no concept of subwords for the most part, and I am just curious how effective the NER algorithm in spaCy could be for Chinese?
-
It depends a lot on the segmentation. If you segment on characters, there's probably no point in having floret vectors, since you'd just have vectors for single characters either way.

But if you segment into longer words, then there could be improvements from using short ngrams with floret. You could try 1-grams or 1-2-grams and see if it helps. I'm not sure how much it would help for syntax, but my initial guess would be that it would at least help with NER, in particular in relation to compounds?

You could adapt an existing project for your Chinese dataset to try it out:
https://github.com/explosion/projects/tree/v3/pipelines/floret_fi_core_demo

Edited to add: you'd probably need to edit the tokenization scripts and configs before training to set the right Chinese tokenizer, making sure that you're using the same one everywhere.
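As a rough illustration of that last point, here is a minimal sketch (not taken from the linked project) of switching the Chinese tokenizer to word segmentation; the "mixed" pkuseg model and the example sentence are just placeholders:

```python
from spacy.lang.zh import Chinese

# Minimal sketch: switch the Chinese tokenizer from the default per-character
# segmentation to word segmentation with pkuseg, so that short floret n-grams
# (1-grams or 1-2-grams) are built over multi-character words.
# Requires the spacy-pkuseg package to be installed.
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = Chinese.from_config(cfg)
nlp.tokenizer.initialize(pkuseg_model="mixed")  # "mixed" is one of the bundled pkuseg models

doc = nlp("我爱自然语言处理。")
print([token.text for token in doc])  # word-segmented tokens rather than single characters
```

The same segmenter setting would then need to appear in the training config and in whatever script tokenizes the corpus used to train the floret vectors, which is the "same one everywhere" point above.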
-
@adrianeboyd Is the current Chinese NER model in spaCy based on PKU segmentation, or no segmentation at all?
-
For the transformer-based models, if bert-base-chinese uses character segmentation, how does it work with an NER model trained on PKU segmentation, which uses word segmentation? Also, according to this tutorial, https://spacy.io/universe/project/video-spacys-ner-model, 'spaCy v2.0's Named Entity Recognition system features a sophisticated word embedding strategy using subword features and "Bloom" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing.' In spaCy v3, the NER system adds a transformer embedding option to replace the CNN, but still uses the transition-based approach to NER. Is that right?
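For what it's worth, one way to see how these pieces fit together is to inspect a packaged transformer pipeline. The sketch below assumes the zh_core_web_trf package (which wraps bert-base-chinese) is installed; the example sentence and printed output are only illustrative:

```python
import spacy

# Assumes: python -m spacy download zh_core_web_trf
nlp = spacy.load("zh_core_web_trf")

# The transformer component embeds wordpieces; downstream components,
# including the transition-based NER, listen to those embeddings via an
# alignment between wordpieces and the tokenizer's tokens.
print(nlp.pipe_names)

doc = nlp("马云在杭州创立了阿里巴巴。")
print([(ent.text, ent.label_) for ent in doc.ents])
print(doc._.trf_data)  # transformer output aligned to the doc's tokens (spacy-transformers)
```

Regardless of whether the tokenizer segments into characters or words, the transformer's wordpiece vectors are pooled back onto the spaCy tokens through that alignment, which is how a character-level BERT can feed a word-segmented, transition-based NER component.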
-
Also, I checked the spaCy model sizes, and all of the models are very small. For example, the NER model:
How does spaCy make them so small?
-
Here, https://spacy.io/models, for the CNN/CPU pipelines, how do they run so fast without a GPU even though a CNN is used? It seems only the transformer pipelines require a GPU.