
It depends a lot on the segmentation. If you segment into single characters, there's probably no point in using floret vectors, since you'd just have vectors for individual characters either way.

But if you segment into longer words, then there could be improvements from using short n-grams with floret. You could try 1-grams or 1-2-grams and see if it helps. I'm not sure how much it would help for syntax, but my initial guess is that it would at least help with NER, in particular for compounds.
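To make the idea concrete, here is a toy sketch of how floret-style vectors work: every subword n-gram is hashed into a fixed table of rows, and a token's vector is the average of the rows for its n-grams. This is NOT the floret library itself; all names, parameters, and the hash function below are invented for illustration.

```python
import hashlib
import numpy as np

# Toy stand-ins for trained parameters; real floret learns the table rows.
BUCKETS = 1000     # size of the shared hash table of vectors
DIM = 16           # vector dimensionality
HASH_COUNT = 2     # each n-gram is hashed into this many rows
rng = np.random.default_rng(0)
table = rng.normal(size=(BUCKETS, DIM))

def char_ngrams(word, minn=1, maxn=2):
    """Character 1- and 2-grams with fastText-style boundary markers."""
    s = f"<{word}>"
    return [s[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(s) - n + 1)]

def bucket(ngram, seed):
    """Map an n-gram (plus a hash seed) to a row index in the table."""
    h = hashlib.md5(f"{seed}:{ngram}".encode("utf-8")).hexdigest()
    return int(h, 16) % BUCKETS

def word_vector(word):
    """Average the hashed rows for all of the word's character n-grams."""
    rows = [table[bucket(ng, seed)]
            for ng in char_ngrams(word)
            for seed in range(HASH_COUNT)]
    return np.mean(rows, axis=0)

# Any segmented token gets a vector, even one never seen in training,
# because it is assembled from hashed 1- and 2-character pieces.
v = word_vector("北京大学")
print(v.shape)  # (16,)
```

The point for Chinese is the last comment: out-of-vocabulary compounds still get vectors built from the character pieces they share with known words, which is why short n-grams might help NER on compounds.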

You could adapt an existing project for your Chinese dataset to try it out:

https://github.com/explosion/projects/tree/v3/pipelines/floret_fi_core_demo

Edited to add: you'd probably need to edit…
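For reference, once floret vectors are imported with `spacy init vectors` (the `--mode floret` flag tells spaCy to load floret-format rows plus hashing metadata rather than a full table), the training config only needs to point at the resulting vectors directory. The path below is a placeholder:

```ini
[initialize]
vectors = "zh_floret_vectors"
```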

Answer selected by polm