What is the best way to add new words to model? #9714

info2000 · 2021-11-21T17:09:28Z

I'm using the es_core_news_lg model for a textcat task.
In the evaluation results, I got this:
ℹ 500000 vectors (500000 unique keys, 300 dimensions)
⚠ **70394 words in training data without vectors (3%)**
10 most common words without vectors: ' ' (19156), ' ' (5907), ' ' (778), 'Covid-19' (505), ' ' (331), ' ' (244), 'Wwwhatsnew.com' (240), 'Invítanos' (180), 'leyéndonos!La' (180), 'Covid' (172)

What is the best way to vectorize new words?

Thanks

Answered by ljvmiranda921

Nov 22, 2021

Hi @info2000,

> What is the best way to vectorize new words?

To do this, you need to train the vectors. However, at this stage, a 3% out-of-vocabulary rate in the training data is still low. You might want to first check why there are so many whitespace tokens in your dataset (note that the top 3 most common "words" without vectors are all whitespace). Perhaps clean the data a bit more and see how it works :)
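As a rough illustration of the cleanup step, here is a minimal sketch that collapses runs of whitespace before the texts are converted to training data. It assumes the whitespace tokens come from repeated spaces, tabs, newlines, or non-breaking spaces in the raw text, which may not be the actual cause in your dataset. The `clean_text` helper and the sample string are hypothetical and use only the Python standard library.

```python
import re

# Matches any run of Unicode whitespace: spaces, tabs, newlines,
# non-breaking spaces, and so on.
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Collapse each whitespace run to one space and trim both ends.

    Hypothetical helper, not part of spaCy.
    """
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Made-up sample with double spaces, a non-breaking space, and newlines.
raw = "Covid-19  en\u00a0España\n\n\t¡Gracias por  leernos! "
print(clean_text(raw))  # Covid-19 en España ¡Gracias por leernos!
```

Running something like this over each text before creating the training and dev data, and then re-checking the "words without vectors" report, should show whether the whitespace entries disappear. It does not repair missing spaces after punctuation, as in 'leyéndonos!La'.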


