Custom word2vec static impact on tok2vec settings #12884
-
Hello, how should the respective settings relate to each other? Is one of the word2vec algorithms (CBOW, skip-gram) better suited to spaCy training? My initial word2vec models have a vector_size of 300 and a window of 4. I believe a vector_size of 300 is consistent with en_core_web_lg, but where does window come into play? If my documents tend to be verbose, would widening the window size potentially improve performance? Is there a way to express this in spaCy configs? My base tok2vec settings include the below:

[components.tok2vec.model.encode]
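For context, here is a minimal sketch (assuming the vectors are trained with gensim, which isn't stated above) of where vector_size, window, and the CBOW/skip-gram choice are set, and how the result can be converted for use as spaCy static vectors; the corpus and paths are placeholders:

```python
# Minimal sketch: training custom static vectors with gensim (an assumption)
# and converting them for a spaCy pipeline. Toy corpus and paths are placeholders.
from gensim.models import Word2Vec

corpus = [
    ["this", "is", "a", "short", "example", "sentence"],
    ["another", "example", "sentence", "for", "the", "sketch"],
]

w2v = Word2Vec(
    corpus,
    vector_size=300,  # 300 dimensions, same as the en_core_web_lg vectors
    window=4,         # word2vec context window (affects training only)
    sg=0,             # 0 = CBOW, 1 = skip-gram
    min_count=1,
    workers=4,
)

# Export in word2vec text format so it can be converted for spaCy:
w2v.wv.save_word2vec_format("custom_vectors.txt")

# Then, on the command line (hypothetical paths):
#   python -m spacy init vectors en custom_vectors.txt ./my_vectors
# and point [initialize.vectors] in the training config at ./my_vectors.
```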
-
Some of these settings have similar names (the underlying concepts are similar), but the word2vec settings for the static word vectors are completely separate from the `tok2vec` settings. For `tok2vec`, see: https://spacy.io/api/architectures/#tok2vec-arch

There are a large number of hyperparameters for word2vec and most of them influence each other, so it's hard to give simple advice. We can mainly recommend evaluating with your downstream task. (There are some similarity-related measures that can be used for intrinsic evaluation of word vectors, but they often don't correlate well with the downstream performance on other types of tasks.)
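As a rough orientation, here is a sketch of a config excerpt based on the default `spacy.Tok2Vec.v2` architecture from that page; the numeric values are illustrative defaults, not recommendations. Note that the encoder's `window_size` here is a separate concept from the word2vec `window`:

```ini
# Illustrative excerpt based on the default spacy.Tok2Vec.v2 architecture;
# numeric values are examples only, not tuned recommendations.
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
# Set to true to use the custom static word2vec vectors as extra features
# (the vectors themselves are provided via [initialize.vectors]).
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
# Convolutional window of the encoder over neighboring tokens; unrelated to
# the `window` hyperparameter used when the word2vec vectors were trained.
window_size = 1
maxout_pieces = 3
```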