Basic question about Lexemes #9883

sonynavdeep81 · 2021-12-16T15:25:13Z

It is stated that spaCy stores everything as hash values for better memory utilization, so even a string in the vocabulary is stored as a hash value. It is also said that strings, along with other metadata, are stored in Lexemes, and that the collection of these Lexemes forms the vocabulary. What I am not able to understand is this: when I load en_core_web_lg, approximately 7,08,206 different strings are stored in the vocabulary, which can be confirmed using print(len(nlp.vocab.strings)), but when I check the number of lexemes in the vocabulary using print(len(nlp.vocab)), it shows 773. Why is there such a difference? Shouldn't the number of lexemes be equal to the number of strings stored in the vocabulary, since each string can be accessed through its lexeme, as in nlp.vocab['quick'].text? I am confused about how the vocabulary is created: is it made of lexemes or of strings? How are the lexemes and the strings stored in the vocabulary related? A visual explanation would be ideal, but I would appreciate a textual one too. Thank you.

Answered by adrianeboyd

adrianeboyd · 2021-12-17T07:37:30Z

I'd recommend this section of the docs: https://spacy.io/usage/spacy-101#vocab

And there's a graphical overview here: https://spacy.io/api

Here's an explanation I wrote on Stack Overflow for a similar question (https://stackoverflow.com/a/68889010):

There's no real "vocab" count in spaCy v2.3 or v3. You should mainly think of nlp.vocab and nlp.vocab.strings as caches where the total count isn't a meaningful value. The nlp.vocab Vocab is not static and grows as you process texts with the pipeline.

The vocab is a cache of Lexeme objects and the nlp.vocab.strings StringStore is a cache of string hashes. The vocab contains lexemes for tokens that have been seen before in some text that has been processed by the pipeline and the string store contains strings that have been seen before, either as tokens or as annotations (POS labels, lemmas, dependency labels).
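To make the cache behaviour concrete, here is a minimal sketch, assuming spaCy v3 is installed. It uses a blank English pipeline (tokenizer only) as a stand-in for en_core_web_lg so that no model download is needed, and the token "flibbertigibbet" is just a made-up example of a word the pipeline has not seen yet. The exact counts will differ between pipelines and versions, so the sketch only compares before and after.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_lg.
nlp = spacy.blank("en")

word = "flibbertigibbet"  # hypothetical token, chosen to be unseen so far

lexemes_before = len(nlp.vocab)
seen_before = word in nlp.vocab  # a membership test does not create a lexeme

# Tokenizing a text caches a Lexeme for every token not seen before.
doc = nlp(f"A {word} appeared.")

lexemes_after = len(nlp.vocab)
seen_after = word in nlp.vocab

print("lexemes before/after:", lexemes_before, lexemes_after)
print("word cached before/after:", seen_before, seen_after)
print("word in string store:", word in nlp.vocab.strings)
```

Note that looking a word up with nlp.vocab[word] also creates the lexeme if it is missing, which is why nlp.vocab['quick'].text works even when 'quick' has never been processed.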

The string store is not 100% a cache and may contain strings added during training that haven't been used before in the currently loaded pipeline, but the size of the string store doesn't tell you anything about the pipeline performance.
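The gap between the two counts can be reproduced on a small scale. The following is a rough sketch, assuming spaCy v3, that starts from an empty Vocab rather than a loaded pipeline. "MY_CUSTOM_LABEL" is a hypothetical annotation string used only for illustration. Adding a string to the string store does not create a lexeme, while looking a word up in the vocab creates both a lexeme and a string store entry, linked by the same hash.

```python
from spacy.vocab import Vocab

vocab = Vocab()  # empty vocab, no pipeline loaded

strings_0, lexemes_0 = len(vocab.strings), len(vocab)

# An annotation-like string goes into the string store only.
label_hash = vocab.strings.add("MY_CUSTOM_LABEL")  # hypothetical label
strings_1, lexemes_1 = len(vocab.strings), len(vocab)

# Looking up a word creates its Lexeme and interns its string.
lexeme = vocab["quick"]
strings_2, lexemes_2 = len(vocab.strings), len(vocab)

print("after adding a label:", strings_1 - strings_0, "new string,",
      lexemes_1 - lexemes_0, "new lexemes")
print("after looking up a word:", strings_2 - strings_1, "new string,",
      lexemes_2 - lexemes_1, "new lexeme")

# The lexeme holds the hash; the string store maps it back to text.
print(lexeme.orth == vocab.strings["quick"], vocab.strings[lexeme.orth])
```

In a trained pipeline the same effect happens at a much larger scale, which is consistent with the string store being far larger than the lexeme cache right after loading.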

1 reply

sonynavdeep81 Dec 17, 2021
Author

Thanks a lot for such a nice explanation!
