Model is predicting extra labels? #12457

siennathesane · 2023-03-23T04:48:08Z

I noticed that the pre-trained pipelines predict labels that aren't listed on the models documentation page and aren't referenced in the models themselves.

Here's a reproducible example with JSON output:

import json

import spacy

sizes = ["sm", "md", "lg", "trf"]
text = "The third oldest French bridge."
all_known_labels = {}

for size in sizes:
    model_name = f"en_core_web_{size}"
    nlp = spacy.load(model_name)
    all_known_labels[model_name] = {
        # Labels each component reports through its `labels` property.
        "labels": [{name: list(nlp.get_pipe(name).labels)} for name, _ in nlp.components],
        # Full token-level annotations for the sample sentence.
        "parsed": nlp(text).to_json(),
    }

print(json.dumps(all_known_labels, indent=2))

None of the pipeline components lists ADJ among its labels, yet all four models tag oldest (lemma old) and French as ADJ parts of speech. The parser component correctly marks both as amod dependencies, but none of the models contains any reference to the ADJ label.

I'm working on entity extraction, where adjectives describe the properties of a thing, which is how I noticed the models predicting labels they don't list. I'm trying to get a list of all possible labels the models support. Is this a known issue? Am I doing something wrong?

Edit: I am using the CPU models on Apple Silicon with v3.5.1

adrianeboyd · 2023-03-24T08:16:27Z

It's true that the labels aren't all on the model docs page. It's a bit complicated because the pipelines include a mixture of statistical and rule-based components, and they don't all expose labels in the same way.

Are you looking for information about the English pipelines in particular or more generally?

For statistical components: the values in Pipe.labels are internal labels for the statistical models and may not be the exact labels that are saved on tokens or spans. This rarely matters for the components included in the current English pipelines, but it does for components like the morphologizer or trainable lemmatizer.
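
To make the gap between internal labels and saved annotations concrete, here is a minimal sketch in plain Python. It assumes (this is not stated in the answer) that a morphologizer-style component keeps one combined internal label per class, such as `Degree=Sup|POS=ADJ`, and splits it into a POS value and a feature string when it annotates a token. The function name `split_internal_label` and the sample labels are hypothetical and only mimic the idea; this is not spaCy's implementation.

```python
from typing import Tuple


def split_internal_label(label: str) -> Tuple[str, str]:
    """Split a combined internal label into (pos, morph).

    Assumption: features are joined with "|" and the POS is stored
    as a pseudo-feature named "POS".
    """
    pos = ""
    feats = []
    for feat in label.split("|"):
        if feat.startswith("POS="):
            pos = feat[len("POS="):]
        elif feat:
            feats.append(feat)
    return pos, "|".join(feats)


# Hypothetical internal labels, as a component might report them in `labels`.
internal_labels = ["Degree=Sup|POS=ADJ", "Number=Sing|POS=NOUN", "POS=PUNCT"]

for label in internal_labels:
    pos, morph = split_internal_label(label)
    print(f"{label!r} -> pos={pos!r}, morph={morph!r}")
```

Under that assumption, a plain `ADJ` would never appear in the component's label list even though tokens end up with `ADJ` as their POS.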

For rule-based components like attribute_ruler and lemmatizer: these components don't store an internal set of labels (plus lemmatizers can generate new lemmas for novel words, so there's no way to store a list).

The English POS tags come from a tag+dep->pos mapping in the attribute_ruler component.
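
As a rough illustration of what a tag+dep->pos mapping does, the sketch below derives a coarse POS from a fine-grained tag, with an optional override keyed on the dependency label. The two tables are a tiny hypothetical excerpt written for this example, not the mapping shipped with the English pipelines, and `coarse_pos` is an illustrative name. The tags and dependencies assigned to the sample sentence are also assumed.

```python
from typing import Dict, Tuple

# Hypothetical excerpt: fine-grained tag -> coarse POS.
TAG_TO_POS: Dict[str, str] = {
    "JJ": "ADJ",
    "JJR": "ADJ",
    "JJS": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "DT": "DET",
    "VB": "VERB",
    "VBZ": "VERB",
}

# Hypothetical overrides: (tag, dep) -> coarse POS, checked before the tag table.
TAG_DEP_TO_POS: Dict[Tuple[str, str], str] = {
    ("VB", "aux"): "AUX",
    ("VBZ", "aux"): "AUX",
}


def coarse_pos(tag: str, dep: str) -> str:
    """Return the coarse POS for a (tag, dep) pair, falling back to "X"."""
    if (tag, dep) in TAG_DEP_TO_POS:
        return TAG_DEP_TO_POS[(tag, dep)]
    return TAG_TO_POS.get(tag, "X")


# Assumed annotations for "The third oldest French bridge"
tokens = [
    ("The", "DT", "det"),
    ("third", "JJ", "amod"),
    ("oldest", "JJS", "amod"),
    ("French", "JJ", "amod"),
    ("bridge", "NN", "ROOT"),
]

for text, tag, dep in tokens:
    print(text, tag, dep, coarse_pos(tag, dep))
```

This is why ADJ can show up on tokens without appearing in any component's labels: the statistical components predict fine-grained tags and dependencies, and the coarse POS is derived from them by rule.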

Unlike all the other attributes, token.pos has a hard-coded list of supported labels, listed below. SPACE and EOL are not standard UPOS tags: SPACE is used for whitespace tokens and EOL is unused.

spaCy/spacy/parts_of_speech.pxd, lines 5 to 24 in 28de857:

    ADJ = symbols.ADJ
    ADP
    ADV
    AUX
    CONJ
    CCONJ # U20
    DET
    INTJ
    NOUN
    NUM
    PART
    PRON
    PROPN
    PUNCT
    SCONJ
    SYM
    VERB
    X
    EOL
    SPACE
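
If you want to check predicted values against this fixed list in your own code, a minimal sketch could look like the following. The label names are copied from the listing above; `POS_LABELS`, `NON_UPOS`, and `is_supported_pos` are hypothetical names introduced for this example. spaCy also appears to expose the same names at runtime through `spacy.parts_of_speech.IDS`, which is worth verifying against your installed version.

```python
# Labels copied from the listing above (spacy/parts_of_speech.pxd).
POS_LABELS = (
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "EOL", "SPACE",
)

# Per the answer above, these two are not standard UPOS tags.
NON_UPOS = frozenset({"EOL", "SPACE"})


def is_supported_pos(label: str) -> bool:
    """Return True if `label` is one of the hard-coded token.pos labels."""
    return label in POS_LABELS


# Fine-grained tags such as "JJ" are not part of this list.
for candidate in ("ADJ", "SPACE", "JJ"):
    print(candidate, is_supported_pos(candidate), candidate in NON_UPOS)
```

Because the list is fixed, a check like this works for any pipeline, whereas labels for other attributes have to be collected from the individual components.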

1 reply

siennathesane Mar 24, 2023
Author

I'm looking for more information in general; I'm starting with English since that's my native language. All of this information is really helpful, thank you!
