Model is predicting extra labels? #12457
I noticed the pre-trained pipelines will predict labels which aren't explicitly mentioned in the models doc page, and aren't referenced in the models themselves. Here's a reproducible example with JSON output:

    import spacy
    from json import dumps

    models = ["sm", "md", "lg", "trf"]
    text = "The third oldest French bridge."
    all_known_labels = {}
    for model in models:
        model = "en_core_web_{0}".format(model)
        nlp = spacy.load(model)
        all_known_labels[model] = {}
        all_known_labels[model]["labels"] = [{pipe[0]: list(nlp.get_pipe(pipe[0]).labels)} for pipe in nlp.components]
        all_known_labels[model]["parsed"] = nlp(text).to_json()
    print(dumps(all_known_labels, indent=2))

The model's pipelines aren't aware/configured/trained on the …

I'm working on entity extraction, so adjectives describe the properties of a thing, and I noticed the model predicting things it didn't have labels for. I'm trying to get a list of all possible labels the models support.

Is this a known issue? Am I doing something wrong?

Edit: I am using the CPU models on Apple Silicon with v3.5.1
Replies: 1 comment 1 reply
It's true that the labels aren't all on the model docs page. It's a bit complicated because the pipelines include a mixture of statistical and rule-based components, and they don't all expose labels in the same way.

Are you looking for information about the English pipelines in particular or more generally?

For statistical components: the values in Pipe.labels are internal labels for the statistical models and may not be the exact labels that are saved on tokens or spans. You don't see this much with the components included in the current English pipelines, but more for components like the morphologizer or trainable lemmatizer.

For rule-based components like attribute_ruler and lemmatizer: …

The English POS tags come from a tag+dep->pos mapping in the …

Unlike all the other attributes, … (spaCy/spacy/parts_of_speech.pxd, lines 5 to 24 in 28de857)
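To see which values end up on tokens without being listed in any Pipe.labels, one option is to diff the two. This is only a minimal sketch: it assumes the dict layout that Doc.to_json() produces (tokens carrying tag, pos and dep, ents carrying label), the helper names observed_labels and undeclared are made up for illustration, and the sample data is hand-written, not real model output.

```python
from collections import defaultdict

# Token-level keys in a Doc.to_json() dict that hold label-like values.
TOKEN_KEYS = ("tag", "pos", "dep")


def observed_labels(doc_json):
    """Collect the label values that actually appear in one Doc.to_json() dict."""
    seen = defaultdict(set)
    for token in doc_json.get("tokens", []):
        for key in TOKEN_KEYS:
            if token.get(key):
                seen[key].add(token[key])
    for ent in doc_json.get("ents", []):
        seen["ent"].add(ent["label"])
    return dict(seen)


def undeclared(observed, declared):
    """Return observed values that no component lists in its labels."""
    known = set()
    for labels in declared.values():
        known.update(labels)
    return {
        key: sorted(values - known)
        for key, values in observed.items()
        if values - known
    }


# Hand-written sample in the shape of Doc.to_json(); NOT real model output.
sample = {
    "text": "The bridge",
    "ents": [],
    "tokens": [
        {"id": 0, "start": 0, "end": 3, "tag": "DT", "pos": "DET", "dep": "det", "head": 1},
        {"id": 1, "start": 4, "end": 10, "tag": "NN", "pos": "NOUN", "dep": "ROOT", "head": 1},
    ],
}

# Hypothetical per-component labels, shortened for the example.
declared = {"tagger": ["DT", "NN"], "parser": ["det", "ROOT"], "attribute_ruler": []}

result = undeclared(observed_labels(sample), declared)
print(result)  # {'pos': ['DET', 'NOUN']}
```

With the script from the question, declared could be built from each pipeline's "labels" list and sample replaced by the "parsed" entry. Anything the diff reports is set by a component that does not advertise it through labels.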