EntityRuler #6447
After #6309 I have another small issue (not sure it's an issue) with the EntityRuler.
does match the text
Does match "brain tumour",
Info about spaCy
Replies: 4 comments
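The pattern `{"LOWER": "Brain"}` shouldn't ever match anything, because `LOWER` compares against the lowercased token text. Double-check your first pattern to see if you can reproduce this. If you still think it's a bug, can you provide a minimal working example that shows this problem in a short script we can run?

The lemma pattern is trickier because it depends on the lemma assigned by your model, which depends on the tagger and the lemmatizer. If you look carefully at the `token.lemma_` values in your docs before you run the Matcher, you should be able to track down what's going on.

In terms of the vocab, you just need to be sure that the pipeline (`nlp`), entity ruler and the docs are all using the same vocab.

As an example with spaCy v2.3.2 showing that `{"LOWER": "Brain"}` never matches:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "Brain" contains an uppercase letter, so it can never equal a lowercased token.
matcher.add("Brain", [[{"LOWER": "Brain"}, {"LOWER": "tumour"}]])
matcher.add("brain", [[{"LOWER": "brain"}, {"LOWER": "tumour"}]])

for text in ["brain tumour", "Brain tumour", "BRAIN TUMOUR"]:
    for match_id, start, end in matcher(nlp(text)):
        # Print the name of the pattern that matched, then the input text.
        print(nlp.vocab.strings[match_id], "matched:", text)
```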
Thanks for the answer, @adrianeboyd. As far as I understand, I had lemma-based patterns and was trying to match "BRAIN TUMOURS", assuming that the lemma would also be lowercased, which I guess is not the case. So my solution is to lowercase the strings before calling the matcher or entity ruler. Thanks!
Glad to hear it's working! No, lemmas aren't necessarily lowercase. (This is part of why I recommended …)

Be aware that lowercasing your texts might lead to other unexpected results with the tagger, but you can decide what works best for your task!
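To make the point about lemma casing concrete, here is a minimal sketch. It assumes a blank English pipeline and sets the lemmas by hand as a stand-in for what a lemmatizer might assign to all-caps input; the pattern names `by_lemma` and `by_lower` are made up for illustration. Real lemmas depend on the model, so inspect `token.lemma_` in your own docs.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp.make_doc("BRAIN TUMOURS")

# Assumption: hand-set lemmas that keep the original casing, standing in
# for the output of a real tagger + lemmatizer on all-caps text.
for token, lemma in zip(doc, ["BRAIN", "TUMOUR"]):
    token.lemma_ = lemma

matcher = Matcher(nlp.vocab)
# LEMMA is compared as-is, so lowercase values do not match "BRAIN"/"TUMOUR".
matcher.add("by_lemma", [[{"LEMMA": "brain"}, {"LEMMA": "tumour"}]])
# LOWER is compared against the lowercased token text, whatever the input casing.
matcher.add("by_lower", [[{"LOWER": "brain"}, {"LOWER": "tumours"}]])

matched = sorted({nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)})
print(matched)
```

With these hand-set lemmas only the `LOWER` pattern fires, which is one way to get case-insensitive matching without lowercasing the text before it reaches the tagger.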
Indeed. Thanks for the advice.
The pattern `{"LOWER": "Brain"}` shouldn't ever match anything, so double-check your first pattern to see if you can reproduce this? If you still think it's a bug, can you provide a minimal working example that shows this problem in a short script we can run?

The lemma pattern is trickier because it depends on the lemma assigned by your model, which depends on the tagger and the lemmatizer. If you look carefully at the `token.lemma_` values in your docs before you run the Matcher, you should be able to track down what's going on.

In terms of the vocab, you just need to be sure that the pipeline (`nlp`), entity ruler and the docs are all using the same vocab, so:
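A minimal sketch of that setup, assuming a blank English pipeline and a hypothetical `DISEASE` label (neither is taken from the thread). The ruler is built from `nlp` and the doc is created by the same `nlp`, so all three share one vocab. The ruler is called directly on the doc here; adding it to the pipeline works differently across spaCy versions.

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")

# Constructing the ruler from nlp ties its matchers to nlp.vocab.
ruler = EntityRuler(nlp)
ruler.add_patterns([
    # "DISEASE" is an illustrative label, not one from the original question.
    {"label": "DISEASE", "pattern": [{"LOWER": "brain"}, {"LOWER": "tumour"}]},
])

# The doc comes from the same nlp object, so it uses the same vocab too.
doc = ruler(nlp.make_doc("A BRAIN TUMOUR was found."))
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
print(doc.vocab is nlp.vocab)
```

Docs created by a different `nlp` object would carry a different vocab, which is the situation to avoid.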