EntityRuler #6447

lfoppiano · 2020-11-25T13:35:54Z

After #6309 I have another small issue (not sure it's an issue) with the EntityRuler.

Is it normal that the following pattern:

{"label":"BAO_BIO_RB_syn","pattern":[{"LOWER":"Brain"},{"LOWER":"tumour"}],"id":"http://purl.obolibrary.org/obo/DOID_1319"}

matches the text "brain tumour" and "Brain tumour", but not "BRAIN TUMOUR"? Is it because of the uppercase B?

The following patterns:

{"label":"BAO_BIO_RB_syn","pattern":[{"LOWER":"brain"},{"LOWER":"tumour"}],"id":"http://purl.obolibrary.org/obo/DOID_1319"}
{"label":"BAO_BIO_RB_syn","pattern":[{"LEMMA":"brain"},{"LEMMA":"tumour"}],"id":"http://purl.obolibrary.org/obo/DOID_1319"}

match "brain tumour", "BRAIN TUMOUR" and "brain tumours", but not "BRAIN TUMOURS". Any idea why?

Related to #4651 ("EntityRuler from_disk method doesn't work when phrase_matcher_attr is specified"): if I read the EntityRuler from disk and initialise spaCy with a pretrained model, do I still need to save and load the vocab as well? I imagine not, but I want to be sure.

Info about spaCy

spaCy version: 2.3.2
Platform: macOS-10.15.7-x86_64-i386-64bit
Python version: 3.8.5

Answered by adrianeboyd


adrianeboyd · 2020-11-26T09:40:56Z

The pattern {"LOWER": "Brain"} shouldn't ever match anything, because LOWER is compared against the lowercase form of each token and "Brain" contains an uppercase letter. Please double-check your first pattern to see whether you can reproduce this. If you still think it's a bug, can you provide a minimal working example that shows the problem in a short script we can run?

The lemma pattern is trickier because it depends on the lemma assigned by your model, which depends on the tagger and the lemmatizer. If you look carefully at the token.lemma_ values in your docs before you run the Matcher, you should be able to track down what's going on.
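To follow this suggestion, it can help to print the lemma next to the tag for each token. The following is only a small sketch: `lemma_report` and `print_lemma_report` are hypothetical helper names, they rely only on the `text`, `lemma_`, `pos_` and `tag_` token attributes, and they expect whatever pipeline object you have already loaded (the `nlp` argument) rather than loading a model themselves.

```python
def lemma_report(nlp, texts):
    """Collect (text, lemma, pos, tag) for every token of every text.

    Hypothetical helper: `nlp` is any callable that returns an iterable of
    tokens exposing text, lemma_, pos_ and tag_, such as a loaded spaCy pipeline.
    """
    rows = []
    for text in texts:
        for token in nlp(text):
            rows.append((token.text, token.lemma_, token.pos_, token.tag_))
    return rows


def print_lemma_report(nlp, texts):
    """Print one token per line so that differing lemmas are easy to spot."""
    for text, lemma, pos, tag in lemma_report(nlp, texts):
        print(f"{text!r:12} lemma={lemma!r:12} pos={pos:6} tag={tag}")
```

With a pipeline loaded via `spacy.load(model)`, calling `print_lemma_report(nlp, ["brain tumours", "BRAIN TUMOURS"])` should show whether the all-caps tokens received a different tag, and therefore a different lemma, than their lowercase counterparts.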

In terms of the vocab, you just need to be sure that the pipeline (nlp), entity ruler and the docs are all using the same vocab, so:

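import spacy
from spacy.pipeline import EntityRuler

# spaCy v2.x: the ruler is built from nlp, so it shares nlp.vocab
nlp = spacy.load(model)
ruler = EntityRuler(nlp).from_disk(path)
nlp.add_pipe(ruler)
doc = nlp(text)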

As an example with spaCy v2.3.2 showing that {"LOWER": "Brain"} doesn't match anything:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add("Brain", [[{"LOWER": "Brain"}, {"LOWER": "tumour"}]])
matcher.add("brain", [[{"LOWER": "brain"}, {"LOWER": "tumour"}]])

for text in ["brain tumour", "Brain tumour", "BRAIN TUMOUR"]:
    for match_id, start, end in matcher(nlp(text)):
        print(nlp.vocab.strings[match_id], "matched:", text)

Output:

brain matched: brain tumour
brain matched: Brain tumour
brain matched: BRAIN TUMOUR
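Since a `LOWER` value containing uppercase letters never matches, one way to guard against patterns like the first one is to lowercase every `LOWER` value before handing the patterns to the ruler. A minimal sketch in plain Python, assuming the patterns are stored one JSON object per line as in the question; `normalize_lower_values` is a hypothetical helper name and is not part of spaCy.

```python
import json


def normalize_lower_values(entry):
    """Return a copy of an EntityRuler pattern entry with string LOWER values lowercased.

    Hypothetical helper, not part of spaCy. Phrase patterns (plain strings)
    and non-string LOWER values such as {"IN": [...]} are left untouched.
    """
    pattern = entry["pattern"]
    if isinstance(pattern, str):
        return dict(entry)
    fixed = []
    for token_spec in pattern:
        spec = dict(token_spec)
        if isinstance(spec.get("LOWER"), str):
            spec["LOWER"] = spec["LOWER"].lower()
        fixed.append(spec)
    return {**entry, "pattern": fixed}


# The first pattern from the question, as it would appear in a patterns file.
line = '{"label":"BAO_BIO_RB_syn","pattern":[{"LOWER":"Brain"},{"LOWER":"tumour"}],"id":"http://purl.obolibrary.org/obo/DOID_1319"}'
normalized = normalize_lower_values(json.loads(line))
print(json.dumps(normalized["pattern"]))
```

The label and id are carried over unchanged; only the token specifications are rewritten.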

lfoppiano (Author) · 2020-11-30T12:53:02Z

Thanks for the answer, @adrianeboyd.

As far as I understand, my patterns were something like:
matcher.add("1", [[{"LOWER":"brain"},{"LOWER":"tumour"}]])
matcher.add("2", [[{"LEMMA":"brain"},{"LEMMA":"tumour"}]])

and I was trying to match "BRAIN TUMOURS", assuming that lemmatization would also lowercase the token, which I guess is not the case.

In this case, my solution is to lowercase the strings before calling the matcher or entity ruler.
If I have said something wrong, please don't hesitate to correct me. 😅

Thanks!

adrianeboyd · 2020-11-30T13:04:24Z

Glad to hear it's working!

No, lemmas aren't necessarily lowercase. (This is part of why I recommended LOWER over LEMMA in the other issue. The tagger confuses PROPN and NOUN too often for tokens like BRAIN, and the lemma depends on the POS.)

Be aware that lowercasing your texts might lead to other unexpected results with the tagger, but you can decide what works best for your task!
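One alternative to lowercasing the input is to leave the text alone and list the surface variants in the pattern itself, using the `IN` operator that the Matcher accepts for attribute values (spaCy v2.1 and later). The sketch below only builds the pattern data: `lower_variants` is a hypothetical helper, and forming the plural by appending "s" is a naive assumption that fails for irregular nouns.

```python
def lower_variants(words, plural_last=True):
    """Build a token pattern that matches each word case-insensitively.

    Hypothetical helper. If plural_last is true, the last token also accepts
    a naive plural (word + "s"); irregular plurals need to be listed by hand.
    """
    pattern = []
    for i, word in enumerate(words):
        word = word.lower()
        is_last = i == len(words) - 1
        if plural_last and is_last:
            pattern.append({"LOWER": {"IN": [word, word + "s"]}})
        else:
            pattern.append({"LOWER": word})
    return pattern


# Same label and id as in the question; only the pattern is generated.
entry = {
    "label": "BAO_BIO_RB_syn",
    "pattern": lower_variants(["Brain", "tumour"]),
    "id": "http://purl.obolibrary.org/obo/DOID_1319",
}
print(entry["pattern"])
```

Because the tagger still sees the original casing, this avoids the side effects of lowercasing the text, at the cost of having to enumerate the variants you care about.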

lfoppiano (Author) · 2020-11-30T13:22:04Z

Indeed. Thanks for the advice.
