`KeyError` when trying to remove rule from PhraseMatcher using match ID #9874

raqibhayder · 2021-12-14T22:59:34Z

How to reproduce the behaviour

Unable to reproduce the behaviour. It happens occasionally in the production environment.

Details

I have a custom component called hints, which I use to match entities before passing the doc to the NER component in the pipeline. From time to time I get a KeyError when trying to remove the match ID from the PhraseMatcher instance. I know I can work around this by checking whether the match ID exists before removing it, but I want to understand the reason for this behaviour.

For every call, I add the pattern with the associated match ID. For example like so:

self._phrase_matcher.add(Hints.PERSON.value, [person_hint_pattern])

Therefore, I should be able to remove the pattern based on match ID too. For example like so:

self._phrase_matcher.remove(Hints.PERSON.value)

What am I missing?

Thank you for your help 🙏🏽

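import logging
from enum import Enum

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

logger = logging.getLogger(__name__)


@Language.factory("hints")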
def hints_component(nlp, name):
    logger.info(f"Adding {name} component.")
    return HintsComponent(nlp=nlp)


class Hints(str, Enum):
    PERSON = "PERSON"
    ORG = "ORG"


class HintsComponent:
    """HintsComponent uses rule-based phrase matching to tag entities.

    Hints are added to the Doc object post tokenization. The HintsComponent:
    1. Accesses the hints using doc._.hints.
    2. Adds hints as patterns to the PhraseMatcher.
    3. If matches are found, use the matches to create entity spans and assign
       entity labels to the spans based on the hint type.
    4. Removes the patterns (hints) from the PhraseMatcher to ensure subsequent
       requests do not use stale hints.
    """

    # Add hints extension to Doc object. doc._.hints is set post tokenization
    if not Doc.has_extension("hints"):
        Doc.set_extension("hints", default={})

    def __init__(self, nlp: Language):
        """Initialize the HintsComponent.

        Args:
            nlp: spaCy language model
        """

        self._nlp = nlp
        self._phrase_matcher = PhraseMatcher(nlp.vocab,
                                             attr="TEXT",
                                             validate=True)

    def __call__(self, doc) -> Doc:
        if not doc._.hints:
            return doc

        hints = doc._.hints

        person_hint_pattern = self._nlp(hints[Hints.PERSON])
        org_hint_pattern = self._nlp(hints[Hints.ORG])

        self._phrase_matcher.add(Hints.PERSON.value, [person_hint_pattern])
        self._phrase_matcher.add(Hints.ORG.value, [org_hint_pattern])

        matches = self._phrase_matcher(doc, as_spans=True)
        # In case of overlapping matches, keep the longest match
        doc.ents = spacy.util.filter_spans(matches)

        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent)

        # Remove patterns to ensure that hints from previous requests are not used
        # in successive requests
        self._phrase_matcher.remove(Hints.PERSON.value)
        self._phrase_matcher.remove(Hints.ORG.value)


        return doc

The component is used like so:

import spacy 
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(factory_name="hints", name="hints", before="ner")
doc = nlp.make_doc("John Smith works at ACME")
doc._.hints = {"PERSON": "John Smith", "ORG": "ACME"}
doc = nlp(doc)

Your Environment

Operating System: Debian
Python Version Used: 3.9.7
spaCy Version Used: 3.2.1

Answered by adrianeboyd

Dec 16, 2021

Using FastAPI would explain why you were running into problems, since you can end up with a configuration where multiple threads are trying to modify the PhraseMatcher object at the same time. Some relevant FastAPI docs: https://fastapi.tiangolo.com/async/

My guess would be that creating a new phrase matcher each time is not going to be much slower (if at all) than trying to do the remove.


adrianeboyd · 2021-12-15T08:29:03Z

How are you using this in practice? Are you only running it from a simple python script.py as in the example?

If you initialize a new phrase matcher in each __call__ instead of __init__ (so no need to use remove) does this solve the problem?

        self._phrase_matcher = PhraseMatcher(nlp.vocab,
                                             attr="TEXT",
                                             validate=True)

The phrase matcher init should not be expensive.
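A minimal sketch of that suggestion, assuming the rest of the component stays as posted. The class name PerCallHintsComponent is hypothetical, the retokenization step is left out for brevity, and spacy.blank("en") is used only so the sketch runs without a model download. The patterns are built with make_doc, which only tokenizes; that is enough when matching on attr="TEXT".

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
from spacy.util import filter_spans

if not Doc.has_extension("hints"):
    Doc.set_extension("hints", default=None)


class PerCallHintsComponent:
    """Hypothetical variant of HintsComponent that keeps no matcher state."""

    def __init__(self, nlp: Language):
        self._nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        hints = doc._.hints
        if not hints:
            return doc

        # A fresh matcher per call: nothing is shared between requests,
        # so there is nothing to remove afterwards.
        matcher = PhraseMatcher(self._nlp.vocab, attr="TEXT", validate=True)
        for label, text in hints.items():
            # make_doc only runs the tokenizer, not the full pipeline.
            matcher.add(label, [self._nlp.make_doc(text)])

        # In case of overlapping matches, keep the longest match.
        doc.ents = filter_spans(matcher(doc, as_spans=True))
        return doc


nlp = spacy.blank("en")  # assumption: blank pipeline, for illustration only
component = PerCallHintsComponent(nlp)

doc = nlp.make_doc("John Smith works at ACME")
doc._.hints = {"PERSON": "John Smith", "ORG": "ACME"}
doc = component(doc)
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Because the matcher is a local variable, two overlapping calls can never see each other's patterns, which is the property the remove-based version was trying to get.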

raqibhayder (Author) · 2021-12-15T19:46:09Z

> How are you using this in practice? Are you only running it from a simple python script.py as in the example?

No. I am running this as part of a FastAPI application (with Gunicorn and UvicornWorkers). I load the en_core_web_md and add the component to the pipeline like above.

> If you initialize a new phrase matcher in each __call__ instead of __init__ (so no need to use remove) does this solve the problem?

I should have done that but was worried about how expensive it would be. I will change that and let you know how it goes.

As always, thank you for replying so promptly. 🙏🏽

adrianeboyd · 2021-12-16T09:54:11Z

Using FastAPI would explain why you were running into problems, since you can end up with a configuration where multiple threads are trying to modify the PhraseMatcher object at the same time. Some relevant FastAPI docs: https://fastapi.tiangolo.com/async/

My guess would be that creating a new phrase matcher each time is not going to be much slower (if at all) than trying to do the remove.
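One plausible interleaving that would produce the KeyError can be sketched without spaCy at all. The class below is a toy stand-in, not spaCy's implementation; it assumes only that adding to an existing key keeps that key and that removing a missing key raises KeyError. The two "requests" are written out sequentially to make the ordering explicit.

```python
class SharedMatcherStandIn:
    """Toy stand-in for a matcher shared between requests.

    NOT spaCy's implementation. It mimics two assumed behaviours only:
    add() extends an existing key, remove() raises KeyError if the key
    is missing.
    """

    def __init__(self):
        self._patterns = {}

    def add(self, key, patterns):
        self._patterns.setdefault(key, set()).update(patterns)

    def remove(self, key):
        del self._patterns[key]  # KeyError if another caller removed it first


matcher = SharedMatcherStandIn()

# Requests A and B overlap; this is one possible ordering of their steps.
matcher.add("PERSON", ["John Smith"])  # A adds its hint
matcher.add("PERSON", ["Jane Doe"])    # B adds before A has cleaned up
matcher.remove("PERSON")               # A cleans up, dropping B's pattern too

try:
    matcher.remove("PERSON")           # B cleans up a key that is already gone
    outcome = "removed"
except KeyError as err:
    outcome = f"KeyError: {err}"
```

The same ordering also means request B could have matched against request A's hint, so guarding remove with an existence check would hide the error without fixing the shared state.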

raqibhayder (Author) · 2021-12-16T19:41:46Z

> My guess would be that creating a new phrase matcher each time is not going to be much slower (if at all) than trying to do the remove.

You are right. Creating a new instance of PhraseMatcher on every call has negligible overhead. Also, using Language.make_doc speeds everything up significantly.

Thank you @adrianeboyd
