KeyError when trying to remove rule from PhraseMatcher using match ID #9874
-
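How to reproduce the behaviour

Unable to reproduce the behaviour. It happens occasionally in the production environment.

Details

I have a custom component (code below). For every call, I add the pattern with the associated match ID:

self._phrase_matcher.add(Hints.PERSON.value, [person_hint_pattern])

Therefore, I should be able to remove the pattern based on the same match ID:

self._phrase_matcher.remove(Hints.PERSON.value)

What am I missing? Thank you for your help 🙏🏽

# Imports the snippet relies on (the logger setup is an assumption)
import logging
from enum import Enum

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

logger = logging.getLogger(__name__)


@Language.factory("hints")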
def hints_component(nlp, name):
    logger.info(f"Adding {name} component.")
    return HintsComponent(nlp=nlp)


class Hints(str, Enum):
    PERSON = "PERSON"
    ORG = "ORG"


class HintsComponent:
    """HintsComponent uses rule-based phrase matching to tag entities.

    Hints are added to the Doc object post tokenization. The HintsComponent:
    1. Accesses the hints using doc._.hints.
    2. Adds hints as patterns to the PhraseMatcher.
    3. If matches are found, use the matches to create entity spans and assign
       entity labels to the spans based on the hint type.
    4. Removes the patterns (hints) from the PhraseMatcher to ensure subsequent
       requests do not use stale hints.
    """

    # Add hints extension to Doc object. doc._.hints is set post tokenization
    if not Doc.has_extension("hints"):
        Doc.set_extension("hints", default={})

    def __init__(self, nlp: Language):
        """Initialize the HintsComponent.

        Args:
            nlp: spaCy language model
        """
        self._nlp = nlp
        self._phrase_matcher = PhraseMatcher(nlp.vocab,
                                             attr="TEXT",
                                             validate=True)

    def __call__(self, doc) -> Doc:
        if not doc._.hints:
            return doc
        hints = doc._.hints
        person_hint_pattern = self._nlp(hints[Hints.PERSON])
        org_hint_pattern = self._nlp(hints[Hints.ORG])
        self._phrase_matcher.add(Hints.PERSON.value, [person_hint_pattern])
        self._phrase_matcher.add(Hints.ORG.value, [org_hint_pattern])
        matches = self._phrase_matcher(doc, as_spans=True)
        # In case of overlapping matches, keep the longest match
        doc.ents = spacy.util.filter_spans(matches)
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent)
        # Remove patterns to ensure that hints from previous requests are not used
        # in successive requests
        self._phrase_matcher.remove(Hints.PERSON.value)
        self._phrase_matcher.remove(Hints.ORG.value)
        return doc

The component is used like so:

import spacy
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(factory_name="hints", name="hints", before="ner")
doc = nlp.make_doc("John Smith works at ACME")
doc._.hints = {"PERSON": "John Smith", "ORG": "ACME"}
doc = nlp(doc)

Your Environment
Replies: 4 comments
-
How are you using this in practice? Are you only running it from a simple
If you initialize a new phrase matcher in each

self._phrase_matcher = PhraseMatcher(nlp.vocab,
                                     attr="TEXT",
                                     validate=True)

The phrase matcher init should not be expensive.
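A minimal sketch of that suggestion, assuming the hints live in `doc._.hints` as in the question. The class name `PerCallHintsComponent` and the factory name `per_call_hints` are hypothetical, and `spacy.blank("en")` is used only so the example runs without a downloaded model. Patterns are built with `make_doc`, which is an illustrative choice rather than what the original component does.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
from spacy.util import filter_spans

if not Doc.has_extension("hints"):
    Doc.set_extension("hints", default={})


class PerCallHintsComponent:
    """Hypothetical variant of the component that keeps no matcher state."""

    def __init__(self, nlp: Language):
        self._nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        hints = doc._.hints
        if not hints:
            return doc
        # A fresh matcher per call: no shared state, so nothing to remove later
        matcher = PhraseMatcher(self._nlp.vocab, attr="TEXT")
        for label, text in hints.items():
            # make_doc only tokenizes, which is enough for attr="TEXT"
            matcher.add(label, [self._nlp.make_doc(text)])
        # Keep the longest span when matches overlap
        doc.ents = filter_spans(matcher(doc, as_spans=True))
        return doc


@Language.factory("per_call_hints")
def create_per_call_hints(nlp, name):
    return PerCallHintsComponent(nlp)


nlp = spacy.blank("en")
nlp.add_pipe("per_call_hints")
doc = nlp.make_doc("John Smith works at ACME")
doc._.hints = {"PERSON": "John Smith", "ORG": "ACME"}
doc = nlp(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the matcher is a local variable, concurrent requests never see each other's patterns and the `remove` calls are no longer needed.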
-
No. I am running this as part of a FastAPI application (with Gunicorn and UvicornWorkers). I load the
I should have done that but was worried about how expensive it would be. I will change that and let you know how it goes. As always, thank you for replying so promptly. 🙏🏽
-
Using FastAPI would explain why you were running into problems, since you can end up with a configuration where multiple threads are trying to modify the PhraseMatcher object at the same time. Some relevant FastAPI docs: https://fastapi.tiangolo.com/async/

My guess would be that creating a new phrase matcher each time is not going to be much slower (if at all) than trying to do the remove.
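To make the failure mode concrete, here is a toy illustration of how two interleaved requests on one shared matcher could end in a `KeyError`. `ToyMatcher` is a hypothetical stand-in, not spaCy's implementation; it only assumes that `remove` fails when the key is already gone, which is what the reported error suggests. The interleaving is written out sequentially so the result is deterministic.

```python
class ToyMatcher:
    """Hypothetical stand-in for a shared matcher; not spaCy's implementation."""

    def __init__(self):
        self._patterns = {}

    def add(self, key, patterns):
        # Patterns for an existing key are appended to the same entry
        self._patterns.setdefault(key, []).extend(patterns)

    def remove(self, key):
        # Like dict deletion, this raises KeyError if the key is already gone
        del self._patterns[key]


shared = ToyMatcher()

# Two requests interleave on the same shared instance
shared.add("PERSON", ["John Smith"])  # request A adds its hint
shared.add("PERSON", ["Jane Doe"])    # request B adds its hint
shared.remove("PERSON")               # request A cleans up, dropping B's hint too

try:
    shared.remove("PERSON")           # request B cleans up
    outcome = "ok"
except KeyError as err:
    outcome = f"KeyError: {err}"

print(outcome)  # prints: KeyError: 'PERSON'
```

The same ordering also means request B may have matched against request A's hints before the cleanup, which is another reason to avoid sharing the matcher between requests.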
-
You are right. Creating a new instance of PhraseMatcher on every call has negligible overhead. Also, using

Thank you @adrianeboyd