Bug: EntityRuler is failing to pick up entities #11600
How to reproduce the behaviour
It appears that the language model is somehow getting in the way of entity recognition: different words and poor sentence structure can prevent entities from being recognised. Can someone please point me to the spaCy test cases and test framework? It seems unusual that such a large, widely used Python module fails on what looks like an obvious test case.
This is for a production system that cannot "miss entities" in this way.
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
temp_ruler = nlp.add_pipe("entity_ruler", name="temp_ruler", validate=True)
ruler = EntityRuler(nlp, validate=True, overwrite_ents=True, phrase_matcher_attr="LOWER")
#globals()["watchlist_ruler_"+str(watchindexx)].name = 'watchlistRuler_'+watchindexx
patterns = [{'label': 'PERSON', 'pattern': 'Entityaus Smith'}]
ruler.add_patterns(patterns)
# WORKING ==============
# target_text = '|<p data-key="1p">| I admire |<strong data-key="2strong">| Entityaus Smith |</strong>| she is great. |</p>|'
# target_text = '|<p data-key="1p">| I admire Entityaus Smith she is great. |</p>|'
# target_text = 'I admire Entityaus Smith she is great'
# target_text = 'I admire Entityaus Smith she is amazing.'
# target_text = 'I like Entityaus Smith, she is amazing'
# target_text = 'I like Entityaus Smith and Bob Smith they are amazing'
# target_text = 'I think Entityaus Smith is amazing'
# target_text = 'This sentence is Entityaus Smith they are amazing'
# target_text = 'I like Entityaus Smith they are amazing'
# target_text = 'I like Entityaus Smith are amazing'
# target_text = 'I like Entityaus Smith. She is amazing'
# target_text = 'I like Entityaus Smith. she is amazing'
# FAILING =============
# target_text = 'I like Entityaus Smith she is amazing'
# target_text = 'I admire entityaus smith she is great'
# target_text = 'I admire entityaus smith she is amazing'
# target_text = 'I like Entityaus Smith she is amazing.'
# target_text = 'I like entityaus smith she is amazing'
# target_text = '<p>I like Entityaus Smith she is amazing.</p>'
target_text = 'I like Entityaus Smith she is great'
# target_text = 'I admire Entityaus Smith she is great' # WORKING - one word difference
processed_field = nlp(target_text)
# Should have found Entityaus Smith - but doesn't
print(str(processed_field.ents))
Your Environment
Replies: 2 comments
You can check out our
We do have a CI pipeline on GitHub (and various other test pipelines internally) ensuring that all of our tests pass. Feel free to check out the CI config.
That's excellent, and many thanks Raphael, an excellent answer! It's still a little odd that it partly works using the older style; I'd prefer spaCy to fail completely rather than partially work. However, I do agree that I wasn't using the API correctly.
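One way to look into the "partly works" behaviour is to check what a standalone ruler does when it is never added to the pipeline. The sketch below assumes spaCy v3 and uses a blank English pipeline (an assumption, chosen so no trained model or statistical NER is involved); the variable names are illustrative only. If the pipeline returns no entities here, the hits seen with en_core_web_sm presumably come from its own ner component rather than from the ruler.

```python
import spacy
from spacy.pipeline import EntityRuler

# Assumption: a blank pipeline, so no statistical NER can contribute entities.
nlp = spacy.blank("en")

# Standalone ruler, constructed the same way as in the original report.
# It holds the patterns but is never registered with nlp.
standalone = EntityRuler(
    nlp, validate=True, overwrite_ents=True, phrase_matcher_attr="LOWER"
)
standalone.add_patterns([{"label": "PERSON", "pattern": "Entityaus Smith"}])

text = "I like Entityaus Smith she is great"

# Is any entity ruler actually part of the pipeline?
in_pipeline = "entity_ruler" in nlp.pipe_names

# Entities produced by running the pipeline itself.
ents_from_pipeline = [ent.text for ent in nlp(text).ents]

# Entities produced by calling the standalone ruler on a tokenized doc.
ents_from_direct_call = [ent.text for ent in standalone(nlp.make_doc(text)).ents]

print(nlp.pipe_names, ents_from_pipeline, ents_from_direct_call)
```

The ruler only matches when it is called directly, which is consistent with it not being part of the pipeline.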
Hi @chillwinston, let me address the EntityRuler issue first. You are assigning your new component to temp_ruler, but then proceed to create a different EntityRuler, assign it to ruler, and add your patterns to the latter. Since ruler is not part of your pipeline, the search pattern is not applied. One way to do this properly is to pass your config options as config to nlp.add_pipe(). Below is a corrected example. If you run it, you'll see that all listed examples are processed properly.
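A minimal sketch of such a setup, assuming spaCy v3: the options are passed through config, and the patterns are added to the component that nlp.add_pipe() returns. This is an illustration rather than the exact code from the reply. It uses spacy.blank("en") so it runs without downloading a model (an assumption; with a trained model you would call spacy.load("en_core_web_sm") instead), and find_entities is a hypothetical helper added only for the demonstration.

```python
import spacy

# Assumption: a blank English pipeline keeps the sketch self-contained.
nlp = spacy.blank("en")

# Pass the options via `config`; the object returned by add_pipe() is the
# component that actually runs inside the pipeline.
ruler = nlp.add_pipe(
    "entity_ruler",
    name="temp_ruler",
    config={
        "validate": True,
        "overwrite_ents": True,
        "phrase_matcher_attr": "LOWER",  # match case-insensitively
    },
)
ruler.add_patterns([{"label": "PERSON", "pattern": "Entityaus Smith"}])


def find_entities(text):
    """Hypothetical helper: return (text, label) pairs for entities in `text`."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


# Two of the inputs listed as FAILING in the original report.
print(find_entities("I like Entityaus Smith she is great"))
print(find_entities("I like entityaus smith she is amazing"))
```

Because the patterns now live on the component registered under temp_ruler, every call to nlp() applies them, including the lower-cased variants.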