Bug: EntityRuler is failing to pick up entities #11600

chillwinston · 2022-10-09T23:42:08Z

chillwinston
Oct 9, 2022

How to reproduce the behaviour

It appears that the language model is somehow getting in the way of entity recognition. Different words and poor language structure can prevent entities from being recognised.

Can someone please point me to the spaCy test cases and test framework? It seems unusual that such a widely used and large Python module fails on what looks like an obvious test case.

Is the entity ruler tested by loading large numbers of entities and confirming that all of them are found in a variety of text styles and structures?
What is Spacy's approach to DevOps build and automated test prior to release?

This is for a production system that cannot "miss entities" in this way.

#nlp = spacy.load('en_core_web_sm')
temp_ruler = nlp.add_pipe("entity_ruler", name="temp_ruler", validate=True)
ruler = EntityRuler(nlp, validate=True, overwrite_ents=True, phrase_matcher_attr="LOWER")
#globals()["watchlist_ruler_"+str(watchindexx)].name = 'watchlistRuler_'+watchindexx
patterns = [{'label': 'PERSON', 'pattern': 'Entityaus Smith'}]
ruler.add_patterns(patterns)

#  WORKING ==============
# target_text = '|<p data-key="1p">| I admire  |<strong data-key="2strong">| Entityaus Smith |</strong>|  she is great. |</p>|'
# target_text = '|<p data-key="1p">| I admire  Entityaus Smith she is great. |</p>|'
# target_text = 'I admire Entityaus Smith she is great'
# target_text = 'I admire Entityaus Smith she is amazing.'
# target_text = 'I like Entityaus Smith, she is amazing'
# target_text = 'I like Entityaus Smith and Bob Smith they are amazing'
# target_text = 'I think Entityaus Smith is amazing'
# target_text = 'This sentence is Entityaus Smith they are amazing'
# target_text = 'I like Entityaus Smith they are amazing'
# target_text = 'I like Entityaus Smith are amazing'
# target_text = 'I like Entityaus Smith. She is amazing'
# target_text = 'I like Entityaus Smith. she is amazing'

# FAILING =============
# target_text = 'I like Entityaus Smith she is amazing'
# target_text = 'I admire entityaus smith she is great'
# target_text = 'I admire entityaus smith she is amazing'
# target_text = 'I like Entityaus Smith she is amazing.'
# target_text = 'I like entityaus smith she is amazing'
# target_text = '<p>I like Entityaus Smith she is amazing.</p>'

target_text = 'I like Entityaus Smith she is great'
# target_text = 'I admire Entityaus Smith she is great'   # WORKING - one word difference
processed_field = nlp(target_text)
# Should have found Entityaus Smith - but doesn't
print(str(processed_field.ents))

Your Environment

Operating System: Windows 10, ~12G RAM
Python Version Used: User Current Version:- 3.8.13 (default, Mar 17 2022, 10:20:15) [MSC v.1900 64 bit (AMD64)]
spaCy Version Used: Version: 3.4.1 - pip show spacy
Environment Information:

Answered by rmitsch

Oct 10, 2022

rmitsch · 2022-10-10T07:25:47Z

rmitsch
Oct 10, 2022
Maintainer

Hi @chillwinston, let me first address your statement that "Different words and poor language structure can prevent entities from being recognised."

You are assigning your new component to temp_ruler, but you then create a different EntityRuler, assign it to ruler, and add your patterns to the latter. ruler is not part of your pipeline, so its patterns are never applied. One way to do this properly is to pass your options as config to nlp.add_pipe(). Below is a corrected example. If you run it, you'll see that all listed examples are processed properly.

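import spacy

nlp = spacy.load('en_core_web_sm')
# add_pipe() creates the EntityRuler, registers it in the pipeline and returns it,
# so the patterns added to `ruler` below are applied whenever nlp() is called.
ruler = nlp.add_pipe(
    "entity_ruler",
    name="temp_ruler",
    validate=True,
    config={"validate": True, "overwrite_ents": True, "phrase_matcher_attr": "LOWER"}
)
patterns = [{'label': 'PERSON', 'pattern': 'Entityaus Smith'}]
ruler.add_patterns(patterns)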

for target_text in [
    '|<p data-key="1p">| I admire  |<strong data-key="2strong">| Entityaus Smith |</strong>|  she is great. |</p>|',
    '|<p data-key="1p">| I admire  Entityaus Smith she is great. |</p>|',
    'I admire Entityaus Smith she is great',
    'I admire Entityaus Smith she is amazing.',
    'I like Entityaus Smith, she is amazing',
    'I like Entityaus Smith and Bob Smith they are amazing',
    'I think Entityaus Smith is amazing',
    'This sentence is Entityaus Smith they are amazing',
    'I like Entityaus Smith they are amazing',
    'I like Entityaus Smith are amazing',
    'I like Entityaus Smith. She is amazing',
    'I like Entityaus Smith. she is amazing',
    # Previously failing cases from here.
    'I like Entityaus Smith she is amazing',
    'I admire entityaus smith she is great',
    'I admire entityaus smith she is amazing',
    'I like Entityaus Smith she is amazing.',
    'I like entityaus smith she is amazing',
    '<p>I like Entityaus Smith she is amazing.</p>',
]:
    assert "entityaus smith" in [str(ent).lower() for ent in nlp(target_text).ents]
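
To make the difference between the two setups concrete, here is a minimal sketch, assuming spaCy 3.x. It swaps en_core_web_sm for spacy.blank("en") (an assumption made for illustration, so that no statistical ner component can contribute entities), and the variable names are illustrative only. With a blank pipeline, the detached ruler finds nothing and the registered one finds the name. This also suggests why some of the original examples seemed to work: en_core_web_sm ships with a statistical ner component, which probably tagged the name in those sentences on its own.

```python
import spacy
from spacy.pipeline import EntityRuler

patterns = [{"label": "PERSON", "pattern": "Entityaus Smith"}]
text = "I like entityaus smith she is amazing"

# Detached ruler: constructed directly, never registered with the pipeline.
nlp_detached = spacy.blank("en")
detached = EntityRuler(nlp_detached, phrase_matcher_attr="LOWER", overwrite_ents=True)
detached.add_patterns(patterns)
detached_names = list(nlp_detached.pipe_names)
detached_ents = [ent.text for ent in nlp_detached(text).ents]

# Registered ruler: created by add_pipe(), so nlp() calls it on every document.
nlp_attached = spacy.blank("en")
attached = nlp_attached.add_pipe(
    "entity_ruler",
    config={"phrase_matcher_attr": "LOWER", "overwrite_ents": True},
)
attached.add_patterns(patterns)
attached_names = list(nlp_attached.pipe_names)
attached_ents = [ent.text for ent in nlp_attached(text).ents]

print(detached_names, detached_ents)
print(attached_names, attached_ents)
```

Checking nlp.pipe_names after setup is a quick way to confirm that the component you added patterns to is the one that will actually run.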

Is the entity ruler tested by loading large numbers of entities and confirming that all of them are found in a variety of text styles and structures?

You can check out our EntityRuler tests here.
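
Independently of spaCy's own test suite, a watchlist-style regression check for your own patterns could look roughly like the sketch below. This is not taken from the spaCy tests. The helper names build_watchlist_nlp and missed_names, the extra names and the sentence templates are all hypothetical, and spacy.blank("en") is used so that only the ruler can produce entities.

```python
import spacy


def build_watchlist_nlp(names, label="PERSON"):
    """Hypothetical helper: blank English pipeline with a case-insensitive entity ruler."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
    ruler.add_patterns([{"label": label, "pattern": name} for name in names])
    return nlp


def missed_names(nlp, names, templates):
    """Return (variant, template) pairs for which the name was not found as an entity."""
    missed = []
    for name in names:
        for template in templates:
            # Try the original casing plus all-lowercase and all-uppercase variants.
            for variant in (name, name.lower(), name.upper()):
                doc = nlp(template.format(name=variant))
                found = {ent.text.lower() for ent in doc.ents}
                if name.lower() not in found:
                    missed.append((variant, template))
    return missed


# Illustrative data only; a real watchlist would be loaded from your own source.
names = ["Entityaus Smith", "Bob Smith", "Maria Example"]
templates = [
    "I like {name} she is amazing",
    "<p>I like {name} she is amazing.</p>",
    "I admire {name}, she is great",
]

nlp = build_watchlist_nlp(names)
misses = missed_names(nlp, names, templates)
print(misses)
```

Running a check like this in your own CI would flag any name and template combination that the ruler fails to match before it reaches production.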

What is Spacy's approach to DevOps build and automated test prior to release?

We do have a CI pipeline on GitHub (and various other test pipelines internally) ensuring that all of our tests pass. Feel free to check out the CI config.

chillwinston · 2022-10-12T02:34:33Z

chillwinston
Oct 12, 2022
Author

That's excellent, many thanks Raphael, an excellent answer! It's still a little odd that it partly works with the older style; I would prefer spaCy to fail completely rather than partially work. However, I do agree that I wasn't using the API correctly.
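
For anyone who wants a hard failure in their own code, one possible guard is sketched below, assuming spaCy 3.x. The helper assert_in_pipeline is hypothetical and not part of spaCy. It relies only on nlp.pipeline, which is a list of (name, component) tuples.

```python
import spacy
from spacy.pipeline import EntityRuler


def assert_in_pipeline(nlp, component):
    """Hypothetical guard: raise if `component` is not one of the objects nlp() will call."""
    if not any(proc is component for _, proc in nlp.pipeline):
        raise ValueError(
            "component is not registered in nlp.pipeline; create it with nlp.add_pipe()"
        )


nlp = spacy.blank("en")
registered = nlp.add_pipe("entity_ruler", name="temp_ruler")
detached = EntityRuler(nlp)  # constructed directly, so never added to the pipeline

assert_in_pipeline(nlp, registered)  # passes silently

try:
    assert_in_pipeline(nlp, detached)
    detached_rejected = False
except ValueError:
    detached_rejected = True

print(detached_rejected)
```

Calling such a guard right before add_patterns() turns the silent partial behaviour into an explicit error.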
