Entity ruler doesn't catch multi-token entities #12271

SjoerdBraaksma · 2023-02-10T13:42:07Z

Hello!

First off: Thanks for the great package!
However, I encountered some unexpected behavior while implementing my custom entity ruler.
The entity ruler is meant to recognize Dutch first names and last names ("achternaam").
The code below doesn't catch last names consisting of multiple tokens, even though the regular NER component does recognize them.
It does work with single-token last names.

What I already checked:

There are no other entities overwriting the names ruler.
When adding the NER pipeline in the last part, it sees "Andre van walderveen" correctly as a single person.
When changing the name to Lepelaar, it tags correctly.

import spacy as sp

# Load base model
nlp = sp.load("nl_core_news_lg", exclude=["tok2vec", "morphologizer", "tagger", "parser", "lemmatizer", "attribute_ruler"])

# Set split on "-"
infixes = nlp.Defaults.infixes + [r'-'] + [r'\('] + [r'\)']
infix_regex = sp.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer


# Get available pipelines
nlp.pipe_names

config_names = {
    "validate": False,
    "overwrite_ents": True,
    "ent_id_sep": "||",
}

ruler_names = nlp.add_pipe(
    "entity_ruler", "names_ruler", before="ner", config=config_names
)

with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van Walderveen"}]}
        ]
    )
# OR
with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van"}, {"LOWER": "Walderveen"}]}
        ]
    )

with nlp.select_pipes(enable=["sentencizer", "names_ruler"]):
    doc1 = nlp("Ik ben Andre van Walderveen")

for ents in doc1.ents:
    print(ents, ents.label_)

Answered by rmitsch

Feb 10, 2023

SjoerdBraaksma (Author) · 2023-02-10T14:25:21Z

That was probably caused by some stale variables floating around (no idea how, since I restarted the kernel multiple times). The version of the code in which each word gets its own token pattern now works after restarting VS Code completely.

rmitsch · 2023-02-10T14:26:23Z

Hi @SjoerdBraaksma, two issues with this:

Patterns specified for LOWER should be in lowercase. I'm surprised that "Lepelaar" is recognized for you, because it doesn't work for me when running your code (and I wouldn't expect it to).
As you already noticed, the problem is related to "van Walderveen" being more than one token. A multi-token entity has to be specified as one dict per token rather than as a single string containing a space.
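
To see why both of the original patterns miss, it can help to print what the LOWER attribute holds for each token. The snippet below is only a minimal sketch: it assumes a blank Dutch pipeline (`spacy.blank("nl")`, tokenizer only) instead of `nl_core_news_lg` with the custom infixes, and the variable name `rows` is just for illustration.

```python
import spacy

# Assumption: a blank Dutch pipeline is enough to inspect tokenization;
# the thread itself uses nl_core_news_lg with extra infix rules.
nlp = spacy.blank("nl")
doc = nlp("Ik ben Andre van Walderveen")

# LOWER is compared against token.lower_, one token at a time.
rows = [(token.text, token.lower_) for token in doc]
for text, lower in rows:
    print(f"{text!r:15} LOWER = {lower!r}")
```

Every LOWER value is lowercase and belongs to a single token, so a pattern value with a capital letter ("Lepelaar") or with a space ("van Walderveen") has nothing to match against.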

If you make the following adjustments, this will work:

    ...
    with nlp.select_pipes(enable="ner"):
        ruler_names.add_patterns(
            [
                {"label": "ACHTERNAAM", "pattern": [{"LOWER": "lepelaar"}]},
                {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van"}, {"LOWER": "walderveen"}]}
            ]
        )
    ...
    # Test:
    with nlp.select_pipes(enable=["sentencizer", "names_ruler"]):
        doc1 = nlp("Ik ben Andre Lepelaar")
        doc2 = nlp("Ik ben Andre van Walderveen")

    for ents in doc1.ents:
        print(ents, ents.label_)
    print("---")
    for ents in doc2.ents:
        print(ents, ents.label_)
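
As a possible alternative to writing one dict per token, the entity ruler also accepts plain strings as phrase patterns, and its `phrase_matcher_attr` setting can be set to `"LOWER"` so that matching ignores case. The sketch below is not part of the original code: it assumes spaCy v3 and a blank Dutch pipeline rather than `nl_core_news_lg`, and the `results` dict is only there for illustration.

```python
import spacy

# Assumption: spaCy v3 and a blank Dutch pipeline (tokenizer only).
nlp = spacy.blank("nl")
ruler = nlp.add_pipe(
    "entity_ruler",
    name="names_ruler",
    config={"phrase_matcher_attr": "LOWER"},  # compare phrases on lowercase form
)

# String patterns are tokenized by the pipeline itself, so a name may span
# several tokens and can be written with its usual capitalization.
ruler.add_patterns(
    [
        {"label": "ACHTERNAAM", "pattern": "Lepelaar"},
        {"label": "ACHTERNAAM", "pattern": "van Walderveen"},
    ]
)

results = {}
for text in ("Ik ben Andre Lepelaar", "Ik ben Andre van Walderveen"):
    doc = nlp(text)
    results[text] = [(ent.text, ent.label_) for ent in doc.ents]
    print(text, "->", results[text])
```

This keeps a long list of names readable, at the cost of the per-token control that dict patterns offer.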

rmitsch · 2023-02-10T14:27:28Z

Converting this to a discussion, as this is a usage question and not an issue with spaCy.
