Entity ruler doesn't catch multi-token entities #12270

@SjoerdBraaksma

Hello!

First off: thanks for the great package!
However, I encountered some unexpected behavior while implementing my custom entity ruler.
The entity ruler is meant to recognize Dutch first names and last names ("achternaam").
The code below doesn't catch last names consisting of multiple tokens, although the regular NER component does recognize them.
The same code does work with single-token last names.

What I already checked:

  1. There are no other entities overwriting the names ruler.
  2. When adding the NER pipeline in the last part, it correctly recognizes "Andre van Walderveen" as a single person.
  3. When changing the name to "Lepelaar", it tags correctly.

```python
import spacy as sp

# Load base model
nlp = sp.load(
    "nl_core_news_lg",
    exclude=["tok2vec", "morphologizer", "tagger", "parser", "lemmatizer", "attribute_ruler"],
)

# Also split tokens on "-", "(" and ")"
infixes = nlp.Defaults.infixes + [r"-", r"\(", r"\)"]
infix_regex = sp.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# Show the available pipeline components
print(nlp.pipe_names)

config_names = {
    "validate": False,
    "overwrite_ents": True,
    "ent_id_sep": "||",
}

ruler_names = nlp.add_pipe(
    "entity_ruler", "names_ruler", before="ner", config=config_names
)

# Variant 1: the multi-token last name as a single token pattern
with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van Walderveen"}]},
        ]
    )
# OR
# Variant 2: the multi-token last name as one token pattern per word
with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van"}, {"LOWER": "Walderveen"}]},
        ]
    )

# Run only the ruler on a test sentence
with nlp.select_pipes(enable=["sentencizer", "names_ruler"]):
    doc1 = nlp("Ik ben Andre van Walderveen")

# Print the entities the ruler found
for ents in doc1.ents:
    print(ents, ents.label_)
```
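
A possible explanation, sketched in plain Python rather than with spaCy itself: as far as I understand token patterns, each dict in a pattern describes exactly one token, and `LOWER` is compared against the lowercased token text. The function `match_pattern` and the whitespace tokenization below are hypothetical stand-ins for illustration only, not spaCy's implementation.

```python
# Simplified model of token-pattern matching (NOT spaCy's implementation).
# Assumptions: one pattern dict per token; "LOWER" is compared with the
# lowercased token text; tokens come from a plain whitespace split.

def match_pattern(tokens, pattern):
    """Return (start, end) token spans where every pattern dict matches."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok.lower() == spec["LOWER"] for tok, spec in zip(window, pattern)):
            spans.append((start, start + width))
    return spans


tokens = "Ik ben Andre van Walderveen".split()

# One dict whose value contains a space: no single token can equal it.
single_dict = match_pattern(tokens, [{"LOWER": "van Walderveen"}])

# Two dicts, but a capitalised value: "walderveen" != "Walderveen".
capitalised = match_pattern(tokens, [{"LOWER": "van"}, {"LOWER": "Walderveen"}])

# Two dicts with lowercase values: both tokens match.
lowercase = match_pattern(tokens, [{"LOWER": "van"}, {"LOWER": "walderveen"}])

print(single_dict, capitalised, lowercase)
```

If this model is accurate, neither variant in my code could match the two-token name, and a pattern such as `[{"LOWER": "van"}, {"LOWER": "walderveen"}]` would be the one to try. I have not confirmed that this is the cause.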