Closed
Labels
feat / spanruler (Feature: Entity and span ruler)
Description
Hello!
First off: Thanks for the great package!
However, I encountered some unexpected behavior while implementing my custom entity ruler.
The entity ruler is meant to recognize Dutch first names and last names ("achternaam").
The code below doesn't catch last names consisting of multiple tokens, even though the regular NER component does recognize them.
The same code does work with single-token last names.
What I already checked:
- There are no other entities overwriting the names ruler.
- When the NER component is enabled in the last part, it correctly recognizes "Andre van Walderveen" as a single person.
- When changing the name to Lepelaar, it tags correctly.
import spacy as sp

# Load base model
nlp = sp.load("nl_core_news_lg", exclude=["tok2vec", "morphologizer", "tagger", "parser", "lemmatizer", "attribute_ruler"])

# Also split tokens on "-", "(" and ")"
infixes = nlp.Defaults.infixes + [r'-'] + [r'\('] + [r'\)']
infix_regex = sp.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# Get available pipelines
nlp.pipe_names

config_names = {
    "validate": False,
    "overwrite_ents": True,
    "ent_id_sep": "||",
}
ruler_names = nlp.add_pipe(
    "entity_ruler", "names_ruler", before="ner", config=config_names
)

# Variant 1: the multi-token last name as a single token pattern
with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van Walderveen"}]}
        ]
    )

# OR
# Variant 2: one token pattern per word of the last name
with nlp.select_pipes(enable="ner"):
    ruler_names.add_patterns(
        [
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "Lepelaar"}]},
            {"label": "ACHTERNAAM", "pattern": [{"LOWER": "van"}, {"LOWER": "Walderveen"}]}
        ]
    )

# Run only the ruler and print the entities it finds
with nlp.select_pipes(enable=["sentencizer", "names_ruler"]):
    doc1 = nlp("Ik ben Andre van Walderveen")
    for ents in doc1.ents:
        print(ents, ents.label_)
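
A possible explanation for the behavior above, sketched in plain Python rather than spaCy: in token-based patterns each dict describes exactly one token, and `LOWER` is compared against the lowercased token text. The `tokenize` and `find_matches` helpers below are hypothetical and only a toy model of that comparison (whitespace tokenization is an assumption); they are not spaCy's implementation.

```python
# Toy model of token-based pattern matching; NOT spaCy's implementation.

def tokenize(text):
    # Assumption: whitespace splitting is close enough for this sentence.
    return text.split()


def find_matches(tokens, pattern):
    """Return (start, end) spans where each pattern dict matches one token."""
    spans = []
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # "LOWER" compares the pattern value with the lowercased token text.
        if all(tok.lower() == spec["LOWER"] for tok, spec in zip(window, pattern)):
            spans.append((start, start + n))
    return spans


tokens = tokenize("Ik ben Andre van Walderveen")

# One dict holding two words: no single token ever equals "van Walderveen".
single_dict = [{"LOWER": "van Walderveen"}]
# One dict per token, but a capitalized value never equals a lowercased token.
mixed_case = [{"LOWER": "van"}, {"LOWER": "Walderveen"}]
# One dict per token, all values lowercase.
lowercase = [{"LOWER": "van"}, {"LOWER": "walderveen"}]

results = {
    "single_dict": find_matches(tokens, single_dict),
    "mixed_case": find_matches(tokens, mixed_case),
    "lowercase": find_matches(tokens, lowercase),
}
print(results)
```

In this toy model only the all-lowercase, one-dict-per-token pattern matches. If the same reasoning applies to the entity ruler, the corresponding pattern would be `[{"LOWER": "van"}, {"LOWER": "walderveen"}]`.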