German Entity Recognition incorrect start_char/end_char #12488

@lsmith77

Description

How to reproduce the behaviour

import spacy

model = spacy.load("de_core_news_lg")
# "Freundliche Grüße Nadia" (start_char: 0, end_char: 22, label_: "PER")
tokens = model("Freundliche Grüße Nadia")

# "Freundliche Grüße" (start_char: 0, end_char: 16, label_: "PER")
# "Nadia" (start_char: 32, end_char: 36, label_: "MISC")
tokens = model("Freundliche Grüße meine liebste Nadia")

# "Hallo Herr Müller" (start_char: 0, end_char: 16, label_: "PER")
tokens = model("Hallo Herr Müller.")

# no entities recognized
tokens = model("Hallo Herr Müller")

In the comments above we are examining tokens.ents.
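
To cross-check offsets like the ones listed above, here is a minimal pure-Python sketch. It assumes spaCy's convention that start_char and end_char are character offsets into the document text with end_char exclusive, so that text[start_char:end_char] yields the entity text. The helper name expected_span is hypothetical and not part of spaCy.

```python
def expected_span(text: str, phrase: str) -> tuple[int, int]:
    """Return (start_char, end_char) of the first occurrence of phrase in text.

    end_char is exclusive, so text[start_char:end_char] == phrase.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} does not occur in {text!r}")
    return start, start + len(phrase)


text = "Freundliche Grüße meine liebste Nadia"
start, end = expected_span(text, "Nadia")
# Slicing with the returned offsets recovers the phrase itself.
print(start, end, repr(text[start:end]))
```

The returned pair can be compared with (ent.start_char, ent.end_char) for each ent in tokens.ents to see whether a reported span covers only the name or also the surrounding greeting.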

Note that for English we were able to reproduce the documented behavior: the model accurately identified the entity.

That said, in English we noticed some issues with typos; using en_core_web_trf solved those.

Unfortunately, de_dep_news_trf does not support entities.

Your Environment

  • Operating System: OSX
  • Python Version Used: 3.11
  • spaCy Version Used: 3.5.1 (de-core-news-lg)
  • Environment Information:

Labels: feat / ner (Feature: Named Entity Recognizer), lang / de (German language data and models), perf / accuracy (Performance: accuracy)