Skipping entity ... in the following text because the character span does not align with token boundaries #10318

@dmoti

I'm using spaCy v3.2.1 with Python 3.8.8 on Ubuntu 20.04.
I see these warnings when I start training NER. The problem is with punctuation marks at the beginning or end of an entity, specifically an entity that begins with a dash or ends with a period. I've plugged my own tokenizer into the training, and I know that my tokenizer separates the punctuation from the entity in both cases.

I'm using the trankit tokenizer. An example:

import trankit

trankit_nlp = trankit.Pipeline("hebrew", gpu=False)
sentence = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'
# this is how I use the tokenizer:
ht_tokens = trankit_nlp.tokenize(sentence)
tokens = []
for x in ht_tokens["sentences"][0]["tokens"]:
    if "expanded" in x:
        # multi-word token: use its expanded sub-tokens
        tokens += [xt["text"] for xt in x["expanded"]]
    else:
        tokens.append(x["text"])

print(tokens)

This is the result:
['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(', 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']
Sorry about the reversed bracket; it's because Hebrew is written right to left.
The warning I get from training is about the dash before the 200. As you can see, the tokenizer separates the dash. I also see the same behavior when a period is attached to the entity at the end of a sentence. The character indices of the entity are correct.
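
For anyone trying to reproduce this, here is a minimal sketch in plain Python (no spaCy or trankit calls) of what "aligning with token boundaries" means for the sentence above. The helpers `token_offsets` and `span_is_aligned` are hypothetical names made up for this illustration, and the entity span ("200") is assumed, since the exact annotation is not given above. The whitespace-split comparison only shows one way such a warning could arise, namely if the text were tokenized differently at some step; it is not a confirmed diagnosis.

```python
def token_offsets(text, tokens):
    """Locate each token in text, left to right, and return (start, end) pairs."""
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def span_is_aligned(offsets, start, end):
    """True if start is some token's start and end is some token's end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


sentence = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'

# Token list copied from the output above.
custom_tokens = ['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(',
                 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']

# Assumed entity for illustration: the number right after the dash.
ent_start = sentence.index("200")
ent_end = ent_start + len("200")

# With the custom tokens, "200" is its own token, so the span aligns.
aligned_custom = span_is_aligned(token_offsets(sentence, custom_tokens), ent_start, ent_end)

# With a plain whitespace split, the prefix, dash, and number form one token,
# so the entity start falls inside a token.
whitespace_tokens = sentence.split()
aligned_whitespace = span_is_aligned(token_offsets(sentence, whitespace_tokens), ent_start, ent_end)

print(aligned_custom, aligned_whitespace)
```

If the second check matches what happens in the training pipeline, it would suggest that the annotations are being aligned against a differently tokenized version of the text than the one printed above.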
