Skipping entity ... in the following text because the character span does not align with token boundaries

I'm using spaCy v3.2.1 with Python 3.8.8 on Ubuntu 20.04.
I see this warning when I start training NER. The problem is with punctuation marks at the beginning or end of an entity, specifically an entity that begins with a dash or ends with a period. I've plugged my own tokenizer into the training, and I know that my tokenizer separates the punctuation from the entity in both cases.

I'm using the trankit tokenizer. An example:
```
import trankit

# Load the Hebrew pipeline on CPU.
trankit_nlp = trankit.Pipeline("hebrew", gpu=False)
sentence = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'

# This is how I use the tokenizer:
ht_tokens = trankit_nlp.tokenize(sentence)
tokens = []
for x in ht_tokens["sentences"][0]["tokens"]:
    # Multi-word tokens carry their parts under "expanded"; use those instead.
    if "expanded" in x:
        tokens += [xt["text"] for xt in x["expanded"]]
    else:
        tokens.append(x["text"])

print(tokens)
```
This is the result:
['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(', 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']
Sorry about the reversed brackets; that is because Hebrew is rendered right-to-left.
The warning I get during training concerns the dash before the 200. As you can see, the tokenizer separates the dash. I see the same behavior when a period is attached to an entity at the end of a sentence. The entity indices are correct.
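
To narrow down where the mismatch comes from, it may help to check the entity offsets against the token boundaries directly. The following is a minimal sketch in plain Python, not spaCy's own logic: the helper names `token_offsets` and `misaligned_spans`, the English sample sentence, and the `QUANTITY` label are all hypothetical, and it assumes every token appears verbatim in the text (which may not hold for all expanded multi-word tokens).

```python
def token_offsets(text, tokens):
    """Map each token to its (start, end) character offsets in text."""
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def misaligned_spans(text, tokens, spans):
    """Return the (start, end, label) spans that do not begin and end on token boundaries."""
    offsets = token_offsets(text, tokens)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return [(s, e, label) for s, e, label in spans if s not in starts or e not in ends]


# Hypothetical English stand-in for the Hebrew sentence above.
text = "Triglycerides above -200 mg."
spans = [(21, 27, "QUANTITY")]  # covers "200 mg"

# Tokenization that splits the dash and the period from the entity.
split_tokens = ["Triglycerides", "above", "-", "200", "mg", "."]
# Tokenization that leaves the punctuation attached.
merged_tokens = ["Triglycerides", "above", "-200", "mg."]

print(misaligned_spans(text, split_tokens, spans))   # []
print(misaligned_spans(text, merged_tokens, spans))  # [(21, 27, 'QUANTITY')]
```

If the spans align with the custom tokenizer's output but the warning still appears, one possibility worth checking is that a different tokenizer was applied to the text at the point where the warning is raised. In spaCy itself, `Doc.char_span` with its default strict alignment returns `None` for a span that does not fall on token boundaries, which gives a similar check against the tokens spaCy actually produced.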


Skipping entity ... in the following text because the character span does not align with token boundaries #10318
