Skipping entity ... in the following text because the character span does not align with token boundaries #10331
Replies: 3 comments 4 replies
-
So just to be clear, the error you're getting normally means that you have an entity that looks like this, assuming entity boundaries are indicated by brackets:
Because the entity boundary comes in the middle of a token, and every token needs exactly one entity label, a label covering half a token isn't usable. You say that you're using a custom tokenizer, but it's not clear how your sample code is integrated with spaCy. Can you share your config and how you customized the tokenizer? Can you give some example data? It isn't clear to me from your description of your custom tokenizer or your example sentence where the problem would be, partly because you don't provide any entity data. I understand you may be unable to share your data, but if you do have a public repo with a small version of the problem we would be glad to take a look at it.
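To make the alignment requirement concrete, here is a minimal sketch in plain Python that checks whether an entity's character span starts and ends on token boundaries. The sentence, the spans, and the helper names `token_offsets` and `is_aligned` are made up for illustration; this is not spaCy's implementation, only the same idea. In spaCy itself, `Doc.char_span` returns `None` under its default strict alignment when a span fails this kind of check.

```python
from typing import List, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the text, scanning left to right."""
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.index(tok, cursor)  # raises ValueError if the token is missing
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def is_aligned(span: Tuple[int, int], offsets: List[Tuple[int, int]]) -> bool:
    """True if the span starts at a token start and ends at a token end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return span[0] in starts and span[1] in ends


# Hypothetical example: the tokenizer keeps "200mg" as a single token.
text = "The dose is 200mg."
tokens = ["The", "dose", "is", "200mg", "."]
offsets = token_offsets(text, tokens)

good_span = (12, 17)  # covers "200mg", a whole token
bad_span = (12, 15)   # covers "200", which ends inside the token "200mg"

print(is_aligned(good_span, offsets))
print(is_aligned(bad_span, offsets))
```

Running a check like this over the training annotations before `convert` can help pinpoint which entities would be skipped and where the boundary falls relative to the tokens.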
-
I tried to replace the default spaCy tokenizer with my own (using trankit) and I'm still getting errors in convert. For example, this is my sentence (note that the period appears on the right, but it is actually on the left):
This is what the tokenization looks like:
When I run convert I get the following error: it misses the last letter of the entity, even though the letter exists in the token and the span is correct.
-
I prepared a package with example code and included a requirements.txt for the virtual env.
You can see that doc.text has 3 added spaces: one at position 4, and one each before and after the last period.
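Extra spaces in `doc.text` would be consistent with a custom tokenizer that builds the `Doc` from words alone: spaCy's `Doc` constructor treats every word as followed by a space unless a `spaces` list is passed, so the reconstructed text can be longer than the original and the entity offsets no longer match. The sketch below, in plain Python, shows one way to derive those flags from the original text. The sentence and the helper names `spaces_from_text` and `rebuild` are invented for illustration, and the sketch assumes tokens are separated by at most a single space.

```python
def spaces_from_text(text, words):
    """For each word, report whether whitespace follows it in the original text."""
    spaces = []
    cursor = 0
    for word in words:
        start = text.index(word, cursor)  # raises ValueError if the word is missing
        cursor = start + len(word)
        spaces.append(cursor < len(text) and text[cursor].isspace())
    return spaces


def rebuild(words, spaces):
    """Rebuild the text from words plus trailing-space flags."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))


# Hypothetical example: the tokenizer splits off the dash and the final period.
text = "up to -200 mg."
words = ["up", "to", "-", "200", "mg", "."]

naive = " ".join(words)                 # a space after every token
spaces = spaces_from_text(text, words)  # flags taken from the original text
exact = rebuild(words, spaces)

print(repr(naive))
print(repr(exact))
print(text.index("200"), naive.index("200"))  # the offset of "200" shifts in the naive text
```

With flags like these, the custom tokenizer could return `Doc(vocab, words=words, spaces=spaces)`, so that `doc.text` matches the text the entity offsets were annotated against.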
-
I'm using spaCy v3.2.1 with Python 3.8.8 on Ubuntu 20.04.
I see these warnings when I start training NER. The problem is with punctuation marks at the beginning or end of an entity, specifically an entity that begins with a dash or ends with a period. I've plugged my own tokenizer into the training, and I know that my tokenizer separates the punctuation from the entity in both cases.
I'm using the trankit tokenizer. An example:
This is the result:
['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(', 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']
Sorry about the reversed bracket; it's because Hebrew is written right-to-left.
The warning I get from training is about the dash before the 200. As you can see, the tokenizer separates the dash. I see the same behavior when a period is attached to an entity at the end of a sentence. The indices of the entity are correct.