Skipping entity ... in the following text because the character span does not align with token boundaries #10331

dmoti · 2022-02-17T09:27:51Z

I'm using spaCy v3.2.1 with Python 3.8.8 on Ubuntu 20.04.
I see these warnings when I start training NER. The problem is with punctuation marks at the beginning or end of an entity, specifically an entity that begins with a dash or ends with a period. I've plugged my own tokenizer into the training, and I know that my tokenizer separates the punctuation from the entity in both cases.

I'm using the trankit tokenizer. An example:

import trankit

trankit_nlp = trankit.Pipeline("hebrew", gpu=False)
sentence = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'
# This is how I use the tokenizer:
ht_tokens = trankit_nlp.tokenize(sentence)
tokens = []
for x in ht_tokens["sentences"][0]["tokens"]:
    # Tokens that trankit splits further list their pieces under "expanded"
    if "expanded" in x:
        tokens += [xt["text"] for xt in x["expanded"]]
    else:
        tokens.append(x["text"])

print(tokens)

This is the result:
['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(', 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']
Sorry about the reversed bracket; it's because Hebrew is written right to left.
The warning I get from training is about the dash before the 200. As you can see, the tokenizer separates the dash. I see the same behavior when a period is attached to an entity at the end of a sentence. The indices of the entity are correct.

polm · 2022-02-18T06:21:39Z

So just to be clear, the error you're getting normally means that you have an entity that looks like this, assuming entity boundaries are indicated by brackets:

[Sall]y walked down the street.

Because the entity boundary comes in the middle of a token, and every token needs exactly one entity label, a label covering half a token isn't usable.
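
As a rough illustration of the alignment rule described above, here is a minimal pure-Python sketch. The regex tokenizer and the `span_aligns` helper are hypothetical stand-ins, not spaCy's implementation; in spaCy itself, `doc.char_span(start, end)` returns `None` for a span that fails this kind of check.

```python
import re


def token_offsets(text):
    """Return (start, end) character offsets for a simple word/punctuation tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_aligns(text, start, end):
    """True if start is the start of some token and end is the end of some token."""
    offsets = token_offsets(text)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start < end and start in starts and end in ends


text = "Sally walked down the street."
print(span_aligns(text, 0, 5))  # span "Sally" covers a whole token
print(span_aligns(text, 0, 4))  # span "Sall" ends in the middle of a token
```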

You say that you're using a custom tokenizer, but it's not clear how your sample code is integrated with spaCy. Can you share your config and how you customized the tokenizer? Can you give some example data? It isn't clear to me from your description of your custom tokenizer or your example sentence where the problem would be, partly because you don't provide any entity data.

I understand you may be unable to share your data, but if you do have a public repo with a small version of the problem we would be glad to take a look at it.

2 replies

dmoti Feb 20, 2022
Author

The problem starts before training. I'm preparing the data in JSON files and using the convert.py utility I got from the NER project template, and the warning comes from this file. I don't see an option to specify or change the tokenizer used at that stage, so I'm guessing it uses the default spaCy tokenizer. I do specify the language (he), and I see that it's included in the command line:
python scripts/convert.py he assets/train.json corpus/train.spacy

I tried the default spaCy Hebrew tokenizer, and it looks like it doesn't handle these sentences well. Here is an example:

import spacy
from spacy.tokens import DocBin

sentence='מפגש זה של שני התאים, T ו-B, יוביל בדיעבד להפרשת IgE על ידי תאי B.'
nlp = spacy.blank('he')
doc = nlp.tokenizer(sentence)

for token in doc:
    print(token)

This is the output:

...
יוביל
בדיעבד
להפרשת
IgE
על
ידי
תאי
B.

It doesn't separate the period from the B at the end of the sentence.

So what I need to do is replace the default spaCy tokenizer so that I can create a valid doc span, but I'm not sure how to do it. This is the relevant code from convert.py:

import srsly
import typer
import warnings
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def convert(lang: str, input_path: Path, output_path: Path):
    nlp = spacy.blank(lang)
    db = DocBin()
    for text, annot in srsly.read_json(input_path):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                warnings.warn(msg)
            else:
                ents.append(span)

        #print("------------------------------------")
        #print(text)
        #print('ents:', ents)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_path)


if __name__ == "__main__":
    typer.run(convert)

polm Feb 21, 2022

Ah, I see. If I understand correctly you have an entity like He works at [A B]. and the tokenizer is causing issues with that.

In that case the behavior is not Hebrew-specific, though it might be different from what you want if the Latin alphabet is consistently treated the wrong way because it's considered non-standard. I would recommend looking at the tokenizer guide and maybe removing the period from the list of suffixes if you never want it joined to preceding letters. I'm not familiar with Hebrew, but it seems like abbreviations don't use periods? If that's the case then we may want to amend our default settings.

dmoti · 2022-03-03T15:42:00Z


I tried to replace the default spaCy tokenizer with my own (using trankit) and I'm still getting errors in convert. For example, this is my sentence (note that the period appears on the right here, but it is actually on the left):

'ככלל, חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול.'

This is what the tokenization looks like:

['ככלל', ',', 'חולים', 'עם', 'ספירת', 'טסיות', 'מעל', '30,000', 'ל', 'מק"ל', 'ללא', 'דמם', 'משמעותי', 'ו', 'ללא', 'סיכון', 'מוגבר', 'ל', 'דם_', '_של_', '_הם', 'אינם', 'זקוקים', 'ב', 'דרך', 'כלל', 'ל', 'טיפול', '.']

When I run convert I get the following warning:
'Skipping entity [48, 59, CLINIC-FIND] in the following text because the character span ' דמם משמעות' does not align with token boundaries:\n\n'ככלל, חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול.'\n'

It misses the last letter of the entity, but the letter exists in the token and the span is correct.
I tried to debug the issue, and when I print the text inside the doc in convert.py right after the warning is issued I see this:
'ככלל , חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול . '
A space is added before the first comma and around the final period, which caused the entity to be misaligned with the tokens. I have no idea where these spaces came from; in my train.json the string doesn't have them. Strangely, the comma in the number 30,000 stayed without an extra space.

1 reply

polm Mar 6, 2022

That's weird. spaCy shouldn't ever add spaces, so not sure how that could be happening - could the tokenizer be doing something? If you have a code sample / repo we could look at we could help you narrow down the error more.

dmoti · 2022-03-10T14:44:26Z


I prepared a package with example code, and I included requirements.txt for the virtual env.
The package has 2 Python files, convert.py and heb_tokenizer_convert.py. To run the example, create a virtual env with the requirements I sent and run this inside it:
python convert.py he train.json train.spacy
The program should stop at a breakpoint set on line 33 of convert.py. In the debugger you can see the following:

ipdb> text
'ככלל, חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול.'
ipdb> doc.text
'ככלל , חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול . ' '
ipdb> msg
'Skipping entity [48, 59, CLINIC-FIND] in the following text because the character span \' דמם משמעות\' does not align with token boundaries:\n\n\'ככלל, חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול.\'\n'
ipdb>

You can see that doc.text has 3 added spaces: one at position 4, and one each before and after the last period.
The code is a bit messy, but the idea is to get the tokens from trankit and pass them as a parameter to the custom tokenizer (heb_tokenizer_convert.py). I do this because trankit is slow and I want to tokenize everything before running convert.
tst.zip

1 reply

polm Mar 11, 2022

OK, thanks for the sample code.

I'm having difficulty installing trankit but I think what is happening is you are creating the Doc without space information and that's causing this issue.

You have code like this:

doc = nlp.make_doc((text, tokens['sentences'][0]['tokens']))

But instead you should have something like:

from spacy.tokens import Doc

# words is a list of strings
# spaces is a list of booleans indicating whether a space follows the token
doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
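
One possible way to derive the `spaces` list from the original string is sketched below. `spaces_from_text` is a hypothetical helper, not part of spaCy or trankit, and it assumes every word appears verbatim in the text, in order, separated only by whitespace.

```python
def spaces_from_text(text, words):
    """For each word, report whether a single space follows it in the original text."""
    spaces = []
    pos = 0
    for word in words:
        start = text.find(word, pos)
        if start < 0:
            raise ValueError(f"token {word!r} not found in text after offset {pos}")
        # Anything skipped between two tokens must be whitespace only.
        if text[pos:start].strip():
            raise ValueError(f"unexpected text {text[pos:start]!r} before token {word!r}")
        pos = start + len(word)
        spaces.append(text[pos:pos + 1] == " ")
    return spaces


text = "He works at A B."
words = ["He", "works", "at", "A", "B", "."]
spaces = spaces_from_text(text, words)

# Joining words with their recorded spaces should reproduce the input exactly.
rebuilt = "".join(word + (" " if space else "") for word, space in zip(words, spaces))
print(spaces)
print(rebuilt == text)
```

The resulting `words` and `spaces` lists are what the `Doc(nlp.vocab, words=..., spaces=...)` call above expects. Token pieces that are not literal substrings of the text (for example rewritten or expanded forms) would make the helper raise, so the surface tokens would need to be used instead.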

It may be helpful to look at how this is handled in tokenizers for other languages, like Japanese.

If you don't provide spaces, spaCy assumes there is a space after every token. This behaviour is necessary for handling CoNLL data, which often lacks any information about spaces, but if you have real space information you should provide it.
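
The offset shift reported earlier in the thread can be reproduced without spaCy. This is a minimal sketch: `split_trailing_punct` is a hypothetical stand-in for the real tokenizer, and the rebuilt string simply mimics the "one space after every token" default described above.

```python
# Sentence and entity offsets quoted from the thread.
text = 'ככלל, חולים עם ספירת טסיות מעל 30,000 למק"ל ללא דמם משמעותי וללא סיכון מוגבר לדמם אינם זקוקים בדרך כלל לטיפול.'
start, end = 48, 59


def split_trailing_punct(text):
    """Hypothetical tokenizer: split on whitespace, then peel off a final ',' or '.'."""
    words = []
    for chunk in text.split():
        if len(chunk) > 1 and chunk[-1] in ",.":
            words.extend([chunk[:-1], chunk[-1]])
        else:
            words.append(chunk)
    return words


words = split_trailing_punct(text)

# Mimic a Doc built without `spaces`: one space after every token.
rebuilt = "".join(word + " " for word in words)

print(repr(text[start:end]))      # the annotated entity in the original text
print(repr(rebuilt[start:end]))   # same offsets in the rebuilt text, shifted by one
print(len(rebuilt) - len(text))   # number of extra spaces
```

The same character offsets pick out a different substring once a space has been inserted before the first comma, which is why the entity no longer lines up with token boundaries.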
