Description
I'm using spaCy v3.2.1 with Python 3.8.8 on Ubuntu 20.04.
I see these warnings when I start training NER. The problem is with punctuation marks at the beginning or end of an entity, specifically an entity that begins with a dash or ends with a period. I've plugged my own tokenizer into the training, and I know that my tokenizer separates the punctuation from the entity in both cases.
I'm using the Trankit tokenizer. An example:
import trankit

trankit_nlp = trankit.Pipeline("hebrew", gpu=False)
sentence = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'
# this is how I use the tokenizer:
ht_tokens = trankit_nlp.tokenize(sentence)
tokens = []
for x in ht_tokens["sentences"][0]["tokens"]:
    # tokens that were split further carry their parts under "expanded"
    if "expanded" in x:
        tokens += [xt["text"] for xt in x["expanded"]]
    else:
        tokens.append(x["text"])
print(tokens)
This is the result:
['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(', 'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']
Sorry about the reversed [; it's because Hebrew is written right-to-left.
The warning I get during training is about the dash before the 200. As you can see, the tokenizer separates the dash. I see the same behavior when a period is attached to the entity at the end of a sentence. The indices of the entity are correct.
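One way to double-check the indices outside of spaCy is to map the tokens back to character offsets and test whether an entity's start and end fall on token boundaries. The sketch below is only an illustration. `token_offsets` and `is_aligned` are hypothetical helpers, not part of spaCy or Trankit. The greedy search assumes every token appears verbatim in the sentence, which may not hold for every expanded token. The span for "200" is looked up from the text here rather than taken from the real annotations.

```python
def token_offsets(text, tokens):
    """Greedily locate each token in `text`; return a list of (start, end) offsets."""
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            # Assumption violated: the token is not a verbatim substring of the text.
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def is_aligned(span, offsets):
    """True if the span starts at some token start and ends at some token end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return span[0] in starts and span[1] in ends


text = 'האבחנה נקבעת על פי רמת הטריגליצרידים (יותר מ-200 מ"ג דל) בנוזל המיימת.'
tokens = ['ה', 'אבחנה', 'נקבעת', 'על', 'פי', 'רמת', 'ה', 'טריגליצרידים', '(',
          'יותר', 'מ', '-', '200', 'מ"ג', 'דל', ')', 'ב', 'נוזל', 'ה', 'מיימת', '.']

offsets = token_offsets(text, tokens)

# Hypothetical entity span covering only "200", without the preceding dash.
start = text.index("200")
entity = (start, start + len("200"))
print(is_aligned(entity, offsets))
```

If this check passes for every annotated span while training still warns, the tokens seen at training time may differ from the ones printed above.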