doc.char_span return None when reading Japanese #12082
Answered by polm · BrambleXu asked this question in Help: Coding & Implementations
How to reproduce the behaviour

```python
import spacy

nlp = spacy.blank("ja")

error_sample = (
    'また、インド洋地域を重視し、独伊の作戦と呼応し、機を見てインド・西亜打通作戦を完遂し、戦争終末促進に努めようとした。',
    [(3, 7, '地名'), (14, 15, '地名'), (15, 16, '地名'), (28, 38, 'イベント名')],
)
text, annotations = error_sample

doc = nlp(text)
ents = []
for start, end, label in annotations:
    span = doc.char_span(start, end, label=label)
    ents.append(span)
doc.ents = ents
```

Running this raises an error: `doc.char_span` returns `None` for some of the annotations, so the assignment `doc.ents = ents` fails. How can I fix this issue?
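The failure mode can be sketched in plain Python. The helper below is an illustrative model of a strict character-to-token alignment check, not spaCy's actual implementation, and the token offsets are hypothetical — they only show how a character span can miss token boundaries:

```python
# Illustrative model of a strict character-to-token alignment check
# (not spaCy's actual implementation).

def char_span_aligns(token_offsets, start, end):
    """Return True if (start, end) falls exactly on token boundaries."""
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    return start in starts and end in ends

# Hypothetical segmentation: suppose the tokenizer merges
# characters 14-16 into a single token.
tokens = [(0, 2), (2, 3), (3, 7), (7, 14), (14, 16)]

print(char_span_aligns(tokens, 3, 7))    # True: both ends are token boundaries
print(char_span_aligns(tokens, 14, 15))  # False: offset 15 falls inside a token
```

Under a strict check like this, any annotation whose offsets fall inside a token comes back as `None`.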
Answered by polm on Jan 10, 2023
Sorry you're having trouble with this. The issue is that the tokenizer is treating 独伊 as a single token, so your character boundaries don't align with token boundaries. In that case `char_span` returns `None`, as mentioned in the docs.

Answer selected by adrianeboyd
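spaCy's `Doc.char_span` accepts an `alignment_mode` argument (`"strict"` by default, plus `"expand"` and `"contract"`) for exactly this situation. The sketch below models what `"expand"`-style snapping could look like; it is an illustrative reconstruction, not spaCy's actual implementation, and the token offsets other than 独伊 at characters 14-16 are made up:

```python
# Sketch of expand-style alignment: snap a character span outward to
# the nearest enclosing token boundaries (illustrative, not spaCy's code).

def expand_to_token_boundaries(token_offsets, start, end):
    """Snap (start, end) outward to enclosing token boundaries."""
    new_start = max((s for s, _ in token_offsets if s <= start), default=start)
    new_end = min((e for _, e in token_offsets if e >= end), default=end)
    return new_start, new_end

# Assume 独伊 is one token covering characters 14-16; other offsets
# are hypothetical.
tokens = [(0, 2), (2, 3), (3, 7), (7, 14), (14, 16)]

print(expand_to_token_boundaries(tokens, 14, 15))  # (14, 16): covers 独伊 whole
print(expand_to_token_boundaries(tokens, 3, 7))    # (3, 7): already aligned
```

With spaCy itself, the equivalent call would be `doc.char_span(start, end, label=label, alignment_mode="expand")`. Note that expanded spans can overlap — here both the (14, 15) and (15, 16) annotations would snap to the same 独伊 token — and `doc.ents` does not accept overlapping spans, so the results may still need filtering.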