doc.char_span return None when reading Japanese #12082
Answered by polm · BrambleXu asked this question in Help: Coding & Implementations
How to reproduce the behaviour

```python
import spacy

nlp = spacy.blank("ja")

error_sample = (
    'また、インド洋地域を重視し、独伊の作戦と呼応し、機を見てインド・西亜打通作戦を完遂し、戦争終末促進に努めようとした。',
    [(3, 7, '地名'), (14, 15, '地名'), (15, 16, '地名'), (28, 38, 'イベント名')],
)
text, annotations = error_sample

doc = nlp(text)
ents = []
for start, end, label in annotations:
    span = doc.char_span(start, end, label=label)
    ents.append(span)
doc.ents = ents
```

Running this raises an error: `doc.char_span` returns `None` for some of the annotations, so the assignment `doc.ents = ents` fails. How can I fix this issue?
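The failure mode can be sketched in plain Python. The helper below is an illustrative model of a strict character-to-token alignment check, not spaCy's actual implementation, and the token offsets are hypothetical — they only show how a character span can miss token boundaries:

```python
# Illustrative model of a strict character-to-token alignment check
# (not spaCy's actual implementation).

def char_span_aligns(token_offsets, start, end):
    """Return True if (start, end) falls exactly on token boundaries."""
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    return start in starts and end in ends

# Hypothetical segmentation: suppose the tokenizer merges
# characters 14-16 into a single token.
tokens = [(0, 2), (2, 3), (3, 7), (7, 14), (14, 16)]

print(char_span_aligns(tokens, 3, 7))    # True: both ends are token boundaries
print(char_span_aligns(tokens, 14, 15))  # False: offset 15 falls inside a token
```

Under a strict check like this, any annotation whose offsets fall inside a token comes back as `None`.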
Answered by polm on Jan 10, 2023
Sorry you're having trouble with this. The issue is that the tokenizer is treating 独伊 as a single token, so your character boundaries don't align with token boundaries. In that case `char_span` returns `None`, as mentioned in the docs.

Answer selected by adrianeboyd
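spaCy's `Doc.char_span` accepts an `alignment_mode` argument (`"strict"` by default, plus `"expand"` and `"contract"`) for exactly this situation. The sketch below models what `"expand"`-style snapping could look like; it is an illustrative reconstruction, not spaCy's actual implementation, and the token offsets other than 独伊 at characters 14-16 are made up:

```python
# Sketch of expand-style alignment: snap a character span outward to
# the nearest enclosing token boundaries (illustrative, not spaCy's code).

def expand_to_token_boundaries(token_offsets, start, end):
    """Snap (start, end) outward to enclosing token boundaries."""
    new_start = max((s for s, _ in token_offsets if s <= start), default=start)
    new_end = min((e for _, e in token_offsets if e >= end), default=end)
    return new_start, new_end

# Assume 独伊 is one token covering characters 14-16; other offsets
# are hypothetical.
tokens = [(0, 2), (2, 3), (3, 7), (7, 14), (14, 16)]

print(expand_to_token_boundaries(tokens, 14, 15))  # (14, 16): covers 独伊 whole
print(expand_to_token_boundaries(tokens, 3, 7))    # (3, 7): already aligned
```

With spaCy itself, the equivalent call would be `doc.char_span(start, end, label=label, alignment_mode="expand")`. Note that expanded spans can overlap — here both the (14, 15) and (15, 16) annotations would snap to the same 独伊 token — and `doc.ents` does not accept overlapping spans, so the results may still need filtering.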