Splitting "can't" into two tokens is normal and intentional and common in NLP tools in general. It makes processing more consistent since it treats it as "can not". You will see this with other contractions like "don't" or "wouldn't".

"id" is a bit weird. I guess it's by relation to "I'd" but it seems to be treated as two tokens in any instance, including "Freud talked about the id a lot". That looks like a bug to me.

Either way, the behavior is unchanged in the most recent version of spaCy.
