Splitting "can't" into two tokens is normal and intentional and common in NLP tools in general. It makes processing more consistent since it treats it as "can not". You will see this with other contractions like "don't" or "wouldn't".

"id" is a bit weird. I guess it's by relation to "I'd" but it seems to be treated as two tokens in any instance, including "Freud talked about the id a lot". That looks like a bug to me.

Either way, the behavior is unchanged in the most recent version of spaCy.
