In spaCy, the ASCII space at the end of a token is not represented as a separate token; instead, it is stored as an attribute of the token itself.

For example:

import spacy

nlp = spacy.blank("en")
doc = nlp("Sandra met Jerry on  Monday.")
print([token.text for token in doc])

Here we have 7 tokens, because the second of the two spaces between "on" and "Monday" becomes a token of its own:

['Sandra', 'met', 'Jerry', 'on', ' ', 'Monday', '.']

We can check each token for a trailing ASCII space with

print([token.whitespace_ for token in doc])

which in this case gives us:

[' ', ' ', ' ', ' ', '', '', '']

The last three tokens have no trailing whitespace: the whitespace token " " is followed directly by "Monday", "Monday" is adjacent to ".", and "." is at the end of the string. All other tokens have a trailin…
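To see why storing the space on the token matters, here is a minimal sketch (assuming spaCy is installed; the names `pairs`, `rebuilt`, and `rebuilt_ws` are just illustrative) that joins each token's text with its trailing whitespace and gets the original string back:

```python
import spacy

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no trained model needed
text = "Sandra met Jerry on  Monday."
doc = nlp(text)

# Pair each token with its trailing whitespace ('' when there is none).
pairs = [(token.text, token.whitespace_) for token in doc]

# Concatenating text + trailing whitespace restores the input exactly.
rebuilt = "".join(token.text + token.whitespace_ for token in doc)

# token.text_with_ws is the built-in shorthand for the same concatenation.
rebuilt_ws = "".join(token.text_with_ws for token in doc)

print(pairs)
print(rebuilt == text, rebuilt_ws == text)
```

If you only need a yes/no answer per token, `bool(token.whitespace_)` should be enough.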

Answer selected by SmallThings-coder
Labels
feat / spanruler Feature: Entity and span ruler