Off by 1 error in Tokenizer #5043

hasandiwan · 2020-02-21T02:35:49Z

My machine with spaCy's git repo is currently packed; I'll contribute a patch once I set it up again in a few weeks.

How to reproduce the behaviour

from spacy.lang.en import English

s = "Hello world, I am Zaf"  # 5 words, by my count
nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(s)
len(tokens)
# 6
tokens  # 5 words again, if I can count
# Hello world, I am Zaf

Your Environment

Info about spaCy

spaCy version: 2.2.3
Platform: Darwin-19.3.0-x86_64-i386-64bit
Python version: 3.7.6

Answered by adrianeboyd

adrianeboyd · 2020-02-21T08:25:24Z

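Hi, the tokenizer returns a Doc object rather than just a list of tokens, so printing it shows the original text instead of the individual tokens. The comma is split off as its own token, which is why the length is 6 rather than 5. You can inspect the tokens like this: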

doc = tokenizer(s)
print([t.text for t in doc])
# ['Hello', 'world', ',', 'I', 'am', 'Zaf']
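
If the goal is a word count that ignores punctuation, one option is to filter on the token's `is_punct` flag. This is only a minimal sketch, not part of the original answer: it assumes a blank `English` pipeline and uses `nlp.tokenizer` (the tokenizer the pipeline already holds) instead of `create_tokenizer`; the variable names `all_tokens` and `words` are just for illustration.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained model required
nlp = English()
doc = nlp.tokenizer("Hello world, I am Zaf")

# Every token, punctuation included
all_tokens = [t.text for t in doc]

# Drop tokens flagged as punctuation to approximate a "word" count
words = [t.text for t in doc if not t.is_punct]

print(len(doc), all_tokens)
print(len(words), words)
```

For the sentence above, the first count includes the comma and the second does not, so the second matches the 5 words counted by hand.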
