Fastest way to split a sentence to tokens #11944
-
Hello everyone. I want to split a sentence into tokens using the language model/tokenizer. From my understanding there are two approaches I can use: (1) iterate over the tokens in the Doc returned by the pipeline, or (2) use nlp.tokenizer.explain. When I compare the performance of the two approaches, I see an interesting behavior: approach (1) is about 4.5x faster!
Adding the snippet of the experiment below. BTW, I know about nlp.pipe; I just want to compare the two methods for a single text.
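The snippet was originally posted as a screenshot, so here is a minimal sketch of the kind of comparison described. The model name, sample text, and repeat count are illustrative assumptions, not taken from the post:

```python
import timeit
import spacy

# Assumed setup: a pretrained pipeline, as implied by "language model/tokenizer".
nlp = spacy.load("en_core_web_sm")
text = "This is a sample sentence used to compare the two approaches."

# Approach (1): run the pipeline and iterate over the tokens in the resulting Doc.
t1 = timeit.timeit(lambda: [t.text for t in nlp(text)], number=1000)

# Approach (2): use nlp.tokenizer.explain, which yields (rule, token_text) pairs.
t2 = timeit.timeit(lambda: [t for _, t in nlp.tokenizer.explain(text)], number=1000)

print(f"doc iteration:     {t1:.3f}s")
print(f"tokenizer.explain: {t2:.3f}s")
```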
Replies: 1 comment
-
Please don't post screenshots of code or terminal output, post them as text.
`nlp.tokenizer.explain` is not intended for normal tokenization; it's used to explain or debug the output of the tokenizer. It is not intended to be efficient, and it is not the normal way to use the tokenizer.

In spaCy, generally the fastest way to tokenize things is to use a blank pipeline (like `spacy.blank("en")`), which just runs the tokenizer. You can also just call the tokenizer directly (`nlp.tokenizer(text)`). Note that spaCy tokenizers don't use the language model.
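As a small sketch of the two suggestions above (the sample text is my own, not from the thread):

```python
import spacy

# A blank pipeline contains only the tokenizer, so calling it just tokenizes.
nlp = spacy.blank("en")
doc = nlp("Tokenize this sentence without loading a statistical model.")
print([token.text for token in doc])

# Equivalently, call the tokenizer directly; it also returns a Doc.
doc = nlp.tokenizer("Tokenize this sentence without loading a statistical model.")
print([token.text for token in doc])
```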