Incorrect tokenization: #10728
sadeghjafari5528 asked this question in Help: Other Questions
spaCy code to tokenize an English text:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Natural language processing(NLP)')
print([token.text for token in doc])
```

The output is `['Natural', 'language', 'processing(NLP', ')']`. Why can't spaCy tokenize it correctly?
Answered by adrianeboyd on Apr 29, 2022
The default English tokenizer settings don't split on `(` as an infix, just as a prefix. You can modify the tokenizer settings for your task: https://spacy.io/usage/linguistic-features#tokenization

But be aware that the performance for a trained model like `en_core_web_sm` might degrade a bit if you modify its tokenizer settings, because it wasn't trained on this exact tokenization.
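For illustration, here is a minimal sketch of one way to do that customization, following the general pattern from the linked tokenization docs. The specific infix pattern `[()]` is an assumption chosen for this example; adjust it to whatever characters you need split.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Assumption for this sketch: append a character class for parentheses
# to the default infix patterns, so "(" and ")" are split even when
# they appear inside a token like "processing(NLP".
infixes = list(nlp.Defaults.infixes) + [r"[()]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp('Natural language processing(NLP)')
print([token.text for token in doc])
# ['Natural', 'language', 'processing', '(', 'NLP', ')']
```

Only `infix_finditer` is swapped out here; the prefix and suffix rules stay at their defaults, which is why the trailing `)` was already split off correctly in the original output.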