Incorrect tokenization: #10728
sadeghjafari5528 asked this question in Help: Other Questions
spaCy code to tokenize an English text:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Natural language processing(NLP)')
print([token.text for token in doc])
```

The output is `['Natural', 'language', 'processing(NLP', ')']`. Why can't spaCy tokenize it correctly?
Answered by adrianeboyd on Apr 29, 2022
The default English tokenizer settings don't split on `(` as an infix, just as a prefix. You can modify the tokenizer settings for your task: https://spacy.io/usage/linguistic-features#tokenization

But be aware that the performance for a trained model like `en_core_web_sm` might degrade a bit if you modify its tokenizer settings, because it wasn't trained on this exact tokenization.
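For illustration, here is a minimal sketch of one way to do that customization, following the general pattern from the linked tokenization docs. The specific infix pattern `[()]` is an assumption chosen for this example; adjust it to whatever characters you need split.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Assumption for this sketch: append a character class for parentheses
# to the default infix patterns, so "(" and ")" are split even when
# they appear inside a token like "processing(NLP".
infixes = list(nlp.Defaults.infixes) + [r"[()]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp('Natural language processing(NLP)')
print([token.text for token in doc])
# ['Natural', 'language', 'processing', '(', 'NLP', ')']
```

Only `infix_finditer` is swapped out here; the prefix and suffix rules stay at their defaults, which is why the trailing `)` was already split off correctly in the original output.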