The default English tokenizer settings don't split on `(` as an infix, only as a prefix. You can modify the tokenizer settings for your task: https://spacy.io/usage/linguistic-features#tokenization
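
For example, here is a minimal sketch of adding `(` to the infix patterns, following the approach in the linked docs (assuming spaCy v3; the sample text is just illustrative):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Extend the default infix patterns with an escaped "(" so it is also
# split in the middle of a token, not only at the start of one.
infixes = nlp.Defaults.infixes + [r"\("]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("value(s)")
print([t.text for t in doc])  # ['value', '(', 's', ')']
```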

But be aware that the performance of a trained model like en_core_web_sm might degrade a bit if you modify its tokenizer settings, because it wasn't trained on this exact tokenization.

Answer selected by adrianeboyd
Labels: lang / en (English language data and models), feat / tokenizer (Feature: Tokenizer)
This discussion was converted from issue #10696 on April 29, 2022 at 08:32.