This happens because the semicolon is not in the default list of infixes: something like "hot;dog" is normally a typo rather than valid text, so the tokenizer does not split inside it. You can get the behavior you want by adding the semicolon to the list of infixes.

import spacy
nlp = spacy.blank("en")


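# Add the semicolon to the default infix patterns so that "laptop;priced" is split.
# list(...) keeps this working whether the defaults are stored as a list or a tuple.
infixes = list(nlp.Defaults.infixes) + [";"]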
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

text = "fast, great screen, beautiful apps for a laptop;priced at 1100 on the apple website;amazon had it for 1098+ tax -  plus i had a 10% off coupon from amazon-cost me 998 plus tax- 1070- OTD!"
doc = nlp(text)

for token in doc:
    print(token.text)

You can see an overview of this in …
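
To make it clearer why adding one pattern is enough, here is a rough, stdlib-only sketch of what an infix `finditer` does to a single whitespace-delimited chunk. This is not spaCy's actual implementation: `compile_infixes` and `split_on_infixes` are hypothetical helpers written for illustration, and the pattern list is a tiny stand-in for the real defaults. The real tokenizer also applies prefixes, suffixes, and exception rules, which are ignored here.

```python
import re

# Hypothetical stand-in for the default infix patterns (the real list is much longer).
default_infixes = [r"\.\.\.+"]


def compile_infixes(patterns):
    """Join the patterns into one alternation (an assumption about how the
    combined infix regex is built; simplified for illustration)."""
    return re.compile("|".join(p for p in patterns if p.strip()))


def split_on_infixes(chunk, infix_finditer):
    """Split one chunk around every infix match, keeping the infix as its own piece."""
    pieces = []
    start = 0
    for match in infix_finditer(chunk):
        if match.start() == match.end():
            continue  # ignore empty matches
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces


without_semicolon = compile_infixes(default_infixes)
with_semicolon = compile_infixes(default_infixes + [";"])

print(split_on_infixes("laptop;priced", without_semicolon.finditer))
# ['laptop;priced']
print(split_on_infixes("laptop;priced", with_semicolon.finditer))
# ['laptop', ';', 'priced']
```

The point of the sketch is only that the tokenizer can split inside a chunk where some infix pattern matches, so a semicolon between two words stays glued to them until a pattern for it is added.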

Answer selected by v-JiangNan
Labels: lang / en (English language data and models), feat / tokenizer (Feature: Tokenizer)

This discussion was converted from issue #9195 on September 13, 2021 04:36.