
Does this do what you want?

```python
import spacy

nlp = spacy.blank("en")

# Append rules that split a trailing newline or tab off the end of a token
suffixes = nlp.Defaults.suffixes + ['\n$', '\t$']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search


text = "\t\thello\n\n\t\thow are you"
for tok in nlp(text):
    print(repr(tok.text))
```

Output:

```
'\t'
'\t'
'hello'
'\n'
'\n'
'\t'
'\t'
'how'
'are'
'you'
```

This just follows the docs on modifying existing tokenizer rules.
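For intuition: `spacy.util.compile_suffix_regex` essentially ORs the patterns into a single regex, and the tokenizer then repeatedly peels a matching suffix off the end of each whitespace-delimited chunk, so `'\t\t'` becomes two `'\t'` tokens. A minimal pure-`re` sketch of that suffix loop (the `split_suffixes` helper is hypothetical, for illustration only, and ignores prefix/infix handling):

```python
import re

# OR the suffix patterns together, as compile_suffix_regex does
suffix_regex = re.compile("|".join(["\n$", "\t$"]))

def split_suffixes(chunk):
    """Repeatedly peel one matching suffix off the end of `chunk`."""
    pieces = []
    while chunk:
        match = suffix_regex.search(chunk)
        if not match:
            break
        # Strip by match length from the end; this sidesteps `$` also
        # matching just before a trailing newline mid-string.
        n = len(match.group())
        pieces.insert(0, chunk[-n:])
        chunk = chunk[:-n]
    return ([chunk] if chunk else []) + pieces

print(split_suffixes("hello\n\n"))
print(split_suffixes("\t\thello"))
```

Note that `"\t\thello"` is left intact here: the leading tabs would be handled by prefix rules, which is why the full answer works symmetrically out of the box (spaCy's defaults already split leading whitespace).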
