Since recent versions of spaCy allow you to pass a Doc to a pipeline, what you can do is use a blank pipeline for tokenization, filter out the space tokens, build a new Doc from what's left, and pass that to the real pipeline. Something like:

import spacy
from spacy.tokens import Doc

blank = spacy.blank("en")  # blank pipeline: tokenizer only
nlp = spacy.load("my_model")

# Tokenize with the blank pipeline and keep only the non-space tokens
doc = blank("I    like\t\tcheese.")
toks = [tok for tok in doc if not tok.is_space]
words = [tok.text for tok in toks]
# whitespace_ is the token's trailing whitespace as a string, so bool()
# marks which tokens were followed by a space
spaces = [bool(tok.whitespace_) for tok in toks]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)
doc2 = nlp(doc2)  # tokenizer won't run but other components will
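
To sanity-check the result ("my_model" is a placeholder; this assumes it's a pipeline with a tagger), you can confirm that no space tokens survive and that the real pipeline's components ran:

# The rebuilt Doc contains no whitespace tokens
assert not any(tok.is_space for tok in doc2)
# The loaded pipeline's annotations are present (assuming it has a tagger)
print([(tok.text, tok.pos_) for tok in doc2])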

A simpler thing you can do is normalize the whitespace with a regex before tokenizing, something like:

import re

text = re.sub(r"\s+", " ", text)
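
In context (a minimal sketch; "my_model" is the same placeholder as above):

import re
import spacy

nlp = spacy.load("my_model")  # placeholder model name
# Collapse every run of whitespace (spaces, tabs, newlines) to one space
doc = nlp(re.sub(r"\s+", " ", "I    like\t\tcheese."))
print([tok.text for tok in doc])

The tradeoff is that this rewrites the text itself, so character offsets in the resulting Doc refer to the normalized string rather than the original input.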
