Since recent versions of spaCy allow you to pass a Doc to a pipeline, what you can do is use a blank pipeline for tokenization, filter out the space tokens, build a new Doc from what's left, and pass that to the real pipeline. Something like:

import spacy
from spacy.tokens import Doc

blank = spacy.blank("en")  # blank pipeline: tokenizer only
nlp = spacy.load("my_model")

# Tokenize with the blank pipeline and keep only the non-space tokens
doc = blank("I    like\t\tcheese.")
toks = [tok for tok in doc if not tok.is_space]
words = [tok.text for tok in toks]
# whitespace_ is the token's trailing whitespace as a string, so bool()
# marks which tokens were followed by a space
spaces = [bool(tok.whitespace_) for tok in toks]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)
doc2 = nlp(doc2)  # tokenizer won't run but other components will
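
To sanity-check the result ("my_model" is a placeholder; this assumes it's a pipeline with a tagger), you can confirm that no space tokens survive and that the real pipeline's components ran:

# The rebuilt Doc contains no whitespace tokens
assert not any(tok.is_space for tok in doc2)
# The loaded pipeline's annotations are present (assuming it has a tagger)
print([(tok.text, tok.pos_) for tok in doc2])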

A simpler thing you can do is normalize the whitespace with a regex before tokenizing, something like:

import re

text = re.sub(r"\s+", " ", text)
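
In context (a minimal sketch; "my_model" is the same placeholder as above):

import re
import spacy

nlp = spacy.load("my_model")  # placeholder model name
# Collapse every run of whitespace (spaces, tabs, newlines) to one space
doc = nlp(re.sub(r"\s+", " ", "I    like\t\tcheese."))
print([tok.text for tok in doc])

The tradeoff is that this rewrites the text itself, so character offsets in the resulting Doc refer to the normalized string rather than the original input.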
