Omit separate 'space' tokens in Doc generation #10213
Hi,
Since recent versions of spaCy allow you to pass Docs to pipelines, what you can do is use a blank pipeline for tokenization, then make a list of non-space tokens, then make a Doc from that and pass it to the real pipeline. Something like:
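The snippet that belonged here did not survive, so here is a minimal sketch of the idea rather than the original code. It assumes spaCy v3.2 or later (where `nlp` accepts a `Doc`). The helper name `drop_space_tokens` is made up for illustration, and `spacy.blank("en")` stands in for whatever real pipeline you would load:

```python
import spacy
from spacy.tokens import Doc


def drop_space_tokens(doc: Doc, vocab) -> Doc:
    """Build a new Doc on `vocab` without the whitespace-only tokens.

    Hypothetical helper, not part of spaCy.
    """
    words, spaces = [], []
    for token in doc:
        if token.is_space:
            # A dropped whitespace token still separates its neighbours,
            # so make sure the previous kept token is followed by a space.
            if spaces:
                spaces[-1] = True
            continue
        words.append(token.text)
        spaces.append(bool(token.whitespace_))
    return Doc(vocab, words=words, spaces=spaces)


# Blank pipeline used only for tokenization.
tokenizer_nlp = spacy.blank("en")
# Stand-in for the real pipeline (assumption: replace with your loaded model).
nlp = spacy.blank("en")

raw = tokenizer_nlp("Hello   world,\n\nthis is  spaCy.")
clean = drop_space_tokens(raw, nlp.vocab)
doc = nlp(clean)  # passing a Doc skips the pipeline's own tokenizer

print([t.text for t in doc])
```

The new `Doc` is created on the real pipeline's vocab so that the pipeline components and the `Doc` share the same string store. Note that `doc.text` will no longer match the input string exactly, since the extra whitespace is gone.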
A simpler thing you can do with regex is something like:
This will replace all runs of whitespace, including unicode whitespace, with a single space.
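The regex snippet is also missing from this copy of the thread. A minimal sketch of what it could look like, using only the standard `re` module (the function name `collapse_whitespace` is made up for illustration):

```python
import re


def collapse_whitespace(text: str) -> str:
    # For str patterns, \s matches Unicode whitespace by default in Python 3,
    # so this also covers things like non-breaking spaces.
    return re.sub(r"\s+", " ", text)


print(collapse_whitespace("Hello   world,\n\nthis\u00a0is  spaCy."))
```

Leading or trailing whitespace ends up as a single space rather than being removed, so you may want to call `.strip()` on the result before passing the text to `nlp`.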
Like the |