Sorry, I can't tell what the input text looks like from the formatting in the question. If you have run-on words like "borroweris", you might want to look into spell-checking libraries that identify run-ons. You could run one either as a preprocessing step before spaCy (inserting whitespace), or as a postprocessing step after the tokenizer that retokenizes those tokens (which you could do while preserving the original whitespace). The rule-based tokenizer can handle predictable cases like splitting "can't" into "ca" and "n't", but not arbitrary run-on words.
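As a rough illustration of the preprocessing option, here is a minimal sketch that inserts whitespace into run-on words before the text reaches the tokenizer. It is pure Python and not part of spaCy. The names `VOCAB`, `split_run_on`, and `insert_whitespace` are hypothetical, and the tiny vocabulary is only a stand-in for a real word list or spell-checker dictionary.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch); in practice
# you would load a real word list or use a spell-checking library.
VOCAB = {"borrower", "is", "the", "a", "late", "loan", "paid"}


def split_run_on(word, vocab):
    """Split `word` into the fewest vocabulary words, or return None."""
    lower = word.lower()
    n = len(lower)
    # best[i] is the shortest known split of word[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and lower[j:i] in vocab:
                candidate = best[j] + [word[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def insert_whitespace(text, vocab):
    """Replace each unknown alphabetic word with its split, if one exists."""

    def fix(match):
        word = match.group(0)
        if word.lower() in vocab:
            return word  # already a known word, leave it alone
        pieces = split_run_on(word, vocab)
        # Words that cannot be split into known pieces are left unchanged.
        return " ".join(pieces) if pieces else word

    return re.sub(r"[A-Za-z]+", fix, text)


print(insert_whitespace("The borroweris late", VOCAB))
```

The cleaned string can then be passed to the spaCy pipeline as usual. For the postprocessing variant, the same split could instead be applied to the tokenized `Doc` with `Doc.retokenize()` and `retokenizer.split`, which keeps the original text intact.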

Answer selected by polm
Labels
feat / tokenizer Feature: Tokenizer