Customize whitespace tokenization #9978
-
[Edited to add whitespace tokenizer example] I'm testing spaCy tokenization and sentencization with the default English transformer model on domain-specific texts that make heavy use of indentation. On the whole, sentencizing works fine, as does non-whitespace tokenization. The problem is the whitespace tokenization.
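For example (a minimal sketch — the sample text here is made up, but it shows the behaviour I mean):

```python
import spacy

nlp = spacy.load("en_core_web_trf")

text = "Section heading:\n\n\t\tFirst indented line.\n\n\t\tSecond indented line."
doc = nlp(text)
print([repr(t.text) for t in doc])
# Each "\n\n\t\t" run comes out as a single whitespace token,
# so the newlines and the tabs can't be handled separately.
```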
I've read the tokenization, sentencization, merging/splitting, and rule-based matching docs, as well as related discussion threads, and I'm unsure which approach is best.
Any suggestions, please?
-
To clarify, are your examples supposed to be one token per line or one sentence per line?
-
A tokenizer special case doesn't work ...
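(The snippet itself wasn't preserved in this thread; presumably the attempt was something along these lines — a hypothetical reconstruction, not the original code:)

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Try to register the combined whitespace run as a special case;
# the ORTH values must concatenate back to the original string.
nlp.tokenizer.add_special_case("\n\n\t\t", [{ORTH: "\n\n"}, {ORTH: "\t\t"}])

doc = nlp("Heading:\n\n\t\tIndented text.")
print([repr(t.text) for t in doc])
# Reportedly the run still comes out as one "\n\n\t\t" token.
```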
-
I've decided for the time being not to fiddle with the whitespace tokenization, and to simply mark tokens starting with \n\n as sentence starters. That way the indentation is correctly included in the following sentence rather than the preceding one, although I then need to ignore the newline tokens in some of the later components. Perhaps at a later time I can tinker with the whitespace tokenization to split a combined \n\n\t\t token into separate \n\n and \t\t tokens.
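For anyone finding this later, a minimal sketch of that approach (the component name here is my own choice): register a custom component and add it before the parser, since sentence boundaries can only be preset on a Doc that hasn't been parsed yet, and the parser respects preset boundaries:

```python
import spacy
from spacy.language import Language

@Language.component("indent_sent_start")
def indent_sent_start(doc):
    # Treat any token that begins with a blank line as the start of a new sentence,
    # so the indentation attaches to the following sentence.
    for token in doc:
        if token.text.startswith("\n\n"):
            token.is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("indent_sent_start", before="parser")
```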
-
Does this do what you want?
This is just following the docs for modifying existing rules.
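The snippet and its output weren't preserved above, but a sketch following that docs pattern might look like this (the exact infix regex is an assumption, not the original code — worth verifying on your spaCy version):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_trf")

# Add an infix rule that splits between a newline and a following tab,
# so a combined "\n\n\t\t" token becomes "\n\n" + "\t\t".
infixes = list(nlp.Defaults.infixes) + [r"(?<=\n)(?=\t)"]
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("Heading:\n\n\t\tIndented line.")
print([repr(t.text) for t in doc])
# Check that '\n\n' and '\t\t' now come out as separate tokens.
```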