Customize whitespace tokenization #9978

djmechanic · 2022-01-04T07:40:47Z

djmechanic
Jan 4, 2022

[Edited to add whitespace tokenizer example]

I'm testing Spacy tokenizing / sentencizing from the default English transformer model on domain-specific texts that heavily use indentation. On the whole sentencizing works fine, as does non-whitespace tokenization. The problem is the whitespace tokenization.
Because indentation is important to some of my downstream pipeline components, I need to direct the way Spacy tokenizes "\n", "\n\n" and "\t\t".

# What I get
"        YABBA",
"\n\n        ",
"DABBA\n\n        ",
"DOO"

# What I want; placement of \n\n and \t\t are important later in the pipeline
"        YABBA\n\n",
"        DABBA\n\n",
"        DOO"

I've read all the tokenization, sentencizing, merging/splitting and rule-based matching docs as well as discussion threads and unsure which approach is best:

I don't want to retrain the transformer model on indentation if a rule-based approach works
I don't want to write my whole own tokenizer when 95% of the default tokenization works fine
Tokenizer special cases don't appear to work for whitespace (and limited documentation on this)
Mark \n\n as a suffix and \t\t as a prefix, but how do I add those to the tokenizer without overriding default functionality?
Use retokenize, but how do I run this after the tokenizer before my other pipeline components e.g. parser/sentencizer?
Use token_match, but docs say this is to prevent token splits so I have no idea if its relevant here (also limited documentation on this)
The docs and a number of threads show the example of the whitespace tokenizer, but as I understand this I'd be overriding the default tokenizer with a whitespace-only tokenizer i.e. I'd lose all the other rules. What I want is all the other rules only to slightly modify the whitespace rule.

Any suggestions please?

Answered by polm

Jan 7, 2022

Does this do what you want?

import spacy

nlp = spacy.blank("en")

suffixes = nlp.Defaults.suffixes + ['\n$', '\t$']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search


text = "\t\thello\n\n\t\thow are you"
for tok in nlp(text):
    print(repr(tok.text))

Output:

'\t'
'\t'
'hello'
'\n'
'\n'
'\t'
'\t'
'how'
'are'
'you'

This is just following the docs for modifying existing rules.

View full answer

polm · 2022-01-04T07:50:47Z

polm
Jan 4, 2022

To clarify, are your examples supposed to be one token per line or one sentence per line?

6 replies

polm Jan 4, 2022

There is probably a way to modify this with the tokenizer character sets, but as a minimally complex workaround have you considered replacing \t and \n with TAB and NEWLINE (surrounded by spaces) or something?

djmechanic Jan 4, 2022
Author

Thank you for the suggestion; simple indeed.

Would those be custom tokens (which could be filtered out by downstream components) or the literal words "tab" and "newline"? In the latter would additional words have adverse effects on sentencizing, parser (dependencies), textcat, span cat, coref etc.?

polm Jan 5, 2022

I mean just do a string replace in your input, like text.replace("\n", " NEWLINE "). It will have slightly different embedding properties but I don't expect it to matter.

djmechanic Jan 6, 2022
Author

Thank you for the suggestion, Paul. I'd prefer to not follow this approach because it injects additional words which may alter how the sentences are parsed (which is crucial).

polm Jan 7, 2022

it injects additional words which may alter how the sentences are parsed

Note that if you make your newlines and tabs into tokens, they are also additional words, so that probably doesn't matter.

djmechanic · 2022-01-04T10:49:15Z

djmechanic
Jan 4, 2022
Author

Special case doesn't work ...

nlp = spacy.load("en_core_web_sm")
text = nlp("\t\tpeter\n\n\t\tpaul\n\n\t\tmary")

print([w.text for w in text])  # ['\t\t', 'peter', '\n\n\t\t', 'paul', '\n\n\t\t', 'mary']
nlp.tokenizer.add_special_case("\n\n\t\t", [{ORTH: "\n\n"}, {ORTH: "\t\t"}])
print([w.text for w in text])  # ['\t\t', 'peter', '\n\n\t\t', 'paul', '\n\n\t\t', 'mary']

0 replies

djmechanic · 2022-01-06T15:09:41Z

djmechanic
Jan 6, 2022
Author

I've decided for the time being to not fiddle with the whitespace tokenization, and simply mark tokens starting with \n\n\ as sentence starters. That way the indentation is correctly included in the following sentence rather than the preceding sentence, although I just need to ignore the newlines in some of the later components. Perhaps at a later time I can tinker with whitespace tokenization to split \n\n and \t\t into two tokens.

0 replies

polm · 2022-01-07T05:58:36Z

polm
Jan 7, 2022

Does this do what you want?

import spacy

nlp = spacy.blank("en")

suffixes = nlp.Defaults.suffixes + ['\n$', '\t$']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search


text = "\t\thello\n\n\t\thow are you"
for tok in nlp(text):
    print(repr(tok.text))

Output:

'\t'
'\t'
'hello'
'\n'
'\n'
'\t'
'\t'
'how'
'are'
'you'

This is just following the docs for modifying existing rules.

4 replies

djmechanic Jan 7, 2022
Author

Paul thank you greatly for being so persistent :-) Yes it almost does what I want, although I'd make \n\n a suffix and \t\t a prefix.

However, I did consider this but after reading those docs you linked to, discarded the idea because two things looked to be a problem: 1) it modifies nlp.Defaults which the docs said doesn't work on a pre-trained model (en_core_web_trf), and 2) it wasn't clear how to modify existing rules rather than overwriting the existing rules. Yes I could copy/paste/modify the Spacy rule source code, but that seems like a brittle and non-future-proof way about it. These two "problems" may be my misunderstanding the docs, so I'd appreciate if you could correct me should I be poorly informed.

Going back to the original issue the root cause is the way the default tokenizer collects whitespace: it doesn't distinguish between newlines, tabs and spaces. In my case I'd like to segment those as separate tokens because they each may mean something different in my domain. What I have learned in the meantime is that I can write a token matcher to inspect/modify tokens before the remaining pipeline components are called. Hence an additional option now appears to be available (explained by Ines in this SO post: https://stackoverflow.com/questions/50752266/spacy-tokenize-quoted-string/50775597#50775597) I may be able to inspect tokens and manually split \n\n and \t\t before running my parsers, NER etc.

polm Jan 7, 2022

it modifies nlp.Defaults which the docs said doesn't work on a pre-trained model (en_core_web_trf)

Models are sensitive to tokenization, so if tokenization changes existing tokens it will change model results. But since there were no meaningful results for tokens containing whitespace this seems unlikely to cause issues, unlike most conceivable changes.

it wasn't clear how to modify existing rules rather than overwriting the existing rules

This is just modifying the existing rules... See how it adds to the existing defaults here.

suffixes = nlp.Defaults.suffixes + ['\n$', '\t$']

The strategy in the SO post should also work.

djmechanic Jan 7, 2022
Author

[EDIT: add answer for how to make new token its own head]

Thank you yet again for the information.

Following the SO post I have put together something that (almost) works, see below. The documentation didn't help much with how to inject this code into the pipeline, so I was confused about how the retokenization would impact all my downstream components. The SO post was helpful in explaining that I can add a function to merge/split tokens before the other components see the tokens. This post tokenization function is useful because now I can perform any kind of corrective surgery on the tokenization.

@spacy.language.Language.component("post_tokenize")
def post_tokenize(doc):
    with doc.retokenize() as retokenizer:
        for token in doc:
            # EDIT: Index of token after split (in this case split into 2)
            heads = [(token, 0), (token, 1)]

            # Split \n\n\t\t
            if (match := re.search(r"(\n+)(\t+)", token.text)):
                retokenizer.split(token, list(match.groups()), heads=heads)

    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("post_tokenize", first=True)
doc = nlp("One-two\n\n\t\tone-two, this is just a test.")

But I'm doing splitting instead of merging as was given in the SO and docs examples. Again I've had to figure most out through trial and error (the retokenizer docs and source weren't very helpful to my case). The split function wants heads and attributes, but at this stage of the pipeline (directly after the tokenizer) each token is its own head and the tokens have no attributes. So I can safely ignore the attributes? [EDIT: I can point the newly split token's head to itself using the (token, index) notation referred to in the docs].

polm Jan 11, 2022

Sounds like you figured it out, but it's safe to ignore empty attributes.

Uh oh!

Customize whitespace tokenization #9978

Uh oh!

Uh oh!

Replies: 4 comments · 10 replies

Uh oh!

Uh oh!

Uh oh!

djmechanic Jan 4, 2022 Author

Uh oh!

Uh oh!

Uh oh!

djmechanic Jan 6, 2022 Author

Uh oh!

Uh oh!

Uh oh!

djmechanic Jan 4, 2022 Author

Uh oh!

djmechanic Jan 6, 2022 Author

Uh oh!

Uh oh!

Uh oh!

djmechanic Jan 7, 2022 Author

Uh oh!

Uh oh!

Uh oh!

djmechanic Jan 7, 2022 Author

Uh oh!

Replies: 4 comments 10 replies

djmechanic Jan 4, 2022
Author

djmechanic Jan 6, 2022
Author

djmechanic
Jan 4, 2022
Author

djmechanic
Jan 6, 2022
Author

djmechanic Jan 7, 2022
Author

djmechanic Jan 7, 2022
Author