Assigning a new tokenizer in the experimental models #9949

BramVanroy · 2021-12-28T17:52:51Z

BramVanroy
Dec 28, 2021

This is really a minor issue, and I know that I am working with experimental models here (the new UD ones). So please don't take this too urgently.

I found that in the new models, the tokenizer (and lemmatizer) have new names to indicate that they are experimental, e.g., experimental_char_ner_tokenizer. In my case, this leads to some issues and I have to adapt my code to work for both upstream spacy and experimental. (Again, this is to be expected and I know that this is "my problem" - I just wish to explain why these names can be important.) I am used to working with pretokenized text. To do so, I would simply replace the tokenizer with a simple naive one that splits on white-space and go from there. So I'd assign this to nlp.tokenizer.

from typing import Union, List

import spacy
from spacy import Vocab
from spacy.tokens import Doc


def main():
    nlp = spacy.load("nl_udv25_dutchalpino_trf")
    print(nlp.pipe_names)
    nlp.tokenizer = PretokenizedTokenizer(nlp.vocab)

    doc = nlp("I like cookies. Do you like cookies?")

    for sent in doc.sents:
        for token in sent:
            print(token.text, token.pos_, token.tag_)


class PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""
    def __init__(self, vocab: Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary (see https://spacy.io/api/vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input `inp`.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        words = inp.split()
        spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
        return Doc(self.vocab, words=words, spaces=spaces)


if __name__ == "__main__":
    main()

However, as-is, this code will still run but lead to the whole string to be seen as a single token. After inspecting the pipeline components, I found that the tokenizer component is called experimental_char_ner_tokenizer and not tokenizer. So replacing nlp.tokenizer with nlp.experimental_char_ner_tokenizer fixes the issue. But considering that there is no tokenizer component in the printed pipeline components, it is a tad confusing: I can set my own tokenizer with self.tokenizer, I do not get a warning or error, and yet the behavior is - for me - unexpected, namely that tokenization seems to be completely disabled.

As I said, this is not at all high priority. I placed it in the discussion forum because it is perhaps not really a bug but expected behavior within spacy-experimental. I bet that once the models go into the main spaCy releases, that the names are streamlined again. However, I thought it relevant to highlight the importance of the pipe names.

PS: a more generic - but still error-prone - solution would be to simply find the component that has tokenizer in its name and replace that. So I'll use that for now

nlp = spacy.load("nl_udv25_dutchalpino_trf")
try:
    pipe_name = next(pipe for pipe in nlp.pipe_names if "tokenizer" in pipe)
    setattr(nlp, pipe_name, PretokenizedTokenizer(nlp.vocab))
except StopIteration:
    pass

Answered by svlandeg

Dec 29, 2021

Hi Bram,

We're definitely happy to get this kind of usage feedback for the components in spacy-experimental!

The experimental, trainable tokenizers represent a different approach to how we've normally been treating the tokenizer in an nlp pipeline: a tokenizer typically wasn't part of nlp.pipeline or nlp.pipe_names. Instead, like you point out, you'd access it with nlp.tokenizer. The new trainable tokenizers however are in fact part of the pipeline, and you typically have to do one trick, setting the typical nlp.tokenizer to a dummy:

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    …

View full answer

svlandeg · 2021-12-29T11:02:54Z

svlandeg
Dec 29, 2021

Hi Bram,

We're definitely happy to get this kind of usage feedback for the components in spacy-experimental!

The experimental, trainable tokenizers represent a different approach to how we've normally been treating the tokenizer in an nlp pipeline: a tokenizer typically wasn't part of nlp.pipeline or nlp.pipe_names. Instead, like you point out, you'd access it with nlp.tokenizer. The new trainable tokenizers however are in fact part of the pipeline, and you typically have to do one trick, setting the typical nlp.tokenizer to a dummy:

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)

as documented at https://github.com/explosion/spacy-experimental#trainable-character-based-tokenizers

If I understand your use-case correctly, you don't want any tokenizer - the default or the experimental trainable one. I don't think we're expecting to be adding a dozen other trainable tokenizers to spacy-experimental, so having specific code to deal with the components experimental_char_ner_tokenizer and experimental_char_tagger_tokenizer might make sense at this point. For instance, you can call nlp.load("nl_udv25_dutchalpino_trf", exclude="experimental_char_ner_tokenizer") and reset nlp.tokenizer to what you want after that, so that the rest of your code would remain the same.

But then all subsequent components after the substituted tokenizer will be impacted, so I'm not sure how much sense it makes to even want to do this? What's the use-case of using the experimental models like this?

3 replies

BramVanroy Dec 29, 2021
Author

I am not sure why I need to set up the "dummy" tokenizer. Is this only needed for training? I followed the blog post, which simply uses .load(), and have not encountered any issues in terms of unexpected tokenization.

Maybe I worded my use-case incorrectly. I do not want to disable tokenization completely, but I want to replace it with a white-space tokenizer. In the example that I posted, the PretokenizedTokenizer "tokenizes" the input string by simply splitting by white space and a Doc is created based on those tokens. So my input strings have already been tokenized (and then glued back together as a single string with whitespace inbetween), I just need to split them up again. The input is e.g., I 'd like momma 's cookies. instead of I'd like momma's cookies., so I can completely by-pass the experimental tokenization and simply split my string by white-space and use those tokens.

The problem is that if I replaced nlp.tokenizer with PretokenizedTokenizer the code did not error out or gave a warning, but apparently it completely disabled tokenization: the whole input sentence was considered as a single token. I think the reason is this:

I replaced the pretokenizer nlp.tokenizer (which would do character-tokenization, Doc of characters) with PretokenizedTokenizer, which outputs a Doc of words
This means that instead of tokenized/separate characters, I pass words to the experimental tokenizer
So instead of a Doc where each "word" is a character, I gave it a Doc where each word is a word
This may have lead to unexpected behavior

I don't think there is a way for spaCy to "catch" that I was doing something unexpected, except for checking in experimental_char_ner_tokenizer that all tokens/words are indeed single characters - which it seems to expect.

The right approach to do what I want, based on your feedback, would seem:

nlp = spacy.load("nl_udv25_dutchalpino_trf", exclude="experimental_char_ner_tokenizer")
nlp.tokenizer = PretokenizedTokenizer(nlp.vocab)

which 1. excludes the experimental tokenizer; 2. replaces the "pretokenizer" with my own white-space tokenizer.

Does this make sense?

svlandeg Dec 29, 2021

That's indeed the approach that I was suggesting.

And I wasn't suggesting that you need to set up the dummy tokenizer - just that you need to be aware of it being there in the pretrained pipeline. It explains the behaviour you've been seeing, as you figured out. So yes, in a nutshell - this behaviour is a feature and not a bug: if you want to change the tokenization of the provided pipelines that include the dummy tokenizer + a trained tokenizer, you should be aware of what exactly lies underneath before you swap things out ;-)

BramVanroy Dec 29, 2021
Author

Perhaps built-in components that expect Docs with single characters as words (like the char tokenizer) could verify that, and throw an error or warning when those words consist of more than one character? That might avoid some head-scratching.

Thanks for the help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Assigning a new tokenizer in the experimental models #9949

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Assigning a new tokenizer in the experimental models #9949

Uh oh!

Uh oh!

BramVanroy Dec 28, 2021

Replies: 1 comment · 3 replies

Uh oh!

svlandeg Dec 29, 2021

Uh oh!

Uh oh!

BramVanroy Dec 29, 2021 Author

Uh oh!

svlandeg Dec 29, 2021

Uh oh!

BramVanroy Dec 29, 2021 Author

BramVanroy
Dec 28, 2021

Replies: 1 comment 3 replies

svlandeg
Dec 29, 2021

BramVanroy Dec 29, 2021
Author

BramVanroy Dec 29, 2021
Author