Assigning a new tokenizer in the experimental models #9949
This is really a minor issue, and I know that I am working with experimental models here (the new UD ones), so please don't take this too urgently. I found that in the new models, the tokenizer (and lemmatizer) have new names to indicate that they are experimental. My usual way of assigning a custom tokenizer looks like this:

```python
from typing import List, Union

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab


def main():
    nlp = spacy.load("nl_udv25_dutchalpino_trf")
    print(nlp.pipe_names)
    nlp.tokenizer = PretokenizedTokenizer(nlp.vocab)
    doc = nlp("I like cookies. Do you like cookies?")
    for sent in doc.sents:
        for token in sent:
            print(token.text, token.pos_, token.tag_)


class PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary (see https://spacy.io/api/vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input `inp`.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        words = inp.split() if isinstance(inp, str) else list(inp)
        if not words:
            return Doc(self.vocab, words=[], spaces=[])
        # every token but the last is followed by a space; the last one only
        # if the original input ended in whitespace
        spaces = [True] * (len(words) - 1) + [inp[-1].isspace()]
        return Doc(self.vocab, words=words, spaces=spaces)


if __name__ == "__main__":
    main()
```

However, as-is, this code will still run but leads to the whole string being treated as a single token. After inspecting the pipeline components, I found that the tokenizer is now registered as a pipeline component under an experimental name, so assigning to `nlp.tokenizer` no longer has the effect it has for regular pipelines.

As I said, this is not at all high priority. I placed it in the discussion forum because it is perhaps not really a bug but expected behavior within `spacy-experimental`.

PS: a more generic - but still error-prone - solution would be to simply find the component that has `"tokenizer"` in its name:

```python
nlp = spacy.load("nl_udv25_dutchalpino_trf")
try:
    pipe_name = next(pipe for pipe in nlp.pipe_names if "tokenizer" in pipe)
    setattr(nlp, pipe_name, PretokenizedTokenizer(nlp.vocab))
except StopIteration:
    pass
```
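For reference, the `spaces` flag passed to `Doc` marks whether each token is followed by a single space, which is why the last flag above depends on trailing whitespace. A quick self-contained round-trip check (no model download needed):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "cookies."]
# spaces[i] is True iff words[i] is followed by a space in the original text
doc = Doc(Vocab(), words=words, spaces=[True, True, False])
print(doc.text)  # "I like cookies."
```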
Hi Bram,

We're definitely happy to get this kind of usage feedback for the components in `spacy-experimental`!

The experimental, trainable tokenizers represent a different approach to how we've normally been treating the tokenizer in an `nlp` pipeline: a tokenizer typically wasn't part of `nlp.pipeline` or `nlp.pipe_names`. Instead, like you point out, you'd access it with `nlp.tokenizer`. The new trainable tokenizers, however, are in fact part of the pipeline, and you typically have to do one trick: set the usual `nlp.tokenizer` to a dummy, as documented at https://github.com/explosion/spacy-experimental#trainable-character-based-tokenizers

If I understand your use-case correctly, you don't want any tokenizer - neither the default one nor the experimental trainable one. I don't think we're expecting to add a dozen other trainable tokenizers to `spacy-experimental`. But then all subsequent components after the substituted tokenizer will be impacted, so I'm not sure how much sense it makes to even want to do this? What's the use-case of using the experimental models like this?
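Roughly speaking, the dummy only needs to hand the whole, unsegmented text through so that the trainable tokenizer component later in the pipeline can do the actual segmentation. A minimal sketch of that idea (the class name and details here are illustrative, not the actual `spacy-experimental` implementation - see the linked README for the real setup):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


class DummyTokenizer:
    """Sketch of a 'dummy' tokenizer: it performs no segmentation and simply
    wraps the raw text in a single-token Doc, leaving the real tokenization
    to a trainable component inside the pipeline."""

    def __init__(self, vocab: Vocab):
        self.vocab = vocab

    def __call__(self, text: str) -> Doc:
        return Doc(self.vocab, words=[text], spaces=[False])
```

This is also why replacing only `nlp.tokenizer` in these models leaves you with the whole string as a single token unless the in-pipeline tokenizer component re-segments it.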