
Hi Bram,

We're definitely happy to get this kind of usage feedback for the components in spacy-experimental!

The experimental, trainable tokenizers take a different approach from how the tokenizer has traditionally been treated in an nlp pipeline: the tokenizer typically isn't part of nlp.pipeline or nlp.pipe_names. Instead, as you point out, you access it with nlp.tokenizer. The new trainable tokenizers, however, are in fact part of the pipeline, and you typically have to apply one trick: set the usual nlp.tokenizer to a dummy:

import spacy

# Replace the default rule-based tokenizer with a character pretokenizer,
# which just splits the text into characters as a placeholder
nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)
