Training a language model with a custom tokenizer function on .spacy file #12772
-
Hi everyone, I am working on putting together a spaCy model for Tibetan. I have a corpus large enough to train a model on, including a fairly large gold-standard dataset, but tokenization remains an issue in spaCy. There is an independently developed tokenizer that I want to wire into my config file. However, the pipeline needs to train on a binary .spacy file converted from CoNLL-U data, while my custom tokenizer expects to read plain text. Alternatively, if there is a way to substitute the default tokenizer (which can be English, or anything else really) with the one I plan to use post hoc and overwrite the model, that would also work. Thanks in advance!
Replies: 1 comment
-
Thank you for your question. You can indeed replace the default tokenizer. This entails writing a function that creates the custom tokenizer (which should implement the tokenizer API) and exposing it as an entry point. How to use a custom tokenizer is described here:
https://spacy.io/usage/linguistic-features#custom-tokenizer
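
Here is a minimal sketch of that pattern, following the registered-function approach from the docs. The name `segment_tibetan` is a hypothetical stand-in for your external Tibetan tokenizer (stubbed here with whitespace splitting so the example runs), and `tibetan_tokenizer` is just an illustrative registry name:

```python
import spacy
from spacy.tokens import Doc


def segment_tibetan(text):
    # Hypothetical stand-in for the external Tibetan tokenizer;
    # replace with the real library call. Whitespace splitting is
    # only a placeholder so this sketch is runnable.
    return text.split()


class TibetanTokenizer:
    """Callable implementing spaCy's tokenizer API: takes a text, returns a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = segment_tibetan(text)
        return Doc(self.vocab, words=words)


@spacy.registry.tokenizers("tibetan_tokenizer")
def create_tibetan_tokenizer():
    # The registered function returns a callable that receives the
    # nlp object and builds the tokenizer from its vocab.
    def make_tokenizer(nlp):
        return TibetanTokenizer(nlp.vocab)

    return make_tokenizer
```

The registered name can then be referenced from the training config:

```ini
[nlp.tokenizer]
@tokenizers = "tibetan_tokenizer"
```

For the trained pipeline to load later, the registered function has to be importable, which is where exposing it as an entry point (as mentioned above) comes in.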