Training a language model with a custom tokenizer function on .spacy file #12772
-
Hi everyone, I am working on putting together a spaCy model for Tibetan. I have a corpus large enough to train a model on, including a fairly large gold-standard dataset, but tokenization remains an issue in spaCy. There is an independently developed tokenizer that I want to wire into my config file. However, the pipeline needs to train on a binary .spacy file converted from CoNLL-U data, while my custom tokenizer expects to read plain text. Alternatively, if there is a way to substitute the default tokenizer (which can be English, or anything else really) with the one I plan to use post hoc and overwrite the model, that would also work. Thanks in advance!
Replies: 1 comment
-
Thank you for your question. You can indeed replace the default tokenizer. This entails writing a function that creates the custom tokenizer (which should implement the tokenizer API) and exposing it as an entry point. How to use a custom tokenizer is described here:
https://spacy.io/usage/linguistic-features#custom-tokenizer
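
Here is a minimal sketch of that pattern, following the registered-function approach from the docs. The name `segment_tibetan` is a hypothetical stand-in for your external Tibetan tokenizer (stubbed here with whitespace splitting so the example runs), and `tibetan_tokenizer` is just an illustrative registry name:

```python
import spacy
from spacy.tokens import Doc


def segment_tibetan(text):
    # Hypothetical stand-in for the external Tibetan tokenizer;
    # replace with the real library call. Whitespace splitting is
    # only a placeholder so this sketch is runnable.
    return text.split()


class TibetanTokenizer:
    """Callable implementing spaCy's tokenizer API: takes a text, returns a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = segment_tibetan(text)
        return Doc(self.vocab, words=words)


@spacy.registry.tokenizers("tibetan_tokenizer")
def create_tibetan_tokenizer():
    # The registered function returns a callable that receives the
    # nlp object and builds the tokenizer from its vocab.
    def make_tokenizer(nlp):
        return TibetanTokenizer(nlp.vocab)

    return make_tokenizer
```

The registered name can then be referenced from the training config:

```ini
[nlp.tokenizer]
@tokenizers = "tibetan_tokenizer"
```

For the trained pipeline to load later, the registered function has to be importable, which is where exposing it as an entry point (as mentioned above) comes in.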