Whitespace tokenizer when training on cli #4991
Replies: 3 comments
If you train with the CLI, then when you load your newly trained model, it will try to use the default tokenizer for the language you provided (e.g., English). One option is adding a custom whitespace tokenizer, see: https://spacy.io/usage/linguistic-features#custom-tokenizer-example

This kind of tokenizer replacement won't be saved when you serialize the model, so if you need to save/distribute the model and have it run without extra code, you'll need to modify the default tokenizer's properties (prefix, suffix, infix, etc.) to get it to only tokenize on whitespace, and serialize that. The basic idea for a whitespace-only tokenizer is that you just tell it not to do anything with affixes or special cases. Modify the tokenizer like this and then save the model to a new location with `nlp.to_disk()`.

I'd recommend the option that modifies the tokenizer properties, but a third option would be to initialize your documents with pre-tokenized words instead of raw text.
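A minimal sketch of the property-modification idea (the blank English pipeline below is a stand-in for your trained model, and the paths are placeholders): a `Tokenizer` constructed with only the shared vocab has no prefix/suffix/infix patterns and no special-case rules, so it splits on whitespace alone.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Stand-in for your trained pipeline; in practice you would use
# nlp = spacy.load("path/to/your/model") instead.
nlp = spacy.blank("en")

# A Tokenizer with no affix patterns and no special-case rules
# splits on whitespace only.
nlp.tokenizer = Tokenizer(nlp.vocab)

doc = nlp("don't re-tokenize ABC-123")
print([t.text for t in doc])  # → ["don't", 're-tokenize', 'ABC-123']

# Serialize the modified pipeline to a new location so the
# tokenizer settings travel with the model:
# nlp.to_disk("path/to/whitespace_model")
```

Note that punctuation and contractions stay attached to their tokens, which is exactly the point when you do your own tokenization upstream.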
I followed your suggestion of modifying the tokenizer properties. I used the code you provided, replaced the tokenizer on the model, and then saved it with `nlp.to_disk()`. The problem is that the saved tokenizer still splits on the dash character, and I only want to tokenize on whitespace: a hyphenated token ends up split at the hyphen. What do I need to change in the code?
Whoops, that's a bug! When it loads the exceptions from the saved model, it treats the saved empty rules as if no rules were provided, so the default exceptions come back.

If you want to compile from source, change this line (line 579 at commit f6ed07b) to:

`if "rules" in data:`

If you don't want to compile from source, the third option above, where you create the `Doc` yourself from pre-tokenized words, will avoid the tokenizer entirely.
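For reference, a sketch of that third option (again with a blank English pipeline standing in for the trained model): building the `Doc` directly from your own word list means the tokenizer is never invoked, so its serialization behavior doesn't matter.

```python
import spacy
from spacy.tokens import Doc

# Stand-in for the trained model; in practice, spacy.load("path/to/model").
nlp = spacy.blank("en")

# Tokenize however you like upstream (here: a plain whitespace split),
# then construct the Doc directly from those words.
words = "tag ABC-123 without re-tokenizing".split(" ")
doc = Doc(nlp.vocab, words=words)

# Run the pipeline components (e.g., the NER) on the pre-built Doc:
for name, proc in nlp.pipeline:
    doc = proc(doc)

print([t.text for t in doc])  # → ['tag', 'ABC-123', 'without', 're-tokenizing']
```

Since the tokens are supplied explicitly, `ABC-123` survives intact regardless of any punctuation rules the model was saved with.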
I'm training NER models using the `spacy train` command. When using this model to make predictions (by running `nlp = spacy.load("path/to/model")` and `doc = nlp(sentence)` in Python), it first performs tokenization (I assume, as described here). Is it possible to change the tokenization to be based on whitespace only? I ask because I perform the tokenization myself prior to making predictions.

Also, when training, is there any re-tokenization of the input?