Whitespace tokenizer when training on cli #4991
Replies: 3 comments
If you train with the CLI, then when you load your newly trained model, it will try to use the default tokenizer for the language you provided (e.g., English). One option is adding a custom whitespace tokenizer, see: https://spacy.io/usage/linguistic-features#custom-tokenizer-example

This kind of tokenizer replacement won't be saved when you serialize the model, so if you need to save/distribute the model and have it run without extra code, you'll need to modify the default tokenizer's properties (prefix, suffix, infix, etc.) to get it to only tokenize on whitespace, and serialize that. The basic idea for a whitespace-only tokenizer is that you just tell it not to do anything with affixes or special cases. Modify the tokenizer like this and then save the model to a new location with `nlp.to_disk()`.

I'd recommend the option that modifies the tokenizer properties, but a third option would be to initialize your documents with pre-tokenized words instead of raw text.
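A minimal sketch of the property-modification idea (the blank English pipeline below is a stand-in for your trained model, and the paths are placeholders): a `Tokenizer` constructed with only the shared vocab has no prefix/suffix/infix patterns and no special-case rules, so it splits on whitespace alone.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Stand-in for your trained pipeline; in practice you would use
# nlp = spacy.load("path/to/your/model") instead.
nlp = spacy.blank("en")

# A Tokenizer with no affix patterns and no special-case rules
# splits on whitespace only.
nlp.tokenizer = Tokenizer(nlp.vocab)

doc = nlp("don't re-tokenize ABC-123")
print([t.text for t in doc])  # → ["don't", 're-tokenize', 'ABC-123']

# Serialize the modified pipeline to a new location so the
# tokenizer settings travel with the model:
# nlp.to_disk("path/to/whitespace_model")
```

Note that punctuation and contractions stay attached to their tokens, which is exactly the point when you do your own tokenization upstream.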
I followed your suggestion of modifying the tokenizer properties. I used the code you provided, replaced the tokenizer on the model, and then saved it with `nlp.to_disk()`. The problem is that the saved tokenizer still splits on the dash character, and I only want to tokenize on whitespace: a hyphenated token ends up split at the hyphen. What do I need to change in the code?
Whoops, that's a bug! When it loads the exceptions from the saved model, it treats the saved empty rules as if no rules were provided, so the default exceptions come back.

If you want to compile from source, change this line (line 579 at commit f6ed07b) to:

`if "rules" in data:`

If you don't want to compile from source, the third option above, where you create the `Doc` yourself from pre-tokenized words, will avoid the tokenizer entirely.
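For reference, a sketch of that third option (again with a blank English pipeline standing in for the trained model): building the `Doc` directly from your own word list means the tokenizer is never invoked, so its serialization behavior doesn't matter.

```python
import spacy
from spacy.tokens import Doc

# Stand-in for the trained model; in practice, spacy.load("path/to/model").
nlp = spacy.blank("en")

# Tokenize however you like upstream (here: a plain whitespace split),
# then construct the Doc directly from those words.
words = "tag ABC-123 without re-tokenizing".split(" ")
doc = Doc(nlp.vocab, words=words)

# Run the pipeline components (e.g., the NER) on the pre-built Doc:
for name, proc in nlp.pipeline:
    doc = proc(doc)

print([t.text for t in doc])  # → ['tag', 'ABC-123', 'without', 're-tokenizing']
```

Since the tokens are supplied explicitly, `ABC-123` survives intact regardless of any punctuation rules the model was saved with.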
I'm training NER models using the `spacy train` command. When using this model to make predictions (by running `nlp = spacy.load("path/to/model")` and `doc = nlp(sentence)` in Python), it first performs tokenization (I assume, as described here). Is it possible to change the tokenization to be based on whitespace only? I ask because I perform the tokenization myself prior to making predictions.

Also, when training, is there any re-tokenization of the input?