Custom Tokenizer Help #12018

alvaromarlo · 2022-12-22T20:46:03Z

Hi,

I am creating a custom Spanish tokenizer, starting from blank:es, so that I can run add_tokens. I have already updated the blank:es tokenizer to split punctuation marks into separate tokens. Examples:

Secretario/a -> ["Secretario", "/", "a"]
Vitoria-Gasteiz -> ["Vitoria", "-", "Gasteiz"]
Hola. -> ["Hola", "."]

I did this by adding every punctuation symbol as a prefix, infix and suffix, using a regex character class:

prefixes = nlp.Defaults.prefixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~]''']
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

infixes = nlp.Defaults.infixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~]''']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

suffixes = nlp.Defaults.suffixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~]''']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

NOTE: I'm not sure whether this regex syntax does exactly what I want.
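One way to check the character class independently of spaCy is to test it against every ASCII punctuation character with the standard library. This is only a sketch: the pattern is copied as it appears in the post (where the character before `{` is a typographic quote rather than a backtick, possibly a copy-paste artifact), and `unmatched` is a hypothetical helper name.

```python
import re
import string

# Character class copied from the post as it appears there.
POSTED_CLASS = r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~]'''

# Alternative built from the standard library; re.escape handles the escaping.
ESCAPED_CLASS = "[" + re.escape(string.punctuation) + "]"


def unmatched(pattern):
    """Return the ASCII punctuation characters that `pattern` fails to match."""
    compiled = re.compile(pattern)
    return [ch for ch in string.punctuation if not compiled.fullmatch(ch)]


missing_from_posted = unmatched(POSTED_CLASS)
missing_from_escaped = unmatched(ESCAPED_CLASS)
print("posted pattern misses:", missing_from_posted)
print("escaped pattern misses:", missing_from_escaped)
```

Building the class with `re.escape(string.punctuation)` avoids hand-written escapes, at the cost of covering only ASCII punctuation.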

But for a text fragment like this:
Secretario/a.
it splits the tokens into:
["Secretario", "/", "a."]
when I want:
["Secretario", "/", "a", "."]
This problem later causes token mismatch errors.

I am pretty sure this happens because the tokenizer interprets a. as part of an acronym. How can I change this behavior in the tokenizer to get the results I want?

Answered by polm

polm · 2022-12-23T05:01:17Z

This is a bit tricky because what you want to do here is remove an existing special case.

First, you can figure out what's going on using the explain method:

nlp.tokenizer.explain("Secretario/a.")
# => [('TOKEN', 'Secretario'), ('INFIX', '/'), ('SPECIAL-1', 'a.')]

SPECIAL refers to a tokenizer exception in this case.

You can remove a tokenizer exception by modifying tokenizer.rules, like this:

rules = nlp.tokenizer.rules
del rules["a."]
nlp.tokenizer.rules = rules

In your case, it sounds like you might want to do this for every lowercase letter. That should be easy to do in a for loop, but if you want to remove more exceptions then it might be better to make your own language subclass where you control the exceptions directly. See the guide to language subclassing or the Spanish definition for an example of what that looks like.
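The for loop mentioned above might look like the following sketch. It assumes a blank Spanish pipeline and that the single-letter exceptions are keyed as a lowercase ASCII letter followed by a period; letters that have no such exception are simply skipped.

```python
import string

import spacy

nlp = spacy.blank("es")

# Work on a copy so the tokenizer is only updated once, at the end.
rules = dict(nlp.tokenizer.rules)
for letter in string.ascii_lowercase:
    # pop() with a default avoids a KeyError if a letter has no exception.
    rules.pop(f"{letter}.", None)
nlp.tokenizer.rules = rules

tokens = [t.text for t in nlp("Secretario/a.")]
print(tokens)
```

Assigning the whole dictionary back to `nlp.tokenizer.rules` matters here, because that assignment is what makes the tokenizer reload its special cases.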

4 replies

alvaromarlo Dec 23, 2022
Author

Thank you very much! I have managed to solve the issue with your instructions.

I have a new problem with cases that have been tokenized in this way:

5.bis -> ["5.bis"]
Madrid.de -> ["Madrid.de"]
2.de -> ["2.de"]

I would like to split them into tokens as follows:

5.bis -> ["5", ".", "bis"]
Madrid.de -> ["Madrid", ".", "de"]
2.de -> ["2", ".", "de"]

I think this happens because the tokenizer interprets them as part of a URL or email address. Is that right? How could I fix it?

I have been able to solve it by calling add_special_case() for each of them, but I would like to know whether it is possible to define a general rule rather than handling each case individually.

Thanks!
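For reference, the per-case workaround described in this reply could be sketched roughly as follows. The `special_cases` mapping is a hypothetical name introduced for illustration; each entry must split into pieces that concatenate back to the original string.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("es")

# Each string is mapped to the exact pieces it should be split into.
special_cases = {
    "5.bis": ["5", ".", "bis"],
    "Madrid.de": ["Madrid", ".", "de"],
    "2.de": ["2", ".", "de"],
}
for text, pieces in special_cases.items():
    nlp.tokenizer.add_special_case(text, [{ORTH: piece} for piece in pieces])

result = {text: [t.text for t in nlp(text)] for text in special_cases}
for text, tokens in result.items():
    print(text, "->", tokens)
```

This only covers strings listed explicitly, which is why a general rule is preferable when many such forms occur.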

polm Dec 26, 2022

If you want to handle all cases like this, you need to disable the URL match feature entirely, since things like Madrid.de are definitely valid domain names. You can do that like this:

nlp.tokenizer.url_match = lambda x: None

This will give you the results you want in the above examples assuming you have also added . to the list of infixes like in your initial code.

Note that this will cause URLs to be split into many tokens, so if your data contains any URLs you may get weird results. Check your output carefully after making these changes.
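Put together, disabling URL matching and adding the period as an infix might look like this sketch, assuming a blank Spanish pipeline with otherwise default settings:

```python
import spacy

nlp = spacy.blank("es")

# Treat a literal period as an infix, in addition to the default infixes.
infixes = list(nlp.Defaults.infixes) + [r"\."]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# Never report a URL match, so domain-like strings are split like any other text.
nlp.tokenizer.url_match = lambda text: None

examples = ["5.bis", "Madrid.de", "2.de"]
result = {text: [t.text for t in nlp(text)] for text in examples}
for text, tokens in result.items():
    print(text, "->", tokens)
```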

alvaromarlo Jan 2, 2023
Author

Thank you very much! I am getting the desired results.

However, I have one last question. Consecutive periods are tokenized as a single token. How could I make each period its own token?

Current result:
.... -> ["...."]

Desired result:
.... -> [".", ".", ".", "."]

polm Jan 3, 2023

This is actually kind of tricky. There are prefix and suffix rules that match runs of periods as single tokens, but if you remove them the run is still a single token because it consists of the same kind of character (just as "iii" is all alphabetic characters). So you then have to add the single period as a prefix and suffix. Combined with some of your previous changes, that looks like this:

import spacy

nlp = spacy.blank("es")

infixes = nlp.Defaults.infixes + ["[.]"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

prefixes = list(nlp.Defaults.prefixes)
prefixes.remove("\\.\\.+")
prefixes.append("\\.")
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\.\\.+")
suffixes.append("\\.")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

nlp.tokenizer.url_match = lambda x: None

text = "...."
print(nlp.tokenizer.explain(text))
for tok in nlp(text):
    print(tok)

I haven't tested this extensively, so it will likely cause side effects; you should test it on your corpus.
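One way to look for side effects is to compare the default and the customized tokenizer on sample texts and list only the texts whose tokenization changed. The sketch below is an assumption-laden illustration: `build_custom` and `changed_texts` are hypothetical helper names, the configuration repeats the snippet above, and the sample texts (including the URL) are made up and should be replaced with lines from a real corpus.

```python
import spacy


def build_custom(lang="es"):
    """Blank pipeline with the tokenizer changes from the snippet above."""
    nlp = spacy.blank(lang)

    infixes = list(nlp.Defaults.infixes) + ["[.]"]
    nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

    # Replace the "run of periods" rule with a single-period rule.
    prefixes = list(nlp.Defaults.prefixes)
    prefixes.remove("\\.\\.+")
    prefixes.append("\\.")
    nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search

    suffixes = list(nlp.Defaults.suffixes)
    suffixes.remove("\\.\\.+")
    suffixes.append("\\.")
    nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

    nlp.tokenizer.url_match = lambda text: None
    return nlp


def changed_texts(base_nlp, custom_nlp, texts):
    """Yield (text, before, after) for texts whose tokenization differs."""
    for text in texts:
        before = [t.text for t in base_nlp.make_doc(text)]
        after = [t.text for t in custom_nlp.make_doc(text)]
        if before != after:
            yield text, before, after


# Made-up samples; swap in lines from the actual corpus.
samples = ["Hola mundo", "....", "Visita https://example.com/a.b"]
report = list(changed_texts(spacy.blank("es"), build_custom(), samples))
for text, before, after in report:
    print(repr(text))
    print("  before:", before)
    print("  after: ", after)
```

Reviewing only the changed texts keeps the check manageable on a large corpus and makes unexpected splits, such as fragmented URLs, easier to spot.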

Out of curiosity, why do you want to split up periods in sequence like that? Do you have a specific use case in mind?
