This is a bit tricky because what you want to do here is remove an existing special case.

First, you can figure out what's going on using the tokenizer's explain method:

nlp.tokenizer.explain("Secretario/a.")
# => [('TOKEN', 'Secretario'), ('INFIX', '/'), ('SPECIAL-1', 'a.')]

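Here, SPECIAL means that "a." matched a tokenizer exception (a special case), which is why it was kept as a single token instead of being split into "a" and ".".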

You can remove a tokenizer exception by modifying tokenizer.rules, like this:

rules = nlp.tokenizer.rules
del rules["a."]
nlp.tokenizer.rules = rules

In your case, it sounds like you might want to do this for every lowercase letter. That should be easy to do in a for loop, but if you want to remove more exceptions then it might be better to make your own language subclass w…
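
A minimal sketch of that loop, assuming the exceptions you want to drop are exactly the ASCII keys "a." through "z." (any other exceptions, including non-ASCII letters, are left alone). The helper name drop_letter_exceptions is made up for illustration and is not part of spaCy; it only filters a plain dict, so missing keys are not an error:

```python
import string


def drop_letter_exceptions(rules):
    """Return a copy of `rules` without the "a." ... "z." entries.

    `rules` is expected to be a dict keyed by the exception string,
    like the one returned by nlp.tokenizer.rules.
    """
    # Hypothetical helper: build the set of keys to drop, then filter.
    unwanted = {f"{letter}." for letter in string.ascii_lowercase}
    return {key: value for key, value in rules.items() if key not in unwanted}


# Usage, assuming `nlp` is an already loaded pipeline:
# nlp.tokenizer.rules = drop_letter_exceptions(nlp.tokenizer.rules)
```

As with the single-key version above, the filtered dict has to be assigned back to nlp.tokenizer.rules for the change to take effect. Running nlp.tokenizer.explain on your example again is a quick way to check that the SPECIAL entry is gone.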

Answer selected by adrianeboyd
Labels: lang / es (Spanish language data and models), feat / tokenizer (Feature: Tokenizer)
2 participants