Abbreviations Expansion #8541
Replies: 2 comments
-
The exception with the space should work in spaCy v3.x. It isn't supported by the tokenizer in spaCy v2. It looks like this particular additional exception was added after You can potentially modify the tokenizer in
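For spaCy v2, where the spaced exception is not supported, one possible workaround is to collapse the spaced variants before the text reaches the tokenizer, so that the existing "EE.UU." and "Ee.Uu." exceptions can match. This is only a minimal sketch and not something proposed in this thread: the helper name `normalize_eeuu` and the regular expression are assumptions for illustration, and only the two spaced variants quoted in the question are handled.

```python
import re

# Hypothetical pre-processing helper (not part of spaCy).
# Matches only the two spaced variants from the question:
# "EE. UU." and "Ee. Uu." (any amount of whitespace after the first period).
_SPACED_EEUU = re.compile(r"\b(?:EE\.\s+UU\.|Ee\.\s+Uu\.)")


def normalize_eeuu(text: str) -> str:
    """Remove the whitespace inside spaced variants of the abbreviation."""
    return _SPACED_EEUU.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


if __name__ == "__main__":
    # The normalized string would then be passed to the tokenizer.
    print(normalize_eeuu("esta bien EE. UU."))
    print(normalize_eeuu("esta bien Ee. Uu."))
```

Note that this changes the input text, so character offsets in the resulting tokens no longer line up with the original string.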
-
Thank you, that helps a lot.
-
Inside "spacy/lang/es/tokenizer_exceptions.py" for Spanish, there are the following options for the abbreviation of "Estados Unidos":
Can a tokenizer exception be a composite expression (i.e. multiple characters separated by a space)? I ran some tests using the above expressions, and it doesn't seem to work when I add the following:
"Ee.Uu.",
"EE. UU.",
"Ee. Uu.",
returns:
['esta', 'bien', 'EE', '.', 'UU', '.']
returns:
['esta', 'bien', 'Ee', '.', 'Uu', '.']
Only the expression "EE.UU." (without space) returns the correct output:
['esta', 'bien', '10', 'EE.UU.']
Should we expand them in the 'expand' module instead?