English Sentenciser - Acronyms #8629

SpyriP · 2021-07-07T10:19:52Z

I was running some examples through tokenisation, such as:

nlp.tokenizer.explain("U.S.")
[('TOKEN', 'U.S.')]

I understand that "U.S." should be one token, as shown above. Given that, why isn't it listed among the special cases in the
tokenizer exceptions: https://github.com/explosion/spaCy/blame/master/spacy/lang/en/tokenizer_exceptions.py? Is there a specific exception that handles it?

A similar case is a single letter followed by a dot, like "B."
The special cases include lowercase single letters, like "a.", "b.", etc., but no uppercase ones. Since the uppercase letters are not among the tokeniser's special cases, why do we get the following output? Am I missing something?

nlp.tokenizer.explain("He is working on the plan B.")
[('TOKEN', 'He'), ('TOKEN', 'is'), ('TOKEN', 'working'), ('TOKEN', 'on'), ('TOKEN', 'the'), ('TOKEN', 'plan'), ('TOKEN', 'B.')]

adrianeboyd · 2021-07-07T11:14:14Z

TOKEN means that the token does not come from an exception (those are labeled SPECIAL). There are suffix rules that don't split off . after certain patterns (a single capital letter, as in a middle initial, is one of them), so the form is left as is and comes out as a token because nothing else was split off. TOKEN is what is left after all the patterns/rules are applied.
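
To make the difference between SPECIAL and TOKEN concrete, here is a rough, standard-library-only sketch of the idea. It is not spaCy's implementation: the `SPECIAL_CASES` table, the `SUFFIX_RE` pattern, and the `explain` helper are all hypothetical stand-ins, and the suffix rule only approximates the behaviour described above (a trailing "." is split off after a lowercase letter, a digit, or two capital letters, but not after a single capital letter).

```python
import re

# Hypothetical special-case table: a toy stand-in for tokenizer exceptions.
SPECIAL_CASES = {"a.": ["a."], "b.": ["b."]}

# Assumed suffix rule (an approximation, not spaCy's actual pattern):
# split a trailing "." only after a lowercase letter or digit, or after
# two uppercase letters. A single capital letter before "." does not match.
SUFFIX_RE = re.compile(r"(?:(?<=[a-z0-9])|(?<=[A-Z][A-Z]))\.$")


def explain(text):
    """Label each whitespace-separated chunk as SPECIAL, TOKEN, or SUFFIX."""
    labelled = []
    for chunk in text.split():
        suffixes = []
        # Peel off suffixes until a special case matches or no rule applies.
        while chunk and chunk not in SPECIAL_CASES:
            match = SUFFIX_RE.search(chunk)
            if match is None or match.start() == 0:
                break
            suffixes.append(("SUFFIX", chunk[match.start():]))
            chunk = chunk[:match.start()]
        if chunk in SPECIAL_CASES:
            labelled.extend(("SPECIAL", piece) for piece in SPECIAL_CASES[chunk])
        elif chunk:
            # Whatever is left after all rules have been tried is a plain TOKEN.
            labelled.append(("TOKEN", chunk))
        labelled.extend(reversed(suffixes))
    return labelled


if __name__ == "__main__":
    for sample in ["U.S.", "He is working on the plan B.", "plan a.", "the plan."]:
        print(sample, "->", explain(sample))
```

In this toy model "U.S." and "B." fall through every rule and come out whole as TOKEN, "a." is matched by the table and labeled SPECIAL, and "plan." is split into a TOKEN plus a SUFFIX. With spaCy itself, nlp.tokenizer.explain reports which of these paths was actually taken.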
