English Sentenciser - Acronyms #8629
SpyriP
started this conversation in
Language Support
Replies: 1 comment
I was running some examples through tokenisation, such as:
I understand that "U.S." should be a single token, as shown above. If so, why isn't it listed among the special cases in the tokenizer exceptions (https://github.com/explosion/spaCy/blame/master/spacy/lang/en/tokenizer_exceptions.py)? Is there a specific exception that covers this?
Another similar case is a single letter followed by a dot, such as "B."
The special cases include lowercase single letters such as "a." and "b.", but no uppercase ones. Since the uppercase forms are not in the tokeniser's special cases, why do we get the following output? Am I missing something?
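To make the question concrete, here is a minimal stdlib-only sketch of one mechanism that could produce this behaviour without any special case: a suffix rule that splits a trailing period only in certain contexts. The pattern `SUFFIX_PERIOD`, the `SPECIAL_CASES` set, and the `split_token` and `tokenize` helpers are all hypothetical names and simplified assumptions for illustration. They are not copied from spaCy and may not match its actual rules.

```python
import re

# Hypothetical, simplified suffix pattern (an assumption for illustration,
# not spaCy's actual rule): split a trailing "." only when it follows a
# lowercase letter or digit, or two consecutive uppercase letters.
SUFFIX_PERIOD = re.compile(r"(?:(?<=[a-z0-9])|(?<=[A-Z][A-Z]))\.$")

# Tiny stand-in for a special-case table holding lowercase single letters only.
SPECIAL_CASES = {"a.", "b.", "c."}


def split_token(chunk):
    """Split one whitespace-delimited chunk into a list of tokens."""
    if chunk in SPECIAL_CASES:
        # Special cases are kept whole and bypass the suffix rule.
        return [chunk]
    match = SUFFIX_PERIOD.search(chunk)
    if match:
        # The trailing period becomes its own token.
        return [chunk[: match.start()], "."]
    return [chunk]


def tokenize(text):
    """Whitespace-split the text, then apply the suffix rule to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_token(chunk))
    return tokens


if __name__ == "__main__":
    for sample in ["I live in the U.S.", "B.", "b.", "It works.", "USA."]:
        print(sample, "->", tokenize(sample))
```

Under these assumed rules, "U.S." and "B." stay whole simply because no suffix pattern matches the final period, while "works." and "USA." are split. Whether spaCy works this way is exactly what I would like to confirm.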