Adding multiple special cases for spaCy tokenizer #12741
Unanswered
vrunm asked this question in Help: Coding & Implementations
Replies: 1 comment
-
If you would like to add multiple rules to your tokenizer, then I would suggest writing a small loop over your list of abbreviations that registers each one with nlp.tokenizer.add_special_case.
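A minimal sketch of that loop-based approach, assuming the abbreviations live in a plain-text file named abbreviations.txt (hypothetical name) with one entry per line, and an English pipeline such as en_core_web_sm:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

# Load one abbreviation per line, e.g. "Dr.", "Mr.", "Prof."
with open("abbreviations.txt", encoding="utf-8") as f:
    abbreviations = [line.strip() for line in f if line.strip()]

for abbr in abbreviations:
    # Keep each abbreviation (including its trailing period) as a single token,
    # so the period is not treated as a sentence-final token.
    nlp.tokenizer.add_special_case(abbr, [{ORTH: abbr}])

doc = nlp("Dr. Jane Doe says the results look good. Mr. John Doe agrees.")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i}: {sent.text}")
```

Because each abbreviation stays a single token, its trailing period is no longer a candidate sentence boundary, so the segmentation comes out as expected.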
-
I am trying to segment text from a txt file (UTF-8) into sentences using spaCy. Sentences containing abbreviations (e.g., Mr., Dr., etc.) get split at the abbreviation when they should be read as a single sentence. For example, 'Dr. Jane Doe says' becomes Sentence 0: 'Dr.' and Sentence 1: 'Jane Doe says'.
I tried using nlp.tokenizer.add_special_case to recognize Dr. as a special case, and it works for that one case (code below). But because the rest of the dataset contains many abbreviations, I would like to keep a list of abbreviations (preferably loaded from a text file, though a plain list is fine) and add everything on that list as special cases.
This is my code:
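A minimal sketch of the single-special-case version described above (the model name and the sample sentence are assumptions, not the original code):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

# Register "Dr." as one token so its period is not treated as a sentence boundary.
nlp.tokenizer.add_special_case("Dr.", [{ORTH: "Dr."}])

doc = nlp("Dr. Jane Doe says the treatment works.")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i}: {sent.text}")
```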