Adding multiple special cases for spaCy tokenizer #12741
Unanswered
vrunm asked this question in Help: Coding & Implementations
Replies: 1 comment
-
If you would like to add multiple rules to your tokenizer, then I would suggest writing a small loop over your list of abbreviations that registers each one with nlp.tokenizer.add_special_case.
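A minimal sketch of that loop-based approach, assuming the abbreviations live in a plain-text file named abbreviations.txt (hypothetical name) with one entry per line, and an English pipeline such as en_core_web_sm:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

# Load one abbreviation per line, e.g. "Dr.", "Mr.", "Prof."
with open("abbreviations.txt", encoding="utf-8") as f:
    abbreviations = [line.strip() for line in f if line.strip()]

for abbr in abbreviations:
    # Keep each abbreviation (including its trailing period) as a single token,
    # so the period is not treated as a sentence-final token.
    nlp.tokenizer.add_special_case(abbr, [{ORTH: abbr}])

doc = nlp("Dr. Jane Doe says the results look good. Mr. John Doe agrees.")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i}: {sent.text}")
```

Because each abbreviation stays a single token, its trailing period is no longer a candidate sentence boundary, so the segmentation comes out as expected.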
-
I am trying to segment text from a txt file (UTF-8) into sentences using spaCy. Sentences containing abbreviations (e.g., Mr., Dr., etc.) get split at the abbreviation when they should be read as a single sentence. For example, 'Dr. Jane Doe says' becomes Sentence 0: 'Dr.' and Sentence 1: 'Jane Doe says'.
I tried using nlp.tokenizer.add_special_case to recognize Dr. as a special case, and it works for that one case (code below). But because the rest of the dataset contains many abbreviations, I would like to keep a list of abbreviations (preferably loaded from a text file, though a plain list is fine) and add everything on that list as special cases.
This is my code:
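A minimal sketch of the single-special-case version described above (the model name and the sample sentence are assumptions, not the original code):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

# Register "Dr." as one token so its period is not treated as a sentence boundary.
nlp.tokenizer.add_special_case("Dr.", [{ORTH: "Dr."}])

doc = nlp("Dr. Jane Doe says the treatment works.")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i}: {sent.text}")
```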