Sentence segmentation performance #8532

rshahrabani · 2021-06-28T19:35:42Z

I am using spaCy v3.0 and the following code to segment text into sentences:

import spacy

nlp = spacy.load('en_core_web_lg')
nlp.disable_pipe("parser")
nlp.enable_pipe("senter")

# read the text from the attached file
with open("mic.txt", "r") as f:
    text = f.read()

# takes 15 seconds to process
doc = nlp(text)

The attached file takes roughly 15 seconds for processing into individual sentences. Is there any way we can boost the performance of this feature?

polm · 2021-06-29T04:31:02Z

Your file is 432 KB and about 67k whitespace-separated words, which is solid novel-length text. I don't think 15s for processing that into sentences is slow. That said, there are some things you can do to speed it up.

For one, you can use the rule-based Sentencizer instead of the senter. That would look like this:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
for sent in nlp(text).sents:
    print(sent.text)  # replace with your own processing

Additionally, your entire file is a single line and seems to consist of concatenated XML files. It won't help much with the sentencizer, but in general you might have more success pre-splitting your text into smaller chunks, such as the individual XML files.
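As a rough sketch of the pre-splitting idea, the helper below cuts the text at every XML declaration. It assumes each concatenated file starts with `<?xml`, which I haven't verified for your data, and `split_documents` is a made-up name, so adjust the pattern to whatever actually separates your documents.

```python
import re

# Assumption: every concatenated document begins with an XML declaration.
# The lookahead splits *before* each declaration so it stays with its document.
DOC_START = re.compile(r"(?=<\?xml\b)")


def split_documents(text):
    """Split text at each XML declaration and drop empty pieces."""
    return [part for part in DOC_START.split(text) if part.strip()]


# Usage with an already loaded pipeline (not run here):
# for doc in nlp.pipe(split_documents(text)):
#     for sent in doc.sents:
#         ...
```

Passing the chunks to nlp.pipe lets spaCy process them in batches rather than as one very long Doc.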

Also, I'm not sure if it's intentional, but note that even with the parser disabled, other components in the pipeline still take time to run. Take a look at #8402 for general speed advice.

4 replies

rshahrabani Jun 30, 2021
Author

I excluded every other component from the pipeline, leaving only the senter, as follows:

nlp = spacy.load('en_core_web_lg', exclude=['tagger', 'parser', 'ner', 'lemmatizer', 'attribute_ruler', 'tok2vec'])
nlp.enable_pipe("senter")

and it sped up processing considerably (2.5 seconds vs. 15 seconds).

However, I noticed some errors in the segmentation (the same errors were also present with the full pipeline, and with the full pipeline excluding the parser).

For some reason, the segmenter is creating sentence breaks (marked with the text <sentence_break>) after an opening quotation mark as below (these sentences are contiguous in the mic.txt file previously attached to this thread):

"
<sentence_break>
Agreement" shall have the meaning given to it in the preamble to this Agreement. "
<sentence_break>
Airport Authority" means any Person or entity responsible for the management, operation, or oversight of one or more airports. "
<sentence_break>
Airport Lease" means any lease or sublease pursuant to which an Airport Authority is the lessor. "
<sentence_break>
Alternative Financing" shall have the meaning given to it in Section 6.18(a).
<sentence_break>
"
<sentence_break>
Alternative Transaction Agreement" shall have the meaning given to it in Section 6.21(c).

Do you know what might be causing this?

rshahrabani Jun 30, 2021
Author

The input text for my last post is:

'"Agreement" shall have the meaning given to it in the preamble to this Agreement. "Airport Authority" means any Person or entity responsible for the management, operation, or oversight of one or more airports. "Airport Lease" means any lease or sublease pursuant to which an Airport Authority is the lessor. "Alternative Financing" shall have the meaning given to it in Section 6.18(a). "Alternative Transaction Agreement" shall have the meaning given to it in Section 6.21(c).'

polm Jul 1, 2021

Trouble around quotation marks may have to do with the way the training data is prepared - the original data is pre-tokenized and doesn't include spaces, so it can have trouble with opening and closing quotation marks. We are looking at strategies to improve this.

Quotation marks in general make sentence tokenization hard. Your sentences are easy cases, and ideally should be handled correctly, but for things like the below it can be unclear how many sentences you want:

He said, "I see that." # probably one sentence
He said, "I see that. I can see it clearly now." # one sentence? three sentences?

adrianeboyd Jul 1, 2021

The senter should be relatively easy to fine-tune or train from scratch.

If you'd like better performance on these kinds of sentences with ASCII quotes, I'd recommend training from scratch following the suggestions related to adding SPACY here: #6926 (comment)

It's pretty easy to generate training data by using the senter (or parser or whatever) to split texts into sentences and then correcting any incorrect boundaries. I'd typically recommend working with a simple text format with one sentence per line for this part (or you can use other tools like the Prodigy v1.11 nightly, which has new sent recipes). Then you can generate training docs from a list of sentences as described in the top post in the same thread using Doc.from_docs: #6926

In the future, we'll see if it makes sense to add SPACY by default as a feature to models like en_core_web_sm, which I think would help in this case.

rshahrabani · 2021-07-07T02:14:41Z

Consider the following sentence:

(a) The closing of the sale of the Shares (the Closing) shall take place at the offices of White & Case LLP, 1221 Avenue of the Americas, New York, New York 10020, or by electronic transmittal of executed documents, as soon as practicable, but in any event, at 10:00 a.m. (New York City time) on the second (2nd) Business Day after the last of the conditions set forth in Article VII.

which gets broken up into 2 sentences (sentence break after (a)):

(a) <sentence_break> The closing of the sale of the Shares (the Closing) shall take place at the offices of White & Case LLP, 1221 Avenue of the Americas, New York, New York 10020, or by electronic transmittal of executed documents, as soon as practicable, but in any event, at 10:00 a.m. (New York City time) on the second (2nd) Business Day after the last of the conditions set forth in Article VII.

Is there any way to correct this?

1 reply

polm Jul 7, 2021

As Adriane mentioned, it's not very difficult to train your own model.

If your only problem is sentence breaks after list entries like that, then you can just add some post-processing that checks if a sentence looks like "(a)", and if so, combines it with the next sentence.
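A minimal sketch of that post-processing, working on plain sentence strings (for example `[s.text for s in doc.sents]`). The marker pattern and the function name `merge_list_markers` are assumptions for illustration; widen the regex if your documents use other list styles.

```python
import re

# Assumption: list markers look like "(a)", "(iv)" or "(2)".
MARKER = re.compile(r"^\(\w{1,4}\)$")


def merge_list_markers(sentences):
    """Attach sentences that are only a list marker to the sentence that follows."""
    merged = []
    pending = ""  # marker(s) waiting for the next real sentence
    for sent in sentences:
        text = sent.strip()
        if MARKER.match(text):
            pending = f"{pending} {text}".strip()
            continue
        merged.append(f"{pending} {text}".strip())
        pending = ""
    if pending:  # a trailing marker with nothing after it is kept as-is
        merged.append(pending)
    return merged


sentences = ["(a)", "The closing of the sale of the Shares shall take place.", "Next sentence."]
print(merge_list_markers(sentences))
```

This only rejoins the text. If you need the merged result as spaCy spans, you could apply the same check to doc.sents and build each span from the marker's start to the next sentence's end.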
