Sentence segmentation performance #8532
-
Your file is 432 KB and about 67k whitespace-separated words, which is solid novel-length text. I don't think 15s is slow for processing that into sentences. That said, there are some things you can do to speed it up. For one, you can use the rule-based Sentencizer instead of the senter.
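As a rough sketch of the Sentencizer approach (not necessarily the exact snippet from the original reply): this assumes a blank English pipeline built with `spacy.blank("en")`, so that only the tokenizer and the rule-based `sentencizer` run. The helper name `split_sentences` is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline, so no tagger, parser, or NER runs.
nlp = spacy.blank("en")

# The rule-based sentencizer splits on sentence-final punctuation.
nlp.add_pipe("sentencizer")


def split_sentences(text):
    """Return the sentences of `text` as a list of strings (hypothetical helper)."""
    doc = nlp(text)
    return [sent.text for sent in doc.sents]


if __name__ == "__main__":
    for sentence in split_sentences("Hello there. How are you?"):
        print(sentence)
```

Because the sentencizer only looks at punctuation, it is much faster than a statistical component, but it will not recover sentence boundaries that are not marked by punctuation.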
Additionally, your entire file is a single line and seems to be concatenated XML files. It won't help much with the sentencizer, but in general you might have more success pre-splitting your text into smaller chunks, like individual XML files. Also, I'm not sure if it's intentional, but note that even if you disable the parser, other components remain in the pipeline and still take time to run. Take a look at #8402 for general speed advice.
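A minimal sketch of the pre-splitting idea, assuming each concatenated XML file begins with an `<?xml` declaration (an assumption about the attached file, not something confirmed in this thread). The function name `split_concatenated_xml` is hypothetical.

```python
import re


def split_concatenated_xml(text):
    """Split a string of concatenated XML documents into one chunk per document.

    Assumes every document starts with an "<?xml" declaration; text without
    any declaration is returned as a single chunk.
    """
    # Zero-width lookahead: split *before* each declaration and keep it in the chunk.
    parts = re.split(r"(?=<\?xml)", text)
    # Drop empty or whitespace-only pieces, e.g. the one before the first declaration.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = '<?xml version="1.0"?><a>one</a><?xml version="1.0"?><b>two</b>'
    for chunk in split_concatenated_xml(sample):
        print(chunk)
```

The resulting chunks can then be passed to `nlp.pipe(chunks)` rather than calling `nlp` once on the whole file, which lets spaCy process the texts in batches.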
-
Consider the following sentence:
(a) The closing of the sale of the Shares (the Closing) shall take place at
the offices of White & Case LLP, 1221 Avenue of the Americas, New York, New
York 10020, or by electronic transmittal of executed documents, as soon as
practicable, but in any event, at 10:00 a.m. (New York City time) on the
second (2nd) Business Day after the last of the conditions set forth in
Article VII.
which gets broken up into 2 sentences (sentence break after (a)).
(a)
*<sentence_break>*
The closing of the sale of the Shares (the Closing) shall take place at the
offices of White & Case LLP, 1221 Avenue of the Americas, New
York, New York 10020, or by electronic transmittal of executed documents,
as soon as practicable, but in any event, at 10:00 a.m. (New York City
time) on the second (2nd) Business Day after the last of the conditions set
forth in Article VII.
Is there any way to correct this?
On Wed, Jun 30, 2021 at 10:47 PM polm wrote:
Trouble around quotation marks may have to do with the way the training
data is prepared - the original data is pre-tokenized and doesn't include
spaces, so it can have trouble with opening and closing quotation marks. We
are looking at strategies to improve this.
Quotation marks in general make sentence tokenization hard. Your sentences
are easy cases, and ideally should be handled correctly, but for things
like the below it can be unclear how many sentences you want:
He said, "I see that." # probably one sentence
He said, "I see that. I can see it clearly now." # one sentence? three sentences?
-
mic.txt
I am using spaCy v3.0 and the following code to segment text into sentences:
The attached file takes roughly 15 seconds for processing into individual sentences. Is there any way we can boost the performance of this feature?