What is the best way to segment sentences around newline characters? #8400
raqibhayder
started this conversation in
Help: Best practices
-
I would recommend against taking this approach. Sentence segmentation is not designed to be controlled like that. If prefixes like "Line 1:" are not part of your original text, they are also going to hurt performance. When you feel that spaCy is slow there are a couple of things we recommend:
This is in a tips section in the docs. Also, can you clarify what you're actually doing such that working with sentences is enough? Are you using an NER model?
-
For the given text, I split the lines on the newline character and create docs out of each line like so:
Output:
I need to keep track of line numbers and this worked great on small documents (100 to 200 lines) as the processing times were within the given threshold. But as the document size gets bigger (1000 lines), splitting the lines and creating docs out of each line becomes very slow (this makes sense as there is constant overhead when creating each doc).
What made sense was to create a doc from the entire text at once and use sentence segmentation to preserve the line numbers: line n of the text would correspond to sentence n of the doc. In my rudimentary tests, this was ~5x faster. Below is where I have gotten to:
Output:
I need there to be 8 sentences, but there are only 5. A workaround I tried was to preprocess the text and insert a placeholder between consecutive newline characters.
Output
The problem with adding a placeholder text like "EMPTY_STRING" is that it will throw off the model as it was not trained with these placeholders.
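The placeholder preprocessing described above might look roughly like the following. This is a minimal sketch, not the code from the original post; the helper name `fill_empty_lines` is hypothetical, and it treats whitespace-only lines as empty, which is an assumption.

```python
PLACEHOLDER = "EMPTY_STRING"  # marker named in the post; any string would do

def fill_empty_lines(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace each blank (empty or whitespace-only) line with `placeholder`.

    The number of lines is unchanged, so line n of the result still
    corresponds to line n of the input.
    """
    lines = text.split("\n")
    return "\n".join(line if line.strip() else placeholder for line in lines)

filled = fill_empty_lines("a\n\nb\n\n\nc")
```

As noted above, the downside is that the model then sees text it was never trained on.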
Would it be possible to add a space between the two newline characters, turning \n\n into \n \n, and mark the space as the beginning of a sentence by setting token.is_sent_start=True?
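If the goal is only to recover the line number of each token or span, character offsets may be enough, with no changes to sentence boundaries or to the text. The sketch below is an alternative that was not proposed in this thread; the helper names `line_starts` and `line_of` are hypothetical.

```python
from bisect import bisect_right
from typing import List

def line_starts(text: str) -> List[int]:
    """Return the character offset at which each line of `text` begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)  # the next line starts right after the newline
    return starts

def line_of(offset: int, starts: List[int]) -> int:
    """Return the 1-based number of the line containing character `offset`."""
    return bisect_right(starts, offset)

text = "first line\n\nthird line\n\n\nsixth line"
starts = line_starts(text)
third = line_of(text.index("third"), starts)
sixth = line_of(text.index("sixth"), starts)
```

With spaCy, `token.idx` gives a token's character offset within the doc's text and `span.start_char` does the same for a span, so `line_of(token.idx, starts)` would map a token from a single doc built over the whole text back to its line, empty lines included.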