What is the best way to segment sentences around newline characters? #8400

raqibhayder · 2021-06-16T04:07:54Z

I split the text on newline characters and process each line separately, creating a doc from each line, like so:

text = 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
for line_number, line in enumerate(text.splitlines()):
  print(f"Line {line_number+1}: {line}")

Output:

Line 1: Line1
Line 2: Line2
Line 3: 
Line 4: 
Line 5: Line5
Line 6: 
Line 7: Line7
Line 8: Line8

I need to keep track of line numbers, and this worked well on small documents (100 to 200 lines), where processing times stayed within the required threshold. But as documents get bigger (around 1000 lines), splitting the text and creating a doc from each line becomes very slow, which makes sense given the constant overhead of creating each doc.

What made sense was to create a doc from the entire text at once and use sentence segmentation to preserve the line numbers, so that line n of the text corresponds to sentence n of the doc. In my rudimentary tests, this was ~5x faster. Below is where I have gotten to:

import spacy
from spacy import util
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc


def custom_sentencizer(doc: Doc) -> Doc:
    sentencizer = Sentencizer(punct_chars=["\n"])
    return sentencizer(doc)

nlp = spacy.load("en_core_web_sm")

# Add a sentencizer that splits sentences on new line characters
nlp.add_pipe(custom_sentencizer, name="sentencizer", before='parser')

# Handle newline as an infix token
infixes = nlp.Defaults.infixes + (r"\n",)
infix_regex = util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

text = 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
print(f"Starting with: {repr(text)}")
doc = nlp(text)
for line_number, line in enumerate(list(doc.sents)):
    print(f"Line {line_number + 1} : {repr(line.text)}")
print("--------------------------------------------------")
for token in doc:
    print(f"{token.is_sent_start} : {repr(token.text)}")

Output:

Starting with: 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
Line 1 : 'Line1\n'
Line 2 : 'Line2\n\n\n'
Line 3 : 'Line5\n\n'
Line 4 : 'Line7\n'
Line 5 : 'Line8'
--------------------------------------------------
True : 'Line1'
False : '\n'
True : 'Line2'
False : '\n'
False : '\n'
False : '\n'
True : 'Line5'
False : '\n'
False : '\n'
True : 'Line7'
False : '\n'
True : 'Line8'

I need 8 sentences, but there are only 5, because runs of consecutive newline characters are folded into the preceding sentence. A workaround I tried was to preprocess the text and put a placeholder on every empty line.

def preprocess_text(text: str) -> str:
  processed_lines = []
  for line in text.splitlines():
    if not line or line.isspace():
      processed_lines.append("EMPTY_STRING")
    else:
      processed_lines.append(line.strip())
  return "\n".join(processed_lines)

import spacy
from spacy import util
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc


def custom_sentencizer(doc: Doc) -> Doc:
    sentencizer = Sentencizer(punct_chars=["\n"])
    return sentencizer(doc)

nlp = spacy.load("en_core_web_sm")

# Add a sentencizer that splits sentences on new line characters
nlp.add_pipe(custom_sentencizer, name="sentencizer", before='parser')

# Handle newline as an infix token
infixes = nlp.Defaults.infixes + (r"\n",)
infix_regex = util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

text = 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
print(f"Starting with: {repr(text)}")
print(f"After preprocessing: {repr(preprocess_text(text))}")
doc = nlp(preprocess_text(text))

for line_number, line in enumerate(list(doc.sents)):
    print(f"Line {line_number + 1} : {repr(line.text)}")

Output:

Starting with: 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
After preprocessing: 'Line1\nLine2\nEMPTY_STRING\nEMPTY_STRING\nLine5\nEMPTY_STRING\nLine7\nLine8'
Line 1 : 'Line1\n'
Line 2 : 'Line2\n'
Line 3 : 'EMPTY_STRING\n'
Line 4 : 'EMPTY_STRING\n'
Line 5 : 'Line5\n'
Line 6 : 'EMPTY_STRING\n'
Line 7 : 'Line7\n'
Line 8 : 'Line8'

The problem with adding placeholder text like "EMPTY_STRING" is that it will likely throw off the model, which was not trained on text containing such placeholders.

Would it be possible to insert a space between consecutive newline characters (turning \n\n into \n \n) and mark that space token as the beginning of a sentence by setting token.is_sent_start = True?

polm · 2021-06-16T04:56:32Z

I would recommend against taking this approach. Sentence segmentation is not designed to be controlled like that. If prefixes like "Line 1:" are not part of your original text, they are also going to hurt performance.
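If the goal is only to know which line a token or entity came from when the whole text is processed as one doc, one alternative that leaves sentence boundaries alone is to look the line up from character offsets. The following is a minimal sketch under that assumption; `line_start_offsets` and `line_number_at` are hypothetical helper names, not part of spaCy, and only "\n" is treated as a line break.

```python
from bisect import bisect_right
from typing import List


def line_start_offsets(text: str) -> List[int]:
    """Return the character offset at which each line of `text` begins."""
    offsets = [0]
    for index, char in enumerate(text):
        if char == "\n":
            # The next line starts right after the newline character.
            offsets.append(index + 1)
    return offsets


def line_number_at(offsets: List[int], char_index: int) -> int:
    """Map a character offset to its 1-based line number."""
    return bisect_right(offsets, char_index)


text = 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
offsets = line_start_offsets(text)

# Stand-in for spaCy offsets: with a real doc, token.idx or ent.start_char
# would be passed instead of text.index(...).
for word in ("Line1", "Line5", "Line8"):
    print(word, "is on line", line_number_at(offsets, text.index(word)))
```

With a spaCy doc built from the same text, `token.idx` and `ent.start_char` are character offsets into the original string, so they can be passed to `line_number_at` directly.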

When you feel that spaCy is slow there are a couple of things we recommend:

Use nlp.pipe. This manages internal resources used when creating a doc so that creating a bunch of docs is faster. Usage looks like this:

texts = ["this is doc 1", "this is another doc"]  # ... and so on
for doc in nlp.pipe(texts):
    ...  # do stuff with each doc

Disable pipeline components you aren't using. Combined with the above that looks like this:

for doc in nlp.pipe(texts, disable=["parser"]):
    ...  # do stuff with each doc

This is in a tips section in the docs.
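For the line-number use case in the question, the two tips above could be combined roughly as follows. This is a sketch, not a definitive implementation: `numbered_nonblank_lines` and `docs_with_line_numbers` are hypothetical helpers, and skipping blank lines is an assumption (they carry no text for the model, and the original line numbers are kept anyway). It relies on the `as_tuples=True` option of `nlp.pipe`, which accepts (text, context) pairs and yields (doc, context) pairs.

```python
from typing import Iterator, Tuple


def numbered_nonblank_lines(text: str) -> Iterator[Tuple[str, int]]:
    """Yield (line, line_number) pairs, 1-based, skipping blank lines."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            yield line, line_number


def docs_with_line_numbers(nlp, text: str):
    """Yield (line_number, doc) pairs, one doc per non-blank line."""
    # The line number travels through nlp.pipe as the context of each text.
    pairs = numbered_nonblank_lines(text)
    for doc, line_number in nlp.pipe(pairs, as_tuples=True):
        yield line_number, doc


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#
# import spacy
# nlp = spacy.load("en_core_web_sm")
# text = 'Line1\nLine2\n\n\nLine5\n\nLine7\nLine8'
# for line_number, doc in docs_with_line_numbers(nlp, text):
#     print(line_number, [(ent.text, ent.label_) for ent in doc.ents])
```

Each doc still corresponds to exactly one line, so no custom sentence segmentation or placeholder text is needed to recover the line number.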

Also, can you clarify what you're actually doing that using sentences works? Are you using an NER model?

2 replies

polm Jun 16, 2021

I wrote an FAQ entry on this topic, since we have people run into it sometimes. #8402

raqibhayder Jun 16, 2021
Author

> Use nlp.pipe. This manages internal resources used when creating a doc so that creating a bunch of docs is faster.

Just tested this. Works like a charm.

> Also, can you clarify what you're actually doing that using sentences works? Are you using an NER model?

Yes, I am using an NER model.

@polm: Thank you for the FAQ and the quick response.
