spaCy v3 creates empty sentences that only contain whitespace (e.g., newlines, spaces) with the default dependency parser. #8715

yangfan0356 · 2021-07-14T11:18:14Z

I noticed that in spaCy v3, a newline (\n) in the input text can produce an extra sentence that contains only whitespace.

Please check the following examples. Case1 is segmented into 3 sentences (incorrect): the second sentence contains just one token, a mix of "\n" and spaces. In Case2, there are two newlines between the sentences, and the result is 2 sentences (correct). spaCy v2 does not seem to segment text like this: the whitespace token is normally kept with the previous sentence, and I have never seen a sentence that contains only whitespace. spaCy v3 may therefore create more empty sentences, which can break existing packages or user code that relies on the sentence count (unless those empty sentences are removed manually afterward).

Does someone have a comment on this? Is there any reliable way to avoid creating sentences that only contain spaces?

Case1:

import spacy

nlp = spacy.load("en_core_web_sm")
text = '''This is a sentence.
This is another sentence.
'''
doc = nlp(text)
assert doc.has_annotation("SENT_START")
# Print each sentence with a 1-based index
for index, sent in enumerate(doc.sents, start=1):
    print(f'{index}: {sent.text}')

Result:
1: This is a sentence.
2:

3: This is another sentence.

Case2:

text='''This is a sentence.

This is another sentence.
'''
Result:
1: This is a sentence.
2:

This is another sentence.

BramVanroy · 2021-07-15T13:40:31Z

I don't think this is a bug per se. You seem to indicate that in these cases you know that every line is a sentence. If that is the case, you can preprocess the data and create a list of sentences by splitting on newlines and filtering out empty entries.
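
The preprocessing described above could look roughly like the following. This is a minimal sketch that only applies when every non-empty line really is one sentence; the helper name `split_lines_into_sentences` is made up for illustration and is not part of spaCy.

```python
from typing import List


def split_lines_into_sentences(text: str) -> List[str]:
    """Treat every non-empty line as one sentence (an assumption about the input)."""
    # splitlines() splits on \n, \r\n and other line terminators;
    # lines that contain only whitespace are filtered out.
    return [line.strip() for line in text.splitlines() if line.strip()]


text = """This is a sentence.
 
This is another sentence.
"""
sentences = split_lines_into_sentences(text)
for index, sentence in enumerate(sentences, start=1):
    print(f"{index}: {sentence}")
```

If further annotation is needed, the resulting strings could then be processed one by one, for example with `nlp.pipe(sentences)`.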

1 reply

yangfan0356 Jul 15, 2021
Author

Hi BramVanroy, thanks for the comment. I am not sure whether it is a bug, but it generates empty sentences (\n) in some cases, as illustrated above. Given a piece of text, I would expect a sentence segmenter to divide it into meaningful sentences, with whitespace (or a newline) belonging to either the previous or the next sentence. In spaCy v2, I never encountered this "empty sentence" problem. The example above only illustrates the issue; in real-world text analysis this situation can occur, and I cannot assume that every line is a sentence...
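
The expectation that whitespace belongs to a neighbouring sentence can be approximated in post-processing. The sketch below works on plain strings and is not how spaCy v2 or v3 handles this internally; the helper name `merge_blank_into_neighbours` is hypothetical, and the sample segments are an assumed approximation of the three sentences reported in Case1.

```python
from typing import Iterable, List


def merge_blank_into_neighbours(sentence_texts: Iterable[str]) -> List[str]:
    """Attach whitespace-only segments to the previous sentence.

    Leading whitespace (with no previous sentence) is attached to the next one.
    Input that is whitespace-only throughout yields an empty list.
    """
    merged: List[str] = []
    pending = ""  # leading whitespace waiting for a sentence to attach to
    for segment in sentence_texts:
        if segment.strip():
            merged.append(pending + segment)
            pending = ""
        elif merged:
            merged[-1] += segment
        else:
            pending += segment
    return merged


# Assumed approximation of the Case1 segmentation
segments = ["This is a sentence.", "\n", "This is another sentence.\n"]
merged = merge_blank_into_neighbours(segments)
for index, sentence in enumerate(merged, start=1):
    print(f"{index}: {sentence!r}")
```

With a real `Doc`, the input could be `(sent.text_with_ws for sent in doc.sents)`, so that the merged strings still concatenate to the original text.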

polm · 2021-07-17T05:51:09Z

This is weird, but behavior around any kind of "interesting" whitespace is typically going to be a little weird. spaCy's training data has been preprocessed so that there are no newlines, and in many cases no whitespace at all. Because spaCy doesn't throw away any of the input, if you include newlines they have to go somewhere, and it's not clear to me that making them their own "sentence" is wrong. If you don't want to count blank sentences, they're easy to discard in post-processing.

Also note that a sentence segmenter is easy to train (in spaCy v3 the trainable component is the senter; the sentencizer itself is rule-based), so if you know you want sentence segmentation in cases like this to behave a particular way, you can prepare some training data and train your own model.
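
Discarding blank sentences in post-processing, as mentioned above, can be a one-line filter. The sketch below is duck-typed: it accepts anything with a `.text` attribute, so with spaCy it would be called as `drop_blank_sentences(doc.sents)`. The helper name is hypothetical, and `SimpleNamespace` objects stand in for spaCy spans so the example runs without downloading a pipeline; their texts are an assumed approximation of the Case1 output.

```python
from types import SimpleNamespace
from typing import Iterable, List


def drop_blank_sentences(sents: Iterable) -> List:
    """Keep only sentences whose text has at least one non-whitespace character."""
    return [sent for sent in sents if sent.text.strip()]


# Stand-ins for the three spans reported in Case1 (assumed texts)
case1 = [
    SimpleNamespace(text="This is a sentence."),
    SimpleNamespace(text="\n"),
    SimpleNamespace(text="This is another sentence.\n"),
]
kept = drop_blank_sentences(case1)
for index, sent in enumerate(kept, start=1):
    print(f"{index}: {sent.text.strip()}")
```

Counting `len(kept)` instead of the raw number of sentences keeps code that relies on sentence counts independent of how whitespace tokens are segmented.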

8 replies

adrianeboyd Jul 19, 2021

It looks like the v2 parser has some hard-coded handling of whitespace tokens that's been removed from v3. (In general in v3, we've tried to remove the rule-based exceptions from the statistical components so that the pipeline is more flexible overall.)

But given training corpora for the pretrained pipeline that don't contain many whitespace tokens, we need to add exceptions (a related report is #8710) and/or add more data augmentation with whitespace tokens to improve the behavior here.

yangfan0356 Jul 19, 2021
Author

I see, thanks for the explanation. Agreed:)

yangfan0356 Mar 25, 2022
Author

"But given training corpora for the pretrained pipeline that don't contain many whitespace tokens, we need to add exceptions (a related report is #8710) and/or add more data augmentation with whitespace tokens to improve the behavior here."

Hi Adrianeboyd,

Do you know if we have added any exceptions or something else to address this issue in the latest build? Thanks.

Fan

adrianeboyd Mar 25, 2022

We will be adding whitespace augmentation for v3.3.0, see the augmenter in #10170.

yangfan0356 Mar 28, 2022
Author

Thanks a lot, looking forward to that!
