spaCy v3 creates empty sentences that only contain whitespace (e.g., newlines, spaces) with the default dependency parser. #8715
Replies: 2 comments 9 replies
-
Don't think this is a bug per se. You seem to indicate that in these cases you KNOW that every line = a sentence. If that is the case, you can just preprocess the data and create a list of sentences by splitting on newlines and filtering out empty entries.
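A minimal sketch of that preprocessing step, assuming every non-blank line really is one sentence. The helper name `lines_to_sentences` is made up for illustration and is not part of spaCy, and the sample text only approximates the one in the question:

```python
from typing import List


def lines_to_sentences(text: str) -> List[str]:
    """Split text on line breaks and drop empty or whitespace-only lines."""
    # str.splitlines() handles "\n", "\r\n" and other line-break characters.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Assumed input: two sentences separated by a whitespace-only line.
text = "This is a sentence.\n  \nThis is another sentence.\n"
sentences = lines_to_sentences(text)
print(sentences)
```

Each string can then be passed to `nlp` on its own (or in bulk through `nlp.pipe(sentences)`). Note that the parser may still split a single line into more than one sentence, so this only helps if the one-line-one-sentence assumption holds for your data.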
-
This is weird, but basically behavior around any kind of "interesting" whitespace is typically going to be a little weird. spaCy's training data has been preprocessed so that there are no newlines, and in many cases no whitespace at all. Because spaCy doesn't throw away any of the input, if you include newlines they have to go somewhere, and it's not clear to me that making them their own "sentence" is wrong. If you don't want to count blank sentences then they're easy to discard in post-processing. Also note that the sentence recognizer (`senter`) is easy to train (the `sentencizer` itself is rule-based), so if you know you want to treat sentence segmentation in cases like this a particular way, you can just prep some training data and train your own model.
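The post-processing route could look roughly like this. The helper `drop_blank_sentences` is a hypothetical name; it only assumes that each sentence object exposes a `.text` attribute, as spaCy `Span` objects do. To keep the snippet runnable without a model installed, it uses `SimpleNamespace` stand-ins instead of real `doc.sents`:

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def drop_blank_sentences(sents: Iterable) -> Iterator:
    """Yield only sentences whose text has at least one non-whitespace character."""
    for sent in sents:
        if sent.text.strip():
            yield sent


# Stand-ins mimicking the three "sentences" reported in Case1 (assumed contents).
fake_sents = [
    SimpleNamespace(text="This is a sentence."),
    SimpleNamespace(text="\n  "),
    SimpleNamespace(text="This is another sentence."),
]

kept = list(drop_blank_sentences(fake_sents))
for index, sent in enumerate(kept, start=1):
    print(f"{index}: {sent.text}")
```

With a real pipeline, the equivalent call would be `list(drop_blank_sentences(doc.sents))`, and `len()` of that list gives a sentence count that ignores whitespace-only sentences.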
-
I noticed that in spaCy v3, a newline (\n) in the input text can produce an extra sentence that contains only whitespace.
Please check the following examples. In Case1, the text is segmented into 3 sentences (incorrect); the second sentence contains a single token made up of "\n" and spaces. In Case2, there are two newlines in between, and the result is 2 sentences (correct). spaCy v2 does not seem to segment text like this: the whitespace token is normally attached to the previous sentence, and I have never seen a sentence containing only whitespace. As a result, spaCy v3 may create extra empty sentences, which can break packages or user code that rely on the sentence count (unless those empty sentences are removed manually afterwards).
Does anyone have a comment on this? Is there any reliable way to avoid creating sentences that only contain whitespace?
Case1:
import spacy
nlp = spacy.load("en_core_web_sm")
text='''This is a sentence.
This is another sentence.
'''
doc = nlp(text)
assert doc.has_annotation("SENT_START")
index = 1
for sent in doc.sents:
    print(f'{index}: {sent.text}')
    index += 1
Result:
1: This is a sentence.
2:
3: This is another sentence.
Case2:
text='''This is a sentence.
This is another sentence.
'''
Result:
1: This is a sentence.
2:
This is another sentence.