TFR inconsistent and wrong break in doc.sents iterator. #12147
The issue here is that TFR breaks at odd places and inconsistently. When the text was short, it worked. I had a long paragraph that I expected it to parse into sentences. I have to use markers in the text to identify split points between XML tags. To do so, I used
How to reproduce the behaviour
Output 1:
Output 2:
Your Environment
Replies: 2 comments 4 replies
Just updated to 3.5.0 and can confirm I'm seeing this same behavior – I had a test for these custom sentence breaks, and before, when giving an index of 7, it would break "I love my girlfriend" into ["I love", "my girlfriend."]. Now it breaks it into ["I love", "my", "girlfriend."].
A big part of the reason is that the provided trained pipelines aren't trained on texts that include XML tags like this, so you'll get fairly unpredictable results. In general it would be better to store this information in some other form than inserting special tokens in the text, especially if you want to use the provided English pipelines and not a custom model that is trained on texts containing these kinds of tokens.
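One way to keep split points out of the text is to store them as character offsets and apply them to the tokens directly. The following is only a minimal sketch, not the original poster's setup: it uses a blank English pipeline (tokenizer only), and the `text` and `split_offsets` values are made up for illustration. With a trained pipeline, boundaries like these would presumably need to be set in a custom component that runs before the parser.

```python
import spacy

# Blank pipeline: tokenizer only, so no trained component overrides the boundaries.
nlp = spacy.blank("en")

text = "I love my girlfriend. She loves hiking"

# Hypothetical out-of-band markers: character offsets where a new segment starts.
split_offsets = {0, text.index("She")}

doc = nlp(text)
for token in doc:
    # Mark every token explicitly so the sentence annotation is complete.
    token.is_sent_start = token.idx in split_offsets

sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because the markers never enter the text, the tokenizer and any trained components only see ordinary English.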
Beyond that, there are a couple things going on:
The default English tokenizer settings split `<pad>` into three tokens `<`, `pad`, `>`. If you want `<pad>` to be a single token, you can add it as a tokenizer exception ("special case"): https://spacy.io/usage/linguistic-features#s…
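As a rough illustration of the special-case approach, here is a small sketch using a blank English pipeline so that no trained model has to be downloaded. The sample sentence is invented for the example; the tokenizer behaviour should be the same in the trained English pipelines, since they share the default English tokenizer settings.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline: default English tokenizer rules, no trained components.
nlp = spacy.blank("en")

sample = "one <pad> two"  # made-up sample text

# Tokens before registering the special case.
before = [t.text for t in nlp(sample)]

# Register "<pad>" as a tokenizer exception so it is kept as one token.
nlp.tokenizer.add_special_case("<pad>", [{ORTH: "<pad>"}])

# Tokens after registering the special case.
after = [t.text for t in nlp(sample)]

print(before)
print(after)
```

The special case only affects tokenization. Trained components downstream still have not seen `<pad>` during training, so their predictions around it may remain unreliable.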