TFR inconsistent and wrong break in doc.sents iterator. #12147

ldmtwo · 2023-01-18T22:10:42Z

The issue here is that the TRF pipeline (en_core_web_trf) breaks sentences at odd places and inconsistently. When the text was short, it worked, but a long paragraph that I expected it to parse into sentences was split incorrectly. I have to use markers in the text to identify split points between XML tags, so I used <pad> as a non-whitespace delimiter that lets me reverse the transformation. For some reason (attention?), the tag gets split apart, leaving fragments such as < and pad> as separate sentences. The SM version (en_core_web_sm) seems to be working. Whitespace doesn't matter, but those tags and other visible characters do. Ideally, I would just like a list of the indices where each sentence starts. I cannot fix the input, and I've already done the work to create a reversible XML text extraction process.

How to reproduce the behaviour

import spacy

## en_core_web_trf fails sometimes
nlp = spacy.load("en_core_web_trf")
## en_core_web_sm works
# nlp = spacy.load("en_core_web_sm")

def get_sentences(text):
    # Run the pipeline and return the sentence spans it predicts
    doc = nlp(text)
    return list(doc.sents)
line = '-'*100
## fails
text = 'Word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word.  The Arbitration Act 1996 will apply. [The person appointed will act as expert and not as arbitrator.] <pad>  [the [exclusive OR non exclusive] jurisdiction of the courts of England and Wales.] <pad>   '
## works
text = 'The Arbitration Act 1996 will apply. [The person appointed will act as expert and not as arbitrator.] <pad>  [the [exclusive OR non exclusive] jurisdiction of the courts of England and Wales.] <pad>   '
for out in get_sentences(text):
    print(f'{line}\n{out}\n')

Output 1:

----------------------------------------------------------------------------------------------------
Word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word.

----------------------------------------------------------------------------------------------------
 The Arbitration Act 1996 will apply.

----------------------------------------------------------------------------------------------------
[The person appointed will act as expert and not as arbitrator.]

----------------------------------------------------------------------------------------------------
<

----------------------------------------------------------------------------------------------------
pad>

----------------------------------------------------------------------------------------------------
 [the [exclusive OR non exclusive] jurisdiction of the courts of England and Wales.]

----------------------------------------------------------------------------------------------------
<

----------------------------------------------------------------------------------------------------
pad>

----------------------------------------------------------------------------------------------------

Output 2:

----------------------------------------------------------------------------------------------------
The Arbitration Act 1996 will apply.

----------------------------------------------------------------------------------------------------
[The person appointed will act as expert and not as arbitrator.] <pad>

----------------------------------------------------------------------------------------------------
 [the [exclusive OR non exclusive] jurisdiction of the courts of England and Wales.] <pad>

----------------------------------------------------------------------------------------------------

Your Environment

Operating System: Mac M1 2022
Python Version Used: Python 3.9.12
spaCy Version Used: just updated today
Environment Information:

Answered by adrianeboyd

Jan 23, 2023


Nickersoft · 2023-01-21T06:47:36Z

Just updated to 3.5.0 and can confirm I'm seeing this same behavior – I had a test for these custom sentence breaks, and before, when giving an index of 7, it would break "I love my girlfriend" into ["I love", "my girlfriend."]. Now it breaks it into ["I love", "my", "girlfriend."].

adrianeboyd · 2023-01-23T09:25:02Z

A big part of the reason is that the provided trained pipelines aren't trained on texts that include XML tags like this, so you'll get fairly unpredictable results. In general it would be better to store this information in some other form than inserting special tokens in the text, especially if you want to use the provided English pipelines and not a custom model that is trained on texts containing these kinds of tokens.

Beyond that, there are a couple things going on:

The default English tokenizer settings split <pad> into three tokens < pad >. If you want <pad> to be a single token, you can add it as a tokenizer exception ("special case"): https://spacy.io/usage/linguistic-features#special-cases. You can also modify prefixes and suffixes if you'd like to do something more general with < and >.
If you know that the token <pad> is never the start of a sentence, you can add a custom component before the parser to preprocess your docs and mark token.is_sent_start = False for all these tokens.
Don't use <pad> with en_core_web_trf because it is a special token for the internal transformer model roberta-base (and for lots of related models). The transformer tokenizer encoding is not able to distinguish special tokens from the same string in the input text, so I suspect this is the source of some of the unexpected results.
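
The first two suggestions above could look roughly like the following. This is only a sketch: the component name pad_never_starts_sentence and the helper configure are made up for illustration, and a blank English pipeline stands in for a trained one so the snippet runs without a model download. With a trained pipeline such as en_core_web_sm, the component would be placed before the parser, as the answer describes.

```python
import spacy
from spacy.language import Language
from spacy.symbols import ORTH

PAD = "<pad>"  # delimiter from the question; avoid it with en_core_web_trf


@Language.component("pad_never_starts_sentence")  # hypothetical component name
def pad_never_starts_sentence(doc):
    # Mark every <pad> token as "not a sentence start" so that later
    # components (e.g. the parser) do not open a sentence there.
    for token in doc:
        if token.text == PAD:
            token.is_sent_start = False
    return doc


def configure(nlp):
    # Tokenizer special case: keep <pad> as one token instead of "<", "pad", ">".
    nlp.tokenizer.add_special_case(PAD, [{ORTH: PAD}])
    # Run before the parser when the pipeline has one.
    if "parser" in nlp.pipe_names:
        nlp.add_pipe("pad_never_starts_sentence", before="parser")
    else:
        nlp.add_pipe("pad_never_starts_sentence")
    return nlp


# Assumption: spacy.blank("en") is a stand-in; with a trained pipeline you
# would pass spacy.load("en_core_web_sm") to configure() instead.
nlp = configure(spacy.blank("en"))
doc = nlp("will apply. <pad> [the courts]")
pad_tokens = [t for t in doc if t.text == PAD]
print([t.text for t in doc])
print([t.is_sent_start for t in pad_tokens])
```

Note that this only prevents a sentence from starting at <pad>; it does not make the model any more familiar with the marker itself.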

You can see all the special tokens (I'm not sure how stable/consistent the transformers API is for all transformer model types, but this works for the current transformers + current en_core_web_trf):
```
print(nlp.get_pipe("transformer").model.tokenizer.special_tokens_map)
```
```
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}
```
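
Given a map like the one printed above, a candidate delimiter can be checked for collisions before it is inserted into the text. The helper below is a hypothetical sketch, not part of spaCy or transformers: it works on a plain dict copied from the output above, and the alternative marker <split> is an arbitrary example. A marker that passes this check is still text the provided pipelines were not trained on.

```python
# Special tokens copied from the output above (roberta-base).
SPECIAL_TOKENS_MAP = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "sep_token": "</s>",
    "pad_token": "<pad>",
    "cls_token": "<s>",
    "mask_token": "<mask>",
}


def colliding_roles(marker, special_tokens_map):
    """Return the special-token roles whose string equals `marker`."""
    roles = []
    for role, value in special_tokens_map.items():
        # Some tokenizers store a list of strings under one key.
        values = value if isinstance(value, (list, tuple)) else [value]
        if marker in values:
            roles.append(role)
    return sorted(roles)


print(colliding_roles("<pad>", SPECIAL_TOKENS_MAP))    # collides with pad_token
print(colliding_roles("<split>", SPECIAL_TOKENS_MAP))  # hypothetical marker, no collision
```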

4 replies

Nickersoft Feb 2, 2023

While this makes sense, what could cause the regression I mentioned that doesn't use any HTML tags in the document? It's been preventing me from upgrading.

rmitsch Feb 3, 2023
Maintainer

Could you provide a minimal, reproducible example so we can look into this?

Nickersoft Mar 26, 2023

@rmitsch Sorry for the delay here – other things on my plate ended up taking priority. I got a working example together!

Repl here: https://replit.com/@Nickersoft/spacy-repro.

Basically, just fork it and run it to see the result for the spaCy 3.4.4 model ("I love" + "my girlfriend"). Then update the pyproject.toml file to use spaCy 3.5.1 and, in the shell, run:

$ poetry update
$ poetry run spacy download en_core_web_md

to get the new 3.5.1 model and rerun.

Now the sentence is broken into:

"I love" + "my" + "girlfriend"

adrianeboyd Mar 29, 2023

It would have been much easier for us to try out the example if you'd pasted a short script in the thread as a code block instead.

I think the parser hasn't seen many examples like this while training, so adding a sentence break in the middle of a phrase leads to unpredictable/unstable results near the new sentence break, even for models trained on the same training data. The sentences in the training data are mostly either headlines (without final punctuation) or full newspaper-ish sentences with final punctuation.

I don't think it makes sense to have a test that depends on this exact output unless you've also pinned the exact spacy model version for your script/package.
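
One way to make such a test fail loudly on an upgrade, rather than on a changed sentence split, is to check the versions first. The sketch below is an assumption about how that could be done, not something from the thread: check_pinned is a hypothetical helper, and a blank pipeline stands in for the pinned model package so the snippet runs on its own.

```python
import spacy


def check_pinned(nlp, spacy_version, model_version):
    """Raise if the running spaCy or pipeline version differs from the pins."""
    expected = (spacy_version, model_version)
    found = (spacy.__version__, nlp.meta["version"])
    if found != expected:
        raise RuntimeError(f"expected versions {expected}, found {found}")


# Assumption: a blank pipeline is a stand-in; a real test would load the
# pinned package (e.g. en_core_web_md) and pass literal version strings.
nlp = spacy.blank("en")
check_pinned(nlp, spacy.__version__, nlp.meta["version"])
```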
