Does en_core_web_trf truncate documents to 512? #9280

ginward · 2021-09-23T04:48:27Z

It seems that the documentation does not specify whether en_core_web_trf truncates texts longer than 512 (or any other transformer limit).

Is there such a truncation being performed?

Answered by adrianeboyd

ginward · 2021-09-23T04:51:18Z

This is related to #6939 and #7094

ginward · 2021-09-23T05:02:27Z

Or is it fine as long as each sentence is shorter than 512?

ginward · 2021-09-23T05:30:37Z

Although it says here that spaCy splits documents longer than 512 into sentences before feeding them to the en_core_web_trf model, I am not sure (from the documentation) whether the en_core_web_trf model uses the transformer to segment sentences. If so, wouldn't documents longer than 512 fail to be split into sentences correctly?

My experiment does not seem to have a problem, though:

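import spacy

# Load the transformer pipeline and merge each entity into a single token.
nlp_ent = spacy.load("en_core_web_trf")
nlp_ent.add_pipe("merge_entities")

# The trailing space keeps repeated sentences from fusing ("house.I love ...").
text = "I love eating burgers. I live in a house. " * 1000
doc = nlp_ent(text)

super_word_ls = []
for s in doc.sents:
    word_ls = []
    for t in s:
        if t.ent_type_:
            # Replace entity tokens with their entity label.
            word_ls.append(t.ent_type_)
        elif t.text.strip():
            # Keep non-whitespace, non-entity tokens unchanged.
            word_ls.append(t.text)
    if word_ls:
        super_word_ls.append(" ".join(word_ls))
print(len(super_word_ls))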

adrianeboyd · 2021-09-23T13:03:14Z

By default the transformer component uses overlapping strided spans (see: https://spacy.io/api/transformer#span_getters) so you can train and predict on longer texts without issues on transformer models that have a fixed max length.

Splitting long documents into sentences is something that happens just during training with the corpus reader, and the sentences come from the underlying corpus annotation, not from a component in the pipeline. Those settings aren't related to what happens when you run en_core_web_trf on a new text.
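
To make the idea of overlapping strided spans concrete, here is a minimal pure-Python sketch of how a long token sequence can be covered by fixed-size windows that overlap. It is an illustration only, not the spacy-transformers implementation: the function name strided_spans is hypothetical, and the defaults window=128 and stride=96 are assumed example values, so check the config of the pipeline you actually load.

```python
from typing import List, Tuple


def strided_spans(n_tokens: int, window: int = 128, stride: int = 96) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows covering n_tokens.

    Hypothetical helper for illustration; window and stride are assumed values.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        # Clip the last window to the end of the document.
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        # Advance by less than the window size so consecutive windows overlap.
        start += stride
    return spans


# A 300-token document is covered by three windows, none longer than 128 tokens.
print(strided_spans(300))
```

Because each window stays below the model's maximum length and the windows together cover every token, nothing has to be truncated; with these assumed values, neighbouring windows share 32 tokens.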
