Setting span_getter for Pretrained Transformer Model. #12712
Replies: 2 comments
-
One possibility seems to be to separate the sentencizer from the remainder of the pipeline and then feed the sentences into nlp.pipe(). Based on my initial experiments, list(nlp.pipe(sentences)) is much faster than [nlp(sent) for sent in sentences], which I hope indicates appropriate batch processing. This also makes it possible to swap out the sentence segmenter. When I last ran an evaluation, spaCy's default sentence segmentation was the worst I tested. From my experiments here, I can tell that sentence segmentation is much better now.
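To illustrate the comparison above, here is a minimal sketch. It uses spacy.blank("en") with a sentencizer as a stand-in for en_core_web_trf so it runs without a model download; the sample text is an assumption, and the speed difference is far more pronounced with a transformer pipeline on GPU.

```python
import spacy

# Lightweight stand-in pipeline: blank English + rule-based sentencizer.
# With en_core_web_trf the same pattern applies, and batching matters more.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "First sentence. Second sentence. Third sentence."
sentences = [sent.text for sent in nlp(text).sents]

# Batched: nlp.pipe() processes the texts together.
batched = list(nlp.pipe(sentences))

# One at a time: each call pays the full per-document overhead.
one_by_one = [nlp(sent) for sent in sentences]

print(len(batched), len(one_by_one))
```

Both lists contain the same Doc objects in the same order; only the batching differs.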
-
Hey awindsor, do I understand it correctly that you would like each
-
I am interested in using the Sentencizer to break a document into sentences before feeding them into the transformer in batches. I see the required components in the documentation, but the documentation encourages the use of the config file. I am not trying to train the model at the moment, just to retrieve the output of the pretrained en_core_web_trf model applied to my sentences.
My current pipeline is:

```python
import spacy

nlp = spacy.load("en_core_web_trf", disable=["parser", "ner"])
nlp.add_pipe("sentencizer")
```
I am hoping I can do it with an nlp.get_pipe('transformer')..... call but cannot see where to add a span getter.
I will either use the "spacy-transformers.sent_spans.v1" or the custom version defined in the documentation that handles long sentences by breaking them into shorter spans.
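For the built-in getter, the config route looks roughly like this excerpt. This mirrors the spacy-transformers documentation; the section path assumes the transformer component keeps its standard name:

```
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
```

With this in place, the transformer embeds each sentence (as produced by the sentencizer) as its own span rather than windowing over the whole document.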
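As a sketch of the second option, here is a span-getter factory modeled on the custom_sent_spans example in the spacy-transformers docs, which splits any sentence longer than max_length tokens into consecutive chunks. The function name, max_length value, and sample text are illustrative; to reference it from a config you would additionally register it via the span_getters registry, which requires spacy-transformers to be installed, so the quick check below uses a plain sentencizer pipeline instead.

```python
import spacy

def configure_custom_sent_spans(max_length: int):
    """Span getter factory: yields sentence spans, splitting any sentence
    longer than max_length tokens into consecutive windows."""
    def get_custom_sent_spans(docs):
        spans = []
        for doc in docs:
            spans.append([])
            for sent in doc.sents:
                # Slice long sentences into windows of at most max_length tokens.
                for start in range(0, len(sent), max_length):
                    end = min(start + max_length, len(sent))
                    spans[-1].append(sent[start:end])
        return spans
    return get_custom_sent_spans

# Quick check with a sentencizer-only pipeline (no model download needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("One two three four five six. Short one.")

get_spans = configure_custom_sent_spans(max_length=4)
spans = get_spans([doc])
for span in spans[0]:
    print(span.text)
```

The first sentence (seven tokens including the period) is split into a four-token and a three-token span, while the short sentence stays whole.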