Setting span_getter for Pretrained Transformer Model. #12712
Replies: 2 comments
-
One possibility seems to be to separate the sentencizer from the remainder of the pipeline and then feed the sentences into nlp.pipe(). Based on my initial experiments, list(nlp.pipe(sentences)) is much faster than [nlp(sent) for sent in sentences], which I hope indicates appropriate batch processing. This also makes it possible to swap out the sentence segmenter. When I last ran an evaluation, spaCy's default sentence segmentation was the worst I tested. From my experiments here, I can tell that sentence segmentation is much better now.
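To illustrate the comparison above, here is a minimal sketch. It uses spacy.blank("en") with a sentencizer as a stand-in for en_core_web_trf so it runs without a model download; the sample text is an assumption, and the speed difference is far more pronounced with a transformer pipeline on GPU.

```python
import spacy

# Lightweight stand-in pipeline: blank English + rule-based sentencizer.
# With en_core_web_trf the same pattern applies, and batching matters more.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "First sentence. Second sentence. Third sentence."
sentences = [sent.text for sent in nlp(text).sents]

# Batched: nlp.pipe() processes the texts together.
batched = list(nlp.pipe(sentences))

# One at a time: each call pays the full per-document overhead.
one_by_one = [nlp(sent) for sent in sentences]

print(len(batched), len(one_by_one))
```

Both lists contain the same Doc objects in the same order; only the batching differs.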
-
Hey awindsor, do I understand it correctly that you would like each
-
I am interested in using the Sentencizer to break a document into sentences before feeding them into the transformer in batches. I see the required components in the documentation, but the documentation encourages the use of the config file. I am not trying to train the model at the moment, just to retrieve the output of the pretrained en_core_web_trf model applied to my sentences.
My current pipeline is:

```python
import spacy

nlp = spacy.load("en_core_web_trf", disable=["parser", "ner"])
nlp.add_pipe("sentencizer")
```
I am hoping I can do it with an nlp.get_pipe('transformer')..... call but cannot see where to add a span getter.
I will either use the "spacy-transformers.sent_spans.v1" or the custom version defined in the documentation that handles long sentences by breaking them into shorter spans.
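For the built-in getter, the config route looks roughly like this excerpt. This mirrors the spacy-transformers documentation; the section path assumes the transformer component keeps its standard name:

```
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
```

With this in place, the transformer embeds each sentence (as produced by the sentencizer) as its own span rather than windowing over the whole document.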
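As a sketch of the second option, here is a span-getter factory modeled on the custom_sent_spans example in the spacy-transformers docs, which splits any sentence longer than max_length tokens into consecutive chunks. The function name, max_length value, and sample text are illustrative; to reference it from a config you would additionally register it via the span_getters registry, which requires spacy-transformers to be installed, so the quick check below uses a plain sentencizer pipeline instead.

```python
import spacy

def configure_custom_sent_spans(max_length: int):
    """Span getter factory: yields sentence spans, splitting any sentence
    longer than max_length tokens into consecutive windows."""
    def get_custom_sent_spans(docs):
        spans = []
        for doc in docs:
            spans.append([])
            for sent in doc.sents:
                # Slice long sentences into windows of at most max_length tokens.
                for start in range(0, len(sent), max_length):
                    end = min(start + max_length, len(sent))
                    spans[-1].append(sent[start:end])
        return spans
    return get_custom_sent_spans

# Quick check with a sentencizer-only pipeline (no model download needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("One two three four five six. Short one.")

get_spans = configure_custom_sent_spans(max_length=4)
spans = get_spans([doc])
for span in spans[0]:
    print(span.text)
```

The first sentence (seven tokens including the period) is split into a four-token and a three-token span, while the short sentence stays whole.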