SpanCategorizer vs TextCategorizer #7977

Phat-Loc · 2021-05-01T17:46:27Z

I have documents that are split into sentences, and I would like to assign a category to each sentence. Can I achieve this with TextCategorizer, or should I wait for SpanCategorizer? If I go down the TextCategorizer route, do I iterate over the sents of the original Doc object, make a new Doc for each sentence, and apply TextCategorizer to each one? I want to retain the source document structure, though. Any suggestions?

Answered by polm

polm · 2021-05-03T05:56:53Z

For reference SpanCategorizer dev is in #6747.

I'm not working on that feature, so I could be wrong, but it sounds like it's not quite developed with your use case in mind: the typical spans are like NER spans, and any proposed span can be given a none-of-the-above category. That doesn't mean it wouldn't work, just that the features it provides might be less important to you. In particular, you have no overlapping spans, and every sentence must have a category label.

What you could do that would work now is just create a doc out of each sentence, and add an original_doc_id as a document attribute. That would allow you to reconstruct the original documents from the sentences, perhaps using Doc.from_docs with some custom handling for category labels.
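A minimal sketch of that idea, assuming a blank English pipeline with a sentencizer stands in for whatever produces your sentence boundaries. The attribute name original_doc_id comes from the suggestion above; the sample texts and the grouping dictionary are made up for illustration, and the classification step itself is left out.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc

# Register the doc-level attribute once; the guard avoids an error on re-import.
if not Doc.has_extension("original_doc_id"):
    Doc.set_extension("original_doc_id", default=None)

# Assumption: a rule-based sentencizer is enough to split the documents.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = [
    "The service was slow. The food was great.",
    "We will come back.",
]

# Turn every sentence into its own Doc and remember where it came from.
sent_docs = []
for doc_id, doc in enumerate(nlp.pipe(texts)):
    for sent in doc.sents:
        sent_doc = sent.as_doc()
        sent_doc._.original_doc_id = doc_id
        sent_docs.append(sent_doc)

# A trained TextCategorizer would be applied to sent_docs here and would
# fill sent_doc.cats; that step is omitted in this sketch.

# Group the sentence docs by their source document, preserving order.
grouped = defaultdict(list)
for sent_doc in sent_docs:
    grouped[sent_doc._.original_doc_id].append(sent_doc)

# Rebuild one Doc per source document. Doc-level custom attributes may not
# be carried over by Doc.from_docs, so the id is kept as the dictionary key.
rebuilt = {doc_id: Doc.from_docs(parts) for doc_id, parts in grouped.items()}
```

The per-sentence labels are not merged automatically, so they would need to be collected from each sentence doc's cats before merging and stored wherever suits you, for example in a list kept alongside each rebuilt Doc.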

lmanhes · 2021-05-06T11:00:29Z

Hi,

After you have trained a TextCategorizer ("my_text_classifier") on sentence-like documents, you can define a custom component such as:

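import numpy as np
from spacy.language import Language
from spacy.tokens import Doc, Span
from spacy.util import minibatch


@Language.factory("my_sent_classifier")
class MySentClassifier:
    """Label each sentence of a Doc with a previously trained TextCategorizer."""

    def __init__(self, name: str, nlp: Language):
        self.nlp = nlp

        # Register the extension only once, so creating the component again
        # does not fail because "my_label" already exists.
        if not Span.has_extension("my_label"):
            Span.set_extension(name="my_label", default=None)
        # "my_text_classifier" must already be in the pipeline.
        self.my_sent_clf = nlp.get_pipe("my_text_classifier")

    def __call__(self, doc: Doc) -> Doc:
        # Requires sentence boundaries to be set by an earlier component.
        sentences = list(doc.sents)

        # Tokenize each sentence as its own Doc, with all components disabled.
        with self.nlp.select_pipes(enable=[]):
            sents_docs = list(self.nlp.pipe(s.text for s in sentences))

        idx = 0
        for batch_sents in minibatch(sents_docs, size=128):
            batch_sents = list(batch_sents)

            # One row of class scores per sentence; keep the best-scoring label.
            scores_prob = self.my_sent_clf.predict(batch_sents)
            scores_idx = np.argmax(scores_prob, -1)

            for i in range(len(batch_sents)):
                sentences[idx + i]._.my_label = self.my_sent_clf.labels[scores_idx[i]]

            idx += len(batch_sents)

        return doc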

This is a temporary hack, and it is fairly slow.
