SpanCategorizer vs TextCategorizer #7977
I have documents that are split into sentences, and I would like to assign a category to each sentence. Can I achieve this with TextCategorizer, or should I wait for SpanCategorizer? If I go down the TextCategorizer route, do I iterate over the sents of the original doc object, make a new doc for each sentence, and apply TextCategorizer to each one? I want to retain the source document structure, though. Any suggestions?
Replies: 2 comments
Hi,

After you have trained a TextCategorizer ("my_text_classifier") on sentence-like documents, you can define a custom component such as:

```python
import numpy as np
from spacy.language import Language
from spacy.tokens import Doc, Span
from spacy.util import minibatch


@Language.factory("my_sent_classifier")
class MySentClassifier:
    def __init__(self, name: str, nlp: Language):
        self.nlp = nlp
        # Register the extension only once so the component can be created again.
        if not Span.has_extension("my_label"):
            Span.set_extension("my_label", default=None)
        # The trained TextCategorizer must already be in the pipeline.
        self.my_sent_clf = nlp.get_pipe("my_text_classifier")

    def __call__(self, doc: Doc) -> Doc:
        sentences = list(doc.sents)
        # Tokenize each sentence as its own Doc, with all other components disabled.
        with self.nlp.select_pipes(enable=[]):
            sents_docs = list(self.nlp.pipe([s.text for s in sentences]))
        idx = 0
        for batch_sents in minibatch(sents_docs, size=128):
            batch_sents = list(batch_sents)
            scores_prob = self.my_sent_clf.predict(batch_sents)
            scores_idx = np.argmax(scores_prob, -1)
            # Store the highest-scoring label on the matching sentence span.
            for i in range(len(batch_sents)):
                sentences[idx + i]._.my_label = self.my_sent_clf.labels[scores_idx[i]]
            idx += len(batch_sents)
        return doc
```

This is a temporary hack, and kind of slow.
For reference, SpanCategorizer development is in #6747.
I'm not working on that feature so I could be wrong, but it sounds like it's not quite developed with your use case in mind: the typical spans are like NER spans, and any proposed span can be given a none-of-the-above category. That doesn't mean it wouldn't work, just that the features it provides might be less important to you. In particular, you have no overlapping spans and every sentence must have a category label.
What you could do that would work now is just create a doc out of each sentence, and add an original_doc_id as a document attribute. That would allow you to reconstruct the original documents from the sentences, perhaps using Doc…
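The one-doc-per-sentence idea could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names `split_into_sentence_docs` and `group_by_original` are made up for this example, the pipeline is a blank English one with the rule-based sentencizer, and the regrouping is a plain dictionary keyed by `original_doc_id` rather than whatever Doc-level method the reply was about to mention.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc

# Custom attribute named after the suggestion above; register it only once.
if not Doc.has_extension("original_doc_id"):
    Doc.set_extension("original_doc_id", default=None)


def split_into_sentence_docs(docs):
    """Hypothetical helper: one standalone Doc per sentence, tagged with its source index."""
    sentence_docs = []
    for doc_id, doc in enumerate(docs):
        for sent in doc.sents:
            sent_doc = sent.as_doc()  # copy the sentence span into its own Doc
            sent_doc._.original_doc_id = doc_id
            sentence_docs.append(sent_doc)
    return sentence_docs


def group_by_original(sentence_docs):
    """Hypothetical helper: map each original_doc_id to its sentence Docs, in order."""
    grouped = defaultdict(list)
    for sent_doc in sentence_docs:
        grouped[sent_doc._.original_doc_id].append(sent_doc)
    return dict(grouped)


# Assumption: a blank pipeline with the sentencizer stands in for your real one.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
docs = list(nlp.pipe(["First sentence. Second one.", "Only one here."]))
sentence_docs = split_into_sentence_docs(docs)
grouped = group_by_original(sentence_docs)
```

Each sentence doc can then be run through a trained TextCategorizer, with the scores read from `doc.cats`, and the `original_doc_id` value tells you which source document a given sentence came from.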