Feature enhancement in large multi-label classification problem #8314

m-alek · 2021-06-09T10:20:54Z

Scenario

I have a multi-label classification problem, where documents need to be assigned 1 or more classes. The specifics are:

200,000 already labeled documents;
the documents contain long textual data (ca. 1-5 pages long);
additionally, the documents have meta-data (mainly categorical);
there are 200 classes (in most cases documents are assigned 1 class; every now and then >= 2).

Question

What is the best practice if one wants to use document meta-data for feature enhancement?

Thoughts

For feature enhancement, I was considering customizing the linear component of the "spacy.TextCatEnsemble.v1" architecture (https://spacy.io/api/architectures#TextCatEnsemble). The default linear model in this architecture is the bag-of-words model (https://spacy.io/api/architectures#TextCatBOW). I was thinking of concatenating a custom linear model to this bag-of-words model, in which I would incorporate the document meta-data (using something like https://github.com/explosion/spaCy/blob/master/spacy/ml/featureextractor.py).
I am also considering including certain class-specific vocabulary or phrases as features in this custom linear model (that is, treating them as meta-data), hoping to emphasize the importance of this vocabulary. I do, however, anticipate some collinearity, because each such feature would then be present in both the custom linear model and the BOW model. As a consequence, I'm not sure whether this move is redundant or whether it makes it possible to increase the emphasis on these features (e.g. by presetting their weights).
Additionally, I was considering splitting the multi-label problem into binary classification problems. The motivation is that many classes have very specific vocabulary, as well as distinctive meta-data that we know a priori to be associated with them. Splitting up the problem would reduce complexity and noise.
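
For illustration only, the binary split mentioned in the last point could be sketched as turning each document's label set into one 0/1 target list per class (often called one-vs-rest or binary relevance). This is a library-agnostic sketch in plain Python; the function name `to_binary_targets` and the example labels are hypothetical and not part of spaCy.

```python
from typing import Dict, List, Set


def to_binary_targets(
    doc_labels: List[Set[str]], classes: List[str]
) -> Dict[str, List[int]]:
    """Return one 0/1 target list per class, aligned with the document order."""
    return {
        cls: [1 if cls in labels else 0 for labels in doc_labels]
        for cls in classes
    }


# Hypothetical labels for three documents; the second one carries two classes.
doc_labels = [{"invoice"}, {"contract", "invoice"}, {"contract"}]
classes = ["contract", "invoice"]

targets = to_binary_targets(doc_labels, classes)
for cls, column in targets.items():
    print(cls, column)
```

Each target list could then be used to train a separate binary classifier, possibly with a class-specific feature set.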

Does anyone have any experience with feature enhancement in this context? Or any thoughts/feedback on my thoughts?

polm · 2021-06-10T06:56:13Z

What is the best practice if one wants to use document meta-data for feature enhancement?

There is basically no good way to do this right now. #8194 has some workarounds, and #8204, #7610, #8187, #7790, #7236, #2253 are related.

2 replies

m-alek Jun 10, 2021
Author

Thank you polm for replying. Some threads (#8194, #8204) clarify to me how to add a custom extension to a doc object before passing it through the 'textcat' component. Effectively we are thereby adding a doc-level attribute.

By way of background, my use case is this: I have PARTIAL information about documents. I wish to add this to the documents as attributes, to help 'nudge' the textcat model. This would be particularly useful for documents that have a 'weak' signal as to which category they belong to. The custom extensions approach you suggest seems to allow adding attributes.
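
A minimal sketch of what adding such a doc-level attribute could look like, assuming spaCy v3 is installed. The attribute name `meta` and the example metadata are hypothetical. Note that this only stores the data on the doc; it does not by itself make any pipeline component use it.

```python
import spacy
from spacy.tokens import Doc

# Register a doc-level custom attribute. force=True only makes the snippet
# safe to re-run in the same session; "meta" is a hypothetical name.
Doc.set_extension("meta", default=None, force=True)

nlp = spacy.blank("en")  # blank pipeline, no trained model needed
doc = nlp.make_doc("Payment is due within 30 days.")

# Attach whatever partial metadata is known for this document.
doc._.meta = {"department": "finance"}

print(doc._.meta)
```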

My follow-up question: Does the textcat component then also use this added data, i.e. the doc-level attribute, when processing a document? I was under the impression that it only uses a tok2vec layer on the document text and combines that with a bag-of-words model (which to me seems to only use trigrams from the document text).

(#7790 seems to imply I have to customize this myself into the textcat component).

polm Jun 10, 2021

Yes, none of these approaches are standard so the default architectures don't deal with them. So instead of the standard textcat you'd have to subclass it or use something custom.
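
As a rough, library-agnostic illustration of what "something custom" might compute, the sketch below concatenates bag-of-words counts with one-hot encoded categorical metadata into a single feature vector. It is not spaCy's textcat implementation; every function name, the vocabulary, and the metadata fields are hypothetical. A metadata field that is unknown for a document simply yields all zeros, which is one way to handle partial information.

```python
from collections import Counter
from typing import Dict, List


def bow_vector(text: str, vocab: List[str]) -> List[float]:
    """Count how often each vocabulary word occurs in the lowercased text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def one_hot_metadata(meta: Dict[str, str], fields: Dict[str, List[str]]) -> List[float]:
    """One-hot encode each categorical field; a missing field stays all zeros."""
    vec: List[float] = []
    for field, values in fields.items():
        vec.extend(1.0 if meta.get(field) == value else 0.0 for value in values)
    return vec


def combined_features(
    text: str, meta: Dict[str, str], vocab: List[str], fields: Dict[str, List[str]]
) -> List[float]:
    """Concatenate text features and metadata features into one vector."""
    return bow_vector(text, vocab) + one_hot_metadata(meta, fields)


# Hypothetical vocabulary and metadata schema.
vocab = ["invoice", "payment", "contract"]
fields = {"department": ["finance", "legal"], "source": ["email", "scan"]}

# Only the department is known for this document (partial metadata).
features = combined_features(
    "Invoice for payment payment", {"department": "finance"}, vocab, fields
)
print(features)
```

A linear layer over such a combined vector would learn separate weights for the text features and the metadata features; inside spaCy, the equivalent would have to be written as a custom model architecture.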
