Feature enhancement in large multi-label classification problem #8314
m-alek
started this conversation in
Help: Best practices
Replies: 1 comment 2 replies
There is basically no good way to do this at the moment. #8194 has some workarounds, and #8204, #7610, #8187, #7790, #7236, #2253 are related.
Scenario
I have a multi-label classification problem, where documents need to be assigned 1 or more classes. The specifics are:
Question
What is the best practice if one wants to use document meta-data for feature enhancement?
Thoughts
For feature enhancement, I was considering customizing the linear component of the "spacy.TextCatEnsemble.v1" architecture (https://spacy.io/api/architectures#TextCatEnsemble). By default, the linear model in this architecture is the bag-of-words model (https://spacy.io/api/architectures#TextCatBOW). I was thinking of concatenating a custom linear model with this bag-of-words model, in which I would incorporate the document's meta-data (using something like https://github.com/explosion/spaCy/blob/master/spacy/ml/featureextractor.py).
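To make the concatenation idea concrete, here is a minimal sketch in plain Python. It is not the spaCy or Thinc API: the function names, the "bow=" and "meta:" prefixes, and the example meta-data keys are all made up for illustration. It only shows bag-of-words counts and meta-data indicators living side by side in one sparse feature vector that a single linear scorer consumes.

```python
from collections import Counter


def bow_features(text):
    """Bag-of-words counts; the 'bow=' prefix is a hypothetical naming scheme."""
    return Counter(f"bow={token}" for token in text.lower().split())


def meta_features(meta):
    """One indicator feature per meta-data key/value pair (hypothetical 'meta:' prefix)."""
    return {f"meta:{key}={value}": 1.0 for key, value in sorted(meta.items())}


def combined_features(text, meta):
    """Concatenate both feature blocks; the prefixes keep the two namespaces apart."""
    features = dict(bow_features(text))
    features.update(meta_features(meta))
    return features


def score(features, weights, bias=0.0):
    """Linear score for one class: bias plus the weighted sum of active features."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in features.items())


# Illustrative document and meta-data (invented for this sketch).
doc_features = combined_features(
    "Invoice overdue invoice", {"source": "email", "dept": "finance"}
)
class_weights = {"bow=invoice": 0.5, "meta:dept=finance": 2.0}
class_score = score(doc_features, class_weights)
```

In an actual pipeline the meta-data would still have to reach the model somehow (for example via custom attributes on the document), which this sketch deliberately leaves out.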
I am also considering including certain class-specific vocabulary or phrases as features in this custom linear model (that is, treating them as meta-data), in the hope of emphasizing the importance of this vocabulary. I do, however, anticipate some collinearity, because the feature would then be present in both the custom linear model and the BOW model. As a consequence, I'm not entirely sure whether this move is redundant or whether it allows me to increase the emphasis on these features (e.g. by presetting their weights).
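The collinearity concern can be illustrated with a toy calculation. This sketch assumes the scores of the two linear components are simply added, which may not match how the ensemble actually combines them; the weights are invented. Under that assumption, only the sum of the two weights for a duplicated feature affects the score, so different splits of the same total are indistinguishable.

```python
def duplicated_feature_score(count, bow_weight, custom_weight):
    """Score contribution of one term that appears in both linear components.

    Assumes the two components' outputs are summed (an assumption of this sketch).
    """
    return bow_weight * count + custom_weight * count


term_count = 3

# Two different ways of splitting the same total weight of 1.25.
split_a = duplicated_feature_score(term_count, bow_weight=0.25, custom_weight=1.0)
split_b = duplicated_feature_score(term_count, bow_weight=1.25, custom_weight=0.0)

# Both splits give the same contribution: only the summed weight matters.
same_contribution = split_a == split_b
```

If this assumption holds, a preset weight in the custom model would act as a starting offset rather than a permanent emphasis, unless it is frozen or otherwise constrained during training.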
Additionally, I was considering splitting the multi-label problem into binary classification problems. The motivation is that many classes have very specific vocabulary, as well as distinctive meta-data that we know a priori to be associated with them. Splitting up the problem would reduce complexity and noise.
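The split I have in mind is roughly the following, again as a plain-Python sketch with invented label names and documents rather than anything spaCy-specific. Each class gets its own binary dataset, where a document is positive if it carries that label and negative otherwise.

```python
def to_binary_problems(examples, labels):
    """Turn multi-label examples into one binary dataset per label.

    examples: list of (text, set_of_labels) pairs
    labels:   iterable of all class names
    returns:  dict mapping each label to a list of (text, is_positive) pairs
    """
    return {
        label: [(text, label in doc_labels) for text, doc_labels in examples]
        for label in labels
    }


# Illustrative data (invented for this sketch).
examples = [
    ("invoice overdue", {"finance"}),
    ("contract renewal invoice", {"finance", "legal"}),
    ("team offsite agenda", set()),
]
binary_problems = to_binary_problems(examples, ["finance", "legal"])
```

Each binary dataset could then be paired with only the vocabulary and meta-data known to be relevant for that class, at the cost of training and maintaining one model per label.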
Does anyone have any experience with feature enhancement in this context? Or any feedback on these ideas?