How to train a TextCategorizer using the entities matched by NER or EntityRuler? #10470
-
I'm trying to understand how to classify a document based on named entities found by earlier pipeline components rather than just the raw text. Say I have a document whose whole text I want to classify as PAYSLIP in a multilabel textcat. With some custom EntityRuler patterns I can easily predict the document's entity labels (e.g. the dollar amounts as MONEY).

My question is: how do I use these labels (stored in doc.ents / Token.ent_type) as features to train a TextCategorizer, so that it only cares whether a token is MONEY and doesn't distinguish between the different quantities ($50, $40, $10) when predicting a category? I.e., how do I classify documents based on token.ent_type and not token.text for all or some of the documents' tokens?

I'm using spaCy 3.2
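For context, a minimal sketch of the kind of EntityRuler setup described above (the pattern and example text here are assumptions for illustration, not taken from the original post):

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical EntityRuler pattern along the lines described above:
# tag dollar amounts such as "$50" as MONEY entities.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MONEY", "pattern": [{"TEXT": "$"}, {"IS_DIGIT": True}]},
])

doc = nlp("Gross pay $50, deductions $40, net pay $10")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('$50', 'MONEY'), ('$40', 'MONEY'), ('$10', 'MONEY')]
```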
Replies: 1 comment
-
This came up in #9776 last year - that thread's long and has a lot of unrelated stuff, so to just pull out the relevant parts...

There is no easy way to do this. While using named entities as features for document classification is done sometimes, it's not very common. In particular, if you're just matching literal strings it probably doesn't provide much over what the text classifier would learn itself, since it will already learn values for all the words it sees.

Some of the ways you could use the entities would be:
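For illustration, one possible approach (a sketch only; the mask_entities helper below is hypothetical and not part of spaCy or the linked thread) is to replace each entity's text with its label before the document reaches the classifier:

```python
import spacy

def mask_entities(doc):
    # Replace each entity span with its label so downstream text
    # classification sees e.g. "MONEY" instead of "$50" or "$40".
    pieces = []
    for token in doc:
        if token.ent_iob_ == "B":        # first token of an entity
            pieces.append(token.ent_type_)
        elif token.ent_iob_ == "O":      # token outside any entity
            pieces.append(token.text)
        # tokens inside an entity (ent_iob_ == "I") are already covered
    return " ".join(pieces)

nlp = spacy.load("en_core_web_sm")
doc = nlp("Net pay of $50 was transferred on 15 March 2022.")
print(mask_entities(doc))
# e.g. "Net pay of MONEY was transferred on DATE ."
```

The masked strings could then be used as the training and prediction texts for a separate pipeline containing the textcat, so the categorizer only sees entity labels like MONEY rather than the literal amounts.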
Since I saw another question about it recently (maybe on Stack Overflow), I spent a little time searching for papers, blog posts, or other reports of using named entities as features for text classification, and I found less than I expected.

This paper from 2017 on the topic was intended to evaluate whether NE features were helpful and concluded they weren't, but the methods they use all seem to be very old - SVM, KNN, Naive Bayes, etc. https://api.semanticscholar.org/CorpusID:54774880

This paper from this year suggested they were helpful, but it seems like a very limited circumstance, and again it's using very old classification methods - no word vectors or LSTMs here, it's SVMs, tf-idf, etc. https://api.semanticscholar.org/CorpusID:238744574

It's also hard to find this topic because NER is sometimes considered to have a subtask of "Named Entity Classification", and most of the search results are documents about that.

(No longer pulling from the old post) To add an extra point, note that spaCy's tok2vec internally uses "word shape" and other features (depending on how you configure it) that learn general properties of the tokens and not just the literal token. See the embedding layer documentation (e.g. MultiHashEmbed) in the spaCy docs.
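To make that concrete, a small sketch (with made-up example text) of the Token.shape_ lexical attribute, one of the features a tok2vec embedding layer can be configured to use; different dollar amounts collapse onto the same shape:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Totals: $50 and $1,234.00 paid in 2022")

# Token.shape_ is a lexical attribute: digits map to "d", lowercase letters
# to "x", uppercase to "X", so different amounts share the same shape.
print([(t.text, t.shape_) for t in doc])
# e.g. ('50', 'dd'), ('1,234.00', 'd,ddd.dd'), ('2022', 'dddd')
```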