How to train a TextCategorizer using the entities matched by NER or EntityRuler? #10470
-
I'm trying to understand how to classify a document based on named entities found by earlier pipeline components rather than just the raw text. Say I have a document whose whole text I want to classify as PAYSLIP in a multilabel textcat. With some custom EntityRuler patterns I can easily predict the document's entity labels (e.g. the dollar amounts as MONEY).

My question is: how do I use these labels (stored in doc.ents / Token.ent_type) as features to train a TextCategorizer, so that it only cares whether a token is MONEY and doesn't distinguish between the different quantities ($50, $40, $10) when predicting a category? I.e., how do I classify documents based on token.ent_type and not token.text for all or some of the documents' tokens?

I'm using spaCy 3.2
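For context, a minimal sketch of the kind of EntityRuler setup described above (the pattern and example text here are assumptions for illustration, not taken from the original post):

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical EntityRuler pattern along the lines described above:
# tag dollar amounts such as "$50" as MONEY entities.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MONEY", "pattern": [{"TEXT": "$"}, {"IS_DIGIT": True}]},
])

doc = nlp("Gross pay $50, deductions $40, net pay $10")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('$50', 'MONEY'), ('$40', 'MONEY'), ('$10', 'MONEY')]
```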
Replies: 1 comment
-
This came up in #9776 last year - that thread's long and has a lot of unrelated stuff, so to just pull out the relevant parts...

There is no easy way to do this. While using named entities as features for document classification is done sometimes, it's not very common. In particular, if you're just matching literal strings it probably doesn't provide much over what the text classifier would learn itself, since it will already learn values for all the words it sees.

Some of the ways you could use the entities would be:
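For illustration, one possible approach (a sketch only; the mask_entities helper below is hypothetical and not part of spaCy or the linked thread) is to replace each entity's text with its label before the document reaches the classifier:

```python
import spacy

def mask_entities(doc):
    # Replace each entity span with its label so downstream text
    # classification sees e.g. "MONEY" instead of "$50" or "$40".
    pieces = []
    for token in doc:
        if token.ent_iob_ == "B":        # first token of an entity
            pieces.append(token.ent_type_)
        elif token.ent_iob_ == "O":      # token outside any entity
            pieces.append(token.text)
        # tokens inside an entity (ent_iob_ == "I") are already covered
    return " ".join(pieces)

nlp = spacy.load("en_core_web_sm")
doc = nlp("Net pay of $50 was transferred on 15 March 2022.")
print(mask_entities(doc))
# e.g. "Net pay of MONEY was transferred on DATE ."
```

The masked strings could then be used as the training and prediction texts for a separate pipeline containing the textcat, so the categorizer only sees entity labels like MONEY rather than the literal amounts.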
Since I saw another question about it recently (maybe on Stack Overflow), I spent a little time searching for papers, blog posts, or other reports of using named entities as features for text classification, and I found less than I expected.

This paper from 2017 on the topic was intended to evaluate whether NE features were helpful and concluded they weren't, but the methods they use all seem to be very old - SVM, KNN, Naive Bayes, etc. https://api.semanticscholar.org/CorpusID:54774880

This paper from this year suggested they were helpful, but it seems like a very limited circumstance, and again it's using very old classification methods - no word vectors or LSTMs here, it's SVMs, tf-idf, etc. https://api.semanticscholar.org/CorpusID:238744574

It's also hard to find this topic because NER is sometimes considered to have a subtask of "Named Entity Classification", and most of the search results are documents about that.

(No longer pulling from the old post) To add an extra point, note that spaCy's tok2vec internally uses "word shape" and other features (depending on how you configure it) that learn general properties of the tokens and not just the literal token. See the embedding layer documentation (e.g. MultiHashEmbed) in the spaCy docs.
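To make that concrete, a small sketch (with made-up example text) of the Token.shape_ lexical attribute, one of the features a tok2vec embedding layer can be configured to use; different dollar amounts collapse onto the same shape:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Totals: $50 and $1,234.00 paid in 2022")

# Token.shape_ is a lexical attribute: digits map to "d", lowercase letters
# to "x", uppercase to "X", so different amounts share the same shape.
print([(t.text, t.shape_) for t in doc])
# e.g. ('50', 'dd'), ('1,234.00', 'd,ddd.dd'), ('2022', 'dddd')
```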