Using spancat as a filter for NER #11831
Hi everybody, and thank you for the effort you put into creating this beautiful library.

My use case is peculiar: I have documents with lots of named entities, but only a few of them are categorized as "sensitive" and have to be redacted. At the moment we have an online-learning system in production, where the model we trained (on a standard NER dataset) is actively corrected by human annotators, who most of the time remove false negatives. The model is periodically retrained, but its performance is quite disappointing and is improving very slowly.

What we are thinking of is using an NER model to identify the spans of text, and then a spancat model to classify them as "sensitive"/"not sensitive". Is it another possible use case of the

Thank you in advance if you will spend some time reading this and coming up with an answer.
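For readers following the thread, the two-stage idea described above (detect candidate spans first, then classify each span as sensitive or not) could be sketched roughly as below. This is a library-agnostic illustration in plain Python, not spaCy code: `find_entities` and `is_sensitive` are hypothetical stand-ins for the NER and spancat stages, and the regex and cue words are invented for the example.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def find_entities(text: str) -> List[Span]:
    """Stage 1 stand-in (hypothetical): runs of two or more capitalized words."""
    pattern = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+")
    return [Span(m.start(), m.end(), "PERSON") for m in pattern.finditer(text)]


def is_sensitive(text: str, span: Span, window: int = 20) -> bool:
    """Stage 2 stand-in (hypothetical): cue words in the text just before the span."""
    left_context = text[max(0, span.start - window):span.start].lower()
    return any(cue in left_context for cue in ("patient", "defendant"))


def redact(text: str) -> str:
    """Keep only spans the second stage marks as sensitive, then mask them."""
    sensitive = [s for s in find_entities(text) if is_sensitive(text, s)]
    # Replace from right to left so earlier offsets stay valid.
    for span in sorted(sensitive, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + "[REDACTED]" + text[span.end:]
    return text


sample = "Dr. Maria Rossi examined patient John Doe on Monday."
print(redact(sample))
```

The point of the split is that the first stage only has to find spans, while the second stage sees each span together with its surrounding context, which is where the sensitive/not-sensitive signal presumably lives.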
Replies: 1 comment 1 reply
I imagine that you have many named entities of the same type, like Person, and only some of them are sensitive? That sounds like a high-level distinction that's very hard for NER to learn. I am not sure that running a spancat on the entities would work, but it should be easy enough to try.

Even if you weren't doing something difficult, 120 examples is a very small number. I would recommend you create artificial examples, either with data augmentation or (better) by writing some yourself.
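As one possible way to create artificial examples, a minimal template-based sketch is shown below. It assumes hand-written sentence templates with a `{name}` slot and produces `(text, [(start, end, label)])` tuples with character offsets. The templates, names, and the `SENSITIVE`/`NOT_SENSITIVE` labels are all hypothetical and would need to be replaced with ones that reflect the real documents.

```python
import random
from typing import List, Tuple

Example = Tuple[str, List[Tuple[int, int, str]]]

# Hypothetical templates and names, written by hand for illustration only.
TEMPLATES = [
    ("The patient {name} was admitted yesterday.", "SENSITIVE"),
    ("The report was signed by Dr. {name}.", "NOT_SENSITIVE"),
    ("According to {name}, the hearing is postponed.", "NOT_SENSITIVE"),
    ("The defendant, {name}, declined to comment.", "SENSITIVE"),
]
NAMES = ["John Doe", "Maria Rossi", "Chen Wei", "Amina Diallo"]


def make_examples(n: int, seed: int = 0) -> List[Example]:
    """Fill templates with random names and record the span offsets."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    examples: List[Example] = []
    for _ in range(n):
        template, label = rng.choice(TEMPLATES)
        name = rng.choice(NAMES)
        # Text before the slot is unchanged by formatting, so the slot
        # position in the template is also the span start in the result.
        start = template.index("{name}")
        text = template.format(name=name)
        examples.append((text, [(start, start + len(name), label)]))
    return examples


for text, spans in make_examples(3):
    print(text, spans)
```

Templates like these only help if they resemble the real data, so writing a varied set by hand is likely to matter more than the number of generated sentences.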