Using spancat as a filter for NER #11831

karjudev · 2022-11-18T19:06:02Z

Hi everybody, and thank you for the effort you put into creating this beautiful library.

My use case is peculiar: I have documents with many named entities, but only a few of them are categorized as "sensitive" and have to be redacted. At the moment we have an online-learning system in production, where the model we trained (on a standard NER dataset) is actively corrected by human annotators, who most of the time remove false negatives. The model is periodically retrained, but its performance is quite disappointing and is improving very slowly.

We are thinking of using a NER model to identify the spans of text, and then a spancat model to classify them as "sensitive" or "not sensitive".
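To make the idea concrete, here is a minimal sketch of the two-stage filter in plain Python. It is not spaCy code: the entity offsets stand in for the output of a NER model, and `toy_is_sensitive` is a hypothetical placeholder for whatever span classifier (spancat or otherwise) ends up making the decision. The names `locate`, `redact` and the `[REDACTED]` mask are also assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (start_char, end_char, label): stand-in for spans produced by a NER model.
Span = Tuple[int, int, str]


def locate(text: str, surface: str, label: str) -> Span:
    """Build a span for the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def toy_is_sensitive(text: str, span: Span) -> bool:
    """Hypothetical stage-2 classifier: a PERSON is sensitive when the
    word 'patient' appears in the 10 characters before the span."""
    start, _end, label = span
    return label == "PERSON" and "patient" in text[max(0, start - 10):start]


def redact(
    text: str,
    entities: List[Span],
    is_sensitive: Callable[[str, Span], bool],
    mask: str = "[REDACTED]",
) -> str:
    """Replace only the entities that the classifier flags as sensitive."""
    parts: List[str] = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            continue  # overlaps a span that was already redacted
        if is_sensitive(text, (start, end, label)):
            parts.append(text[cursor:start])
            parts.append(mask)
            cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Dr. Adams examined patient John Doe in Rome."
entities = [
    locate(text, "Adams", "PERSON"),
    locate(text, "John Doe", "PERSON"),
    locate(text, "Rome", "GPE"),
]
redacted = redact(text, entities, toy_is_sensitive)
print(redacted)  # Dr. Adams examined patient [REDACTED] in Rome.
```

The point of the split is that stage 1 only has to find entities, while stage 2 sees each candidate together with its context and decides whether it should be hidden.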

Is this a possible use case for the SpanCategorizer? If not, what other approach could work? Keep in mind that we have very little data (about 120 documents so far, although we expect to reach 300 documents in about a month).

Thank you in advance for taking the time to read this and come up with an answer.

Answered by polm

polm · 2022-11-21T04:00:06Z

I imagine that you have many named entities of the same type, like Person, and only some of them are sensitive? That sounds like a high-level distinction that's very hard for NER to learn. I am not sure that running a spancat on the entities would work, but it should be easy enough to try.

Even if you weren't doing something difficult, 120 examples is a very small number. I would recommend you create artificial examples, either with data augmentation or (better) by writing some yourself.
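As a rough illustration of the data augmentation option, the sketch below swaps the surface form of each annotated entity for a randomly chosen replacement and recomputes the character offsets, so one annotated sentence yields several training variants. It is plain Python and only an assumption about how such augmentation could look: `substitute_entities`, the `NAMES` list and the `SENSITIVE` / `NOT_SENSITIVE` labels are hypothetical, and the function assumes the spans do not overlap.

```python
import random
from typing import List, Tuple

# (start_char, end_char, label); spans are assumed not to overlap.
Span = Tuple[int, int, str]

# Hypothetical replacement pool; in practice this would be a larger list.
NAMES = ["Anna Verdi", "Paolo Neri", "Giulia Conti", "Marco Greco"]


def substitute_entities(
    text: str,
    spans: List[Span],
    replacements: List[str],
    rng: random.Random,
) -> Tuple[str, List[Span]]:
    """Replace each annotated span with a random surface form and
    return the new text together with the shifted, relabelled spans."""
    parts: List[str] = []
    new_spans: List[Span] = []
    cursor = 0      # position in the original text
    new_length = 0  # length of the augmented text built so far
    for start, end, label in sorted(spans):
        parts.append(text[cursor:start])
        new_length += start - cursor
        surface = rng.choice(replacements)
        parts.append(surface)
        new_spans.append((new_length, new_length + len(surface), label))
        new_length += len(surface)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), new_spans


text = "The witness Maria Rossi accused Luca Bianchi."
first = text.index("Maria Rossi")
second = text.index("Luca Bianchi")
spans = [
    (first, first + len("Maria Rossi"), "NOT_SENSITIVE"),
    (second, second + len("Luca Bianchi"), "SENSITIVE"),
]

rng = random.Random(0)  # fixed seed so runs are repeatable
new_text, new_spans = substitute_entities(text, spans, NAMES, rng)
print(new_text)
print(new_spans)
```

Since only the names change while the context stays the same, the labels carry over unchanged. That only helps if sensitivity really does depend on the context rather than on the name itself.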

1 reply

karjudev Nov 21, 2022
Author

> I imagine that you have many named entities of the same type, like Person, and only some of them are sensitive?

That is exactly what I mean. Of the many named entities we have, only some have to be classified as sensitive, depending on the person they refer to. We did not take co-reference resolution into account, since we assume that the "sensitivity" can be inferred from the context.
