NER for partially annotated documents #12239

rverdier65 · 2023-02-06T15:54:57Z

rverdier65
Feb 6, 2023

Hello!

We are using spaCy to learn multiple tasks from a corpus of documents:

Global document labelling (textcat_multilabel)
Document line labelling (spancat)
Named entity recognition (ner)

We have good results for the first two tasks, but we are running into difficulties with NER.

Problem definition

Let’s assume the entities are ENT1 and ENT2.

Our dataset is sparse and some documents are not fully labelled. For a given document, we can have either:

Both ENT1 and ENT2 have been reviewed and annotated (for example, if there are no annotated entries of ENT2 for this document, it means the document does not contain any)
ENT1 or ENT2 (or both) has not been reviewed and annotated (for example, if there are no annotated entries of ENT2 for this document, the document may still contain some; ENT2 simply has not been reviewed for this document, and we have this information)

We know for each document which entities have been reviewed and annotated.

We want to use all documents to learn to predict every entity.

Question

How can we prevent the model from learning not to predict ENT2 in a document where ENT2 has not been annotated, even though the document contains tokens corresponding to ENT2?
Constraint: we still want to use such a document to learn to predict ENT1.

Thank you very much for your help!

Answered by rverdier65

Feb 7, 2023

rmitsch · 2023-02-07T10:21:51Z

rmitsch
Feb 7, 2023
Maintainer

Hi @rverdier65,

"ENT1 and ENT2 that has been reviewed and annotated (for example, if there is no occurrence of ENT2 in the dataset, it means that it could not have one)"

You say "For a given document", but refer to a "dataset" in this sentence. Did you mean to say "document", i.e. "if there is no occurrence of ENT2 in the document"?

In general, could you elaborate on what you mean by "reviewed"?

0 replies

rverdier65 · 2023-02-07T11:59:44Z

rverdier65
Feb 7, 2023
Author

Hi @rmitsch, thank you for your answer.

Indeed, I am referring to the document.
I meant to say "if there are no annotated entries of ENT2 for this document".

What I mean by 'reviewed and annotated' is the following:
For a given entity, for example ENT2, we consider that the document has been 'reviewed and annotated' by a user if the user has exhaustively annotated all occurrences of ENT2 in the document.
There are 2 possibilities:

The document contains some tokens corresponding to ENT2, so all of these tokens have been annotated by the user
The document does not contain any token corresponding to ENT2, so the user just flagged that this document has been reviewed for ENT2 but there is nothing to annotate

So for each document, we know which entities have been 'reviewed and annotated'. Some documents have only ENT1 'reviewed and annotated', others only ENT2, others both, and others neither.

We want to use all documents, even if they are 'partially annotated', i.e. not all the entities have been 'reviewed and annotated' in the document.
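
To make the bookkeeping described above concrete, here is a minimal sketch of how a partially annotated document could be represented, assuming character-offset entity spans and a per-document set of reviewed labels. The names `PartialDoc`, `ALL_LABELS` and `label_mask` are hypothetical and are not part of spaCy.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical label inventory, taken from the example in this thread.
ALL_LABELS = ("ENT1", "ENT2")


@dataclass
class PartialDoc:
    """A training document plus the labels that were exhaustively reviewed."""

    text: str
    # (start_char, end_char, label) triples for the annotated entities.
    entities: list[tuple[int, int, str]] = field(default_factory=list)
    # Labels a reviewer checked exhaustively. Only for these labels does
    # "no annotated span" mean "the document contains no such entity".
    reviewed: frozenset[str] = frozenset()


def label_mask(doc: PartialDoc, labels: tuple[str, ...] = ALL_LABELS) -> list[int]:
    """Return 1 for labels whose annotations can be trusted in this document, else 0."""
    return [1 if label in doc.reviewed else 0 for label in labels]


# A document reviewed for ENT1 only: a missing ENT2 span carries no information.
doc = PartialDoc(
    text="Invoice 42 from ACME",
    entities=[(0, 10, "ENT1")],
    reviewed=frozenset({"ENT1"}),
)
mask = label_mask(doc)
```

With this representation, a document reviewed for no label at all gets an all-zero mask, so it would contribute nothing to any loss that is weighted by the mask.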

Is it clearer?

10 replies

rverdier65 Feb 7, 2023
Author

Thank you for the advice!
Unfortunately, we have this problem for a lot of entities, so we cannot afford to train one model per entity.

Do you have any other advice?

We tried to patch the first backprop here, setting the gradient of the non-annotated entities to 0. It improved our results a lot, but it is not sufficient. And now we are having a lot of difficulty patching this one.
We are thinking about testing a new approach: using the spancat component with Begin and Inside labels for each entity (B-ENT1, I-ENT1, B-ENT2, I-ENT2), and then patching the loss here, which seems easier.
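
The per-label masking idea in the two approaches above can be sketched in plain Python, independent of spaCy. This is only an illustration under stated assumptions: it assumes a multilabel setup with one sigmoid output per label and a summed binary cross-entropy loss, which is not necessarily the loss spaCy's components use internally. The function `masked_bce` and its arguments are hypothetical names.

```python
import math

# Begin/Inside labels for each entity, as proposed in this thread.
LABELS = ("B-ENT1", "I-ENT1", "B-ENT2", "I-ENT2")


def sigmoid(x: float) -> float:
    # Fine for a sketch; very negative inputs would need a stabler form.
    return 1.0 / (1.0 + math.exp(-x))


def masked_bce(logits, targets, reviewed_entities, labels=LABELS):
    """Binary cross-entropy that ignores labels of non-reviewed entities.

    logits, targets: one row per candidate span, one column per label.
    reviewed_entities: entity names reviewed for this document, e.g. {"ENT1"}.
    Returns (loss, grad), where grad is d(loss)/d(logits).
    """
    # A label column is active only if its entity (the part after "B-"/"I-")
    # was exhaustively reviewed in this document.
    mask = [
        1.0 if label.split("-", 1)[1] in reviewed_entities else 0.0
        for label in labels
    ]
    loss = 0.0
    grad = []
    for row_logits, row_targets in zip(logits, targets):
        grad_row = []
        for z, t, m in zip(row_logits, row_targets, mask):
            p = sigmoid(z)
            # Clamp to avoid log(0) in the loss value.
            p_safe = min(max(p, 1e-12), 1.0 - 1e-12)
            loss += -m * (t * math.log(p_safe) + (1 - t) * math.log(1.0 - p_safe))
            # Gradient of BCE w.r.t. the logit is (p - t); masked columns get 0.
            grad_row.append(m * (p - t))
        grad.append(grad_row)
    return loss, grad


# One candidate span whose gold label is B-ENT1, in a document reviewed for ENT1 only.
loss, grad = masked_bce([[0.0, 0.0, 0.0, 0.0]], [[1, 0, 0, 0]], {"ENT1"})
```

In this example the ENT2 columns receive a zero gradient, so the model is neither rewarded nor penalised for what it predicts for ENT2 on that document, while the ENT1 columns are trained as usual.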

What do you think about this?

rmitsch Feb 8, 2023
Maintainer

To back up a bit here, I think this is probably not the best direction to approach this problem from.

Is it possible for you to complete annotations for your set of documents? Hacking around with gradients is a possibility, but it's untested and runs counter to the design goals for the NER architecture. I'd recommend not to do it if it's in any way avoidable.

rverdier65 Feb 8, 2023
Author

Thank you for your suggestion, but no, it is not possible to complete the annotations, because our model needs to be able to learn new tasks quickly without annotating all the documents for these new tasks. It would also be a lot of documents to annotate.
So we will always have some entities that are not annotated in all documents.

We made a POC for the second option and it is very encouraging.

rmitsch Feb 8, 2023
Maintainer

Glad to hear that your POC is going well 🙂 Feel free to post more details here if you want to share your findings with the rest of the community!

rverdier65 Feb 8, 2023
Author

Thank you for your answers! 🙂
