Training NER and TextCategorizer together #12991

darioprencipe · 2023-09-19T17:45:59Z

darioprencipe
Sep 19, 2023

Hello,

I have a dataset of raw bank transactions. My task is to extract the following info from a raw bank transaction description such as "PAGAMENTO ADUE COD. DISP.:0122061307830975 NOME:GOOGLE IRELAND LIMITED - MANDATO:26941599":

counterpart >> GOOGLE IRELAND LIMITED (NER task)
reason >> MANDATO:26941599 (NER task)
type >> DIRECT_DEBIT (textcat task)

I have already trained a model from scratch with a ner (and tok2vec) component to extract the counterpart entity. Setting up the training and test datasets in .spacy format and running the training task with the spaCy CLI and a .cfg file was pretty straightforward.

What I don't get now is how I should go about training ner AND textcat together in the same blank model, provided I have the raw annotations (let's say in a .csv file). Shall I:

build single train.spacy and dev.spacy files from a blank model, assigning both entities and categories to each document, and then run one single training task, using the .cfg to declare which components I want to train?
OR build separate train/test dataset pairs, run 2 separate training tasks, and assemble the models afterward?

The output I imagine is a custom spacy model that allows me to do the following:

import spacy

nlp = spacy.load("training/model-best")

doc = nlp("PAGAMENTO ADUE COD. DISP.:0122061307830975 NOME:GOOGLE IRELAND LIMITED - MANDATO:26941599")

print([(ent.label_, ent.text) for ent in doc.ents])
# should print the 2 entities: ("COUNTERPART", "GOOGLE IRELAND LIMITED") and ("REASON", "MANDATO:26941599")

print(doc.cats)
# should print the category scores, with DIRECT_DEBIT scoring highest

In general, I couldn't find examples of a pipeline where you or other users do exactly this, so I wonder what the best-practice way to approach this problem is.

Thanks a lot!
Dario

danieldk · 2023-09-22T10:49:08Z

danieldk
Sep 22, 2023

If you have both the entity annotations and the text category in the same data (which you seem to have), then it is best to treat this as a single training task. This will allow both the named entity recognizer and the text categorizer to use the same underlying contextual vectors (tok2vec representations).

Training two separate pipelines and merging them is more of a last resort for when the data sets are disjoint, e.g. if you had one document collection with text categories and a completely different document collection with named entity annotations.

6 replies

danieldk Sep 25, 2023

question: assume I have everything in the same .csv. At some point I will have to convert this into .spacy format. Is there any specific data format I should comply with in order to be able to pass both entities and categories in the same conversion loop?

No, you can just create Docs with the contents and set the NER and category annotations.
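A minimal sketch of that conversion, in case it helps. The row layout (text, character offsets plus label, one category per record), the helper names and the second category CARD_PAYMENT are all made up for illustration; only the labels COUNTERPART, REASON and DIRECT_DEBIT come from the thread. spacy.blank("it") is also an assumption, since the pipeline language was not stated.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical label set: only DIRECT_DEBIT appears in the thread.
CATEGORIES = ["DIRECT_DEBIT", "CARD_PAYMENT"]


def find_span(text, substring, label):
    """Hypothetical helper: turn a substring into (start, end, label) offsets."""
    start = text.index(substring)
    return (start, start + len(substring), label)


def rows_to_docbin(nlp, rows):
    """Build a DocBin whose Docs carry both entity and category annotations."""
    doc_bin = DocBin()
    for row in rows:
        doc = nlp.make_doc(row["text"])
        spans = []
        for start, end, label in row["entities"]:
            # char_span returns None if the offsets do not line up with
            # token boundaries; such annotations are skipped here.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        doc.ents = spans
        # One score per known category: 1.0 for the gold label, 0.0 otherwise.
        doc.cats = {cat: float(cat == row["category"]) for cat in CATEGORIES}
        doc_bin.add(doc)
    return doc_bin


text = "PAGAMENTO ADUE COD. DISP.:0122061307830975 NOME:GOOGLE IRELAND LIMITED - MANDATO:26941599"
rows = [
    {
        "text": text,
        "entities": [
            find_span(text, "GOOGLE IRELAND LIMITED", "COUNTERPART"),
            find_span(text, "MANDATO:26941599", "REASON"),
        ],
        "category": "DIRECT_DEBIT",
    }
]

nlp = spacy.blank("it")  # assumption: Italian blank pipeline
doc_bin = rows_to_docbin(nlp, rows)
docs = list(doc_bin.get_docs(nlp.vocab))
```

Calling doc_bin.to_disk("train.spacy") would then write the file that the training config points to. The same loop, run over the held-out rows, would produce dev.spacy.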

question: is there any config you would recommend for training together a NER and textCat on a dataset of semantically-poor text records? The .cfg file I've been using for the NER task so far (with good results) is the following:

If you have a very large number of records (they do not need to be annotated), you could consider using spacy pretrain to pretrain domain-specific vectors: https://spacy.io/usage/embeddings-transformers/#pretraining

I think you could also get some mileage out of inspecting the output of the tokenizer and seeing if you could add some specific rules to improve it. In the example that you provided the tokens were:

In [6]: [token.text for token in doc]
Out[6]: 
['PAGAMENTO',
 'ADUE',
 'COD',
 '.',
 'DISP.:0122061307830975',
 'NOME',
 ':',
 'GOOGLE',
 'IRELAND',
 'LIMITED',
 '-',
 'MANDATO:26941599']

I am not familiar with the data/domain, but I could imagine that if e.g. MANDATO is a good indicator for the class, then the tokenization here would not be optimal. The data could contain many tokens that start with MANDATO: but end in different numbers, and you'd run into data sparseness issues. Whereas if this was tokenized into e.g. MANDATO : 26941599, the model could pick up MANDATO as a good feature across documents.
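As a rough sketch of what such a rule could look like: the default infix patterns can be extended so that a colon between a letter or digit and a following digit is split off as its own token. The regex and the use of spacy.blank("it") are assumptions for illustration and would need checking against the real data.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("it")  # assumption: Italian blank pipeline

# Assumed extra rule: split on ":" when it sits between a letter or digit
# and a digit, as in MANDATO:26941599.
extra_infix = r"(?<=[A-Za-z0-9]):(?=[0-9])"
infixes = list(nlp.Defaults.infixes) + [extra_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

tokens = [token.text for token in nlp("MANDATO:26941599")]
print(tokens)
```

Because the lookbehind requires a letter or digit before the colon, a string like DISP.:0122061307830975 is not affected by this particular rule and would need its own pattern if it should be split too.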

darioprencipe Sep 26, 2023
Author

@danieldk thanks a lot, that definitely helps. I'm not sure I fully understood your point on tokens:

I could imagine that if e.g. MANDATO is a good indicator for class

Correct, it is. And indeed the tokenizer currently doesn't split 'MANDATO:26941599'. However, you then say:

Whereas if this was tokenized into e.g. MANDATO : 26941599, the model could pick up MANDATO as a good feature across documents.

Isn't this exactly how spaCy tokenized it? Following your point, I would have expected the desired token here to be MANDATO:, as it's a common, repeated pattern in the documents and it does help predict a category label.

danieldk Sep 26, 2023

Sorry for the lack of clarity. By convention I put spaces between the tokens in MANDATO : 26941599 to signify that in a better tokenization they'd be separate tokens.

darioprencipe Oct 20, 2023
Author

@danieldk sorry to bother you again, just wondering: if I do pre-training on my own corpus, how and when should I amend the tokenizer to include, e.g., a rule that always splits tokens whenever ":" is found?

Shall I do it right before building the annotated train.spacy and dev.spacy files, when I start from a blank nlp object?

Thanks!

rmitsch Oct 30, 2023

Your corpus should be tokenized consistently, i.e. training and inference data should use the same tokenizer.
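One way to keep things consistent, sketched under assumptions: customize the tokenizer once on the blank nlp object, use that same object to build the .spacy files, and save it so the identical settings can be loaded again later. The colon rule and spacy.blank("it") below are illustrative choices, not something confirmed in this thread.

```python
import tempfile

import spacy
from spacy.util import compile_infix_regex


def make_nlp():
    """Blank pipeline with one assumed extra infix rule for ':' before digits."""
    nlp = spacy.blank("it")
    infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z0-9]):(?=[0-9])"]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return nlp


text = "PAGAMENTO ADUE COD. DISP.:0122061307830975 NOME:GOOGLE IRELAND LIMITED - MANDATO:26941599"

nlp = make_nlp()
with tempfile.TemporaryDirectory() as tmp_dir:
    # The tokenizer rules are serialized together with the pipeline.
    nlp.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

before = [token.text for token in nlp(text)]
after = [token.text for token in reloaded(text)]
print(before == after)
```

The point of the round trip is only to show that the customized rules travel with the saved pipeline, so the Docs used for annotation, pretraining and inference can all come from the same tokenizer.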
