Using multiple SpanCat models in one pipeline #12462

b2m · 2023-03-23T20:01:54Z

So I went down a rabbit hole and I am not sure whether I took a wrong turn, or whether I should continue.

We are trying to train a SpanCat model to detect data about persons (first name, last name, birth name, title, ...) in semi-structured data. Our first experiments looked quite promising, but as we continued annotating, the classes became imbalanced and the model stopped learning some of the underrepresented classes.

To give a concrete example: all persons have a last name, but only some persons have a title. So even if we select only examples containing persons with titles for annotation, the class imbalance will stay or even grow, as each example may contain several persons (one with a title, one or more without).

We start with the following pipeline:

flowchart LR
    text["Text"] --> tok2vec
    spancat --> entities["Entities"]

    subgraph pipeline["Pipeline"]
        direction LR
        tok2vec["Tok2Vec"] --> spancat["SpanCat"]
    end

Turning point 1: rules do not work in our use case

Using rules to detect the underrepresented classes is not satisfactory, as the structure of the data changes quite often. 🐰

Turning point 2: use two separate models.

We then started to split the training data by labels and extracted the underrepresented classes to be trained in a separate model.

This worked quite well. 🥳

Turning point 3: combine the two models in one pipeline

We sourced the two separately trained models into one pipeline.

flowchart LR
    text["Text"] ---> tok2vec
    spancat2 --> entities["Entities"]

    subgraph pipeline["Pipeline"]
        direction LR
        tok2vec["Tok2Vec"] --> spancat1["SpanCat1"] --> spancat2["SpanCat2"]
    end

    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        spancat1_tok2vec["Tok2Vec"] --> component_spancat1["Spancat1"]
    end

    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        spancat2_tok2vec["Tok2Vec"] --> component_spancat2["Spancat2"]
    end

    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2

We soon learned that, for our use case, we have to embed the Tok2Vec layer into each model and omit it from the merged pipeline. 🤓

flowchart LR
    text["Text"] ---> spancat1
    spancat2 --> entities["Entities"]

    subgraph pipeline["Pipeline"]
        direction LR
        spancat1["SpanCat1"] --> spancat2["SpanCat2"]
    end

    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        subgraph component_spancat1["Spancat1"]
            spancat1_tok2vec["Tok2Vec"]
        end
    end

    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        subgraph component_spancat2["Spancat2"]
            spancat2_tok2vec["Tok2Vec"]
        end
    end

    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2

Turning point 4: use separate keys

The next lesson we learned is that if we use multiple SpanCat models in one pipeline, the later model simply overwrites all the predictions from the previous model. As this seems quite unnecessary, we were rather surprised, and we found no obvious way in either the documentation or the code to configure this behavior. 😮

Therefore we started to use separate spans keys for each model (sc_spancat1 and sc_spancat2), and added a custom component at the end of the pipeline to merge the model predictions back to sc.

flowchart LR
    text["Text"] ---> spancat1 
    custom_component --> entities["Entities"]

    subgraph pipeline["Pipeline"]
        direction LR
        spancat1["SpanCat1"] --> spancat2["SpanCat2"]
        spancat2 --> custom_component
        subgraph custom_component["Custom Component"]
            sc
        end
    end

    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        subgraph component_spancat1["Spancat1"]
            direction LR
            spancat1_tok2vec["Tok2Vec"]
            sc_spancat1
        end
    end

    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        subgraph component_spancat2["Spancat2"]
            direction LR
            spancat2_tok2vec["Tok2Vec"]
            sc_spancat2
        end
    end

    sc_spancat1 -.merge-.-> sc
    sc_spancat2 -.merge-.-> sc
    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2
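
A custom merge component like the one in the diagram could look roughly like the sketch below. It is not the component actually used in this thread: the component name `merge_spans_keys`, the constants `SOURCE_KEYS` and `TARGET_KEY`, and the de-duplication rule (same start, end and label) are assumptions made for illustration. Only the spans keys `sc_spancat1`, `sc_spancat2` and `sc` come from the post.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span

# Assumed names: the two per-model keys and the shared target key from the post.
SOURCE_KEYS = ("sc_spancat1", "sc_spancat2")
TARGET_KEY = "sc"


@Language.component("merge_spans_keys")
def merge_spans_keys(doc: Doc) -> Doc:
    """Collect the spans stored under SOURCE_KEYS into doc.spans[TARGET_KEY]."""
    merged = []
    seen = set()
    for key in SOURCE_KEYS:
        if key not in doc.spans:
            continue
        for span in doc.spans[key]:
            # Assumption: spans with identical boundaries and label are duplicates.
            ident = (span.start, span.end, span.label_)
            if ident not in seen:
                seen.add(ident)
                merged.append(span)
    # Assigning a list of spans creates (or replaces) the span group under the key.
    doc.spans[TARGET_KEY] = merged
    return doc


# Usage with hand-made spans instead of real model predictions.
doc = Doc(spacy.blank("en").vocab, words=["Dr.", "Jane", "Doe", "met", "John", "Smith"])
doc.spans["sc_spancat1"] = [
    Span(doc, 1, 2, label="FIRST_NAME"),
    Span(doc, 2, 3, label="LAST_NAME"),
]
doc.spans["sc_spancat2"] = [Span(doc, 0, 1, label="TITLE")]
doc = merge_spans_keys(doc)
print([(span.text, span.label_) for span in doc.spans[TARGET_KEY]])
```

In a real pipeline the component would be added last, for example with `nlp.add_pipe("merge_spans_keys", last=True)`, so that both SpanCat components have already written their keys.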

Turning point 5: how to train SpanCat models with custom keys

When using custom keys for our SpanCat models, spacy train complains with

ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

We tried setting the spans_key in the merged pipeline configuration instead, but this apparently has no effect.

There is no obvious way to tell either prodigy train, prodigy data-to-spacy, or spacy train what custom spans key the model uses. 😓

My current workaround is to set a custom spans_key in the model's config file after it has been trained.

But I am starting to wonder whether I took a wrong turn or missed some configuration option, as this seems like quite a bit of overhead just to use two SpanCat models in one pipeline.

adrianeboyd · 2023-03-28T12:55:34Z

Sorry, this does sound a lot more frustrating than we'd like it to be!

We just updated the docs in #12464 to clarify that doc.spans[spans_key] gets overwritten by spancat. I think there are potentially a lot of edge cases related to merging span groups, so we'd rather leave combining spans up to the user.

It sounds like one of the issues is that prodigy always uses the default key sc. I only checked briefly, but it doesn't look like it's possible to configure this with prodigy data-to-spacy. It might make sense to add this as a feature request for prodigy.

On the spacy train side of things it's possible to configure a custom spans key, but getting all the config details correct is also trickier than it should be.

I can give some suggestions depending on how you want to approach this, but there's not an extremely easy alternative that you overlooked, and you've figured out most of the details above.

If you already have two trained spancat components in separate pipelines that use the default key sc, I would recommend:

Rename the spans_key in the saved pipeline directories by editing both config.cfg and spancat/cfg. (Only spancat/cfg is going to affect the output, but if you don't edit both you will probably be confused when you look back in the future. The plan for spacy v4 is to have this setting only in the config.)

Use replace_listeners when assembling the final pipeline so you don't have to deal with multiple tok2vec components.

[components.spancat1]
source = "/path/to/sc1_training/model-best"
component = "spancat"
replace_listeners = ["model.tok2vec"]

[components.spancat2]
source = "/path/to/sc2_training/model-best"
component = "spancat"
replace_listeners = ["model.tok2vec"]

# however you do this part with a custom component
[components.merge_spans_keys]
factory = "merge_spans_keys"
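
The first recommendation above (renaming spans_key in the saved spancat/cfg) could be scripted roughly as follows. This is only a sketch: it assumes that spancat/cfg is a JSON file with a top-level `spans_key` entry, and `set_spans_key` is a hypothetical helper name, not part of spaCy. The matching edit in config.cfg is not covered here and still has to be made by hand.

```python
import json
from pathlib import Path
from typing import Optional


def set_spans_key(pipeline_dir: str, new_key: str, component: str = "spancat") -> Optional[str]:
    """Set spans_key in <pipeline_dir>/<component>/cfg and return the previous key."""
    cfg_path = Path(pipeline_dir) / component / "cfg"
    # Assumption: the component's cfg file is plain JSON.
    cfg = json.loads(cfg_path.read_text(encoding="utf8"))
    old_key = cfg.get("spans_key")
    cfg["spans_key"] = new_key
    cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf8")
    return old_key


# Example call (path taken from the config above, key name is an assumption):
# set_spans_key("/path/to/sc1_training/model-best", "sc_spancat1")
```

Working on a copy of the model-best directory keeps the original trained pipeline untouched in case the edit needs to be undone.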

For turning point 5, the issue is that the saved training data still uses the default spans key, so the component doesn't find the annotation under the custom key. You can copy or rename the spans key before training instead. A sketch of copying the spans (this just copies and doesn't delete the original spans key):

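import spacy
from spacy.tokens import DocBin

# A blank pipeline is enough here: only its vocab is needed to load the docs.
nlp = spacy.blank("en")
doc_bin = DocBin()
for doc in DocBin().from_disk("/tmp/dataset1-sc.spacy").get_docs(nlp.vocab):
    # Copy the annotation to the custom key; the original "sc" group is kept.
    doc.spans["sc1"] = doc.spans["sc"]
    doc_bin.add(doc)
doc_bin.to_disk("/tmp/dataset1-sc1.spacy")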

For spacy train with a custom spans key you need to edit [components.spancat.spans_key] and also the [training.score_weights] entries to use the modified spans key instead of sc (spans_sc1_f = 1.0).
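
The two config edits described above can be illustrated on a plain nested dict that stands in for the parsed training config. This is a sketch under assumptions: `retarget_spans_key` is a hypothetical helper, the dict below is a heavily reduced stand-in for a real config.cfg, and the `spans_sc_p` / `spans_sc_r` entries are assumed to sit next to `spans_sc_f` as in the default spancat score weights.

```python
from copy import deepcopy


def retarget_spans_key(config: dict, new_key: str, old_key: str = "sc",
                       component: str = "spancat") -> dict:
    """Return a copy of config whose component and score weights use new_key."""
    updated = deepcopy(config)
    updated["components"][component]["spans_key"] = new_key
    weights = updated["training"]["score_weights"]
    prefix = f"spans_{old_key}_"
    for name in list(weights):
        if name.startswith(prefix):
            # e.g. spans_sc_f -> spans_sc1_f
            suffix = name[len(prefix):]
            weights[f"spans_{new_key}_{suffix}"] = weights.pop(name)
    return updated


# Reduced stand-in for the relevant parts of a training config.
config = {
    "components": {"spancat": {"factory": "spancat", "spans_key": "sc"}},
    "training": {
        "score_weights": {"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0}
    },
}
updated = retarget_spans_key(config, "sc1")
print(updated["components"]["spancat"]["spans_key"])
print(sorted(updated["training"]["score_weights"]))
```

The point of the sketch is that the spans key appears in more than one place: changing it only in the component block leaves the score weights pointing at scores for the old key.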

An underlying problem is that the default configs don't support templating, so managing custom spans keys is more difficult than it should be and it's hard to get spacy init config -p spancat or nlp.add_pipe("spancat") to handle all the relevant settings in the config. (We've kind of made this too configurable without good tools to manage the details in the background.)

2 replies

b2m Mar 28, 2023
Author

Dear Adriane,

thank you for your detailed response!

> Sorry, this does sound a lot more frustrating than we'd like it to be!

That was what made me doubt the most... maybe I am getting too spoiled by spaCy's feel-good/batteries-included/everything-well-documented standard =)

Thank you for already having taken the steps to update the documentation!

I will propose a feature request for prodigy data-to-spacy in prodigy's support forum.

Also thank you for the hints on how to configure the training for a custom spans key and how to rewrite the training data.
That was the missing piece in my puzzle...

I think I will stay with rewriting the configuration after training the separate models.
With that I have a consistent system for model1, model2 and the merged pipeline.

Cheers Benjamin

b2m Mar 28, 2023
Author

This is the link to the feature request in the support forum for prodigy.
