Using multiple SpanCat models in one pipeline #12462
-
So I went down a rabbit hole, and I am not sure whether I took a wrong turn or whether I should continue.

We are trying to train a SpanCat model to detect data regarding persons (first name, last name, birth name, title, ...) in semi-structured data. Our first experiments looked quite promising, but as we continued annotating, the classes became imbalanced and the model stopped learning some underrepresented classes. To give a concrete example: all persons have a last name, but only some persons have a title. So even when we select only examples containing persons with titles for annotation, the class imbalance will stay or even continue to grow, as each example may contain several persons (one with a title, one or more without).

We started with the following pipeline:

```mermaid
flowchart LR
    text["Text"] --> tok2vec
    spancat --> entities["Entities"]
    subgraph pipeline["Pipeline"]
        direction LR
        tok2vec["Tok2Vec"] --> spancat["SpanCat"]
    end
```
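To make the imbalance argument concrete, here is a small counting sketch. The label names and the toy data are hypothetical and not taken from our real annotations; each example is a list of persons, and each person is the list of labels annotated for them.

```python
from collections import Counter

# Hypothetical toy data: every example contains exactly one person with a title.
examples = [
    [["TITLE", "FIRST_NAME", "LAST_NAME"], ["FIRST_NAME", "LAST_NAME"], ["FIRST_NAME", "LAST_NAME"]],
    [["TITLE", "LAST_NAME"], ["FIRST_NAME", "LAST_NAME"]],
    [["TITLE", "FIRST_NAME", "LAST_NAME"], ["LAST_NAME"], ["FIRST_NAME", "LAST_NAME"]],
]


def label_counts(examples):
    """Count how often each label occurs across all persons in all examples."""
    counts = Counter()
    for persons in examples:
        for labels in persons:
            counts.update(labels)
    return counts


counts = label_counts(examples)
# Ratio of the most frequent label to the rare one.
ratio = counts["LAST_NAME"] / counts["TITLE"]
print(counts.most_common(), round(ratio, 2))
```

Even though every example was selected because it contains a title, `LAST_NAME` still occurs far more often than `TITLE`, because the other persons in the same example contribute labels too.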
Turning point 1: rules do not work in our use case

Using rules for detecting the underrepresented classes is not satisfying, as the structure of the data changes quite often. 🐰

Turning point 2: use two separate models

We then started to split the training data by labels and extracted the underrepresented classes to be trained in a separate model. This worked quite well. 🥳

Turning point 3: combine the two models in one pipeline

We sourced the two separately trained models into one pipeline.

```mermaid
flowchart LR
    text["Text"] ---> tok2vec
    spancat2 --> entities["Entities"]
    subgraph pipeline["Pipeline"]
        direction LR
        tok2vec["Tok2Vec"] --> spancat1["SpanCat1"] --> spancat2["SpanCat2"]
    end
    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        spancat1_tok2vec["Tok2Vec"] --> component_spancat1["Spancat1"]
    end
    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        spancat2_tok2vec["Tok2Vec"] --> component_spancat2["Spancat2"]
    end
    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2
```
We soon learned that for our use case we have to embed the Tok2Vec layer into each model and omit it in the merged pipeline. 🤓

```mermaid
flowchart LR
    text["Text"] ---> spancat1
    spancat2 --> entities["Entities"]
    subgraph pipeline["Pipeline"]
        direction LR
        spancat1["SpanCat1"] --> spancat2["SpanCat2"]
    end
    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        subgraph component_spancat1["Spancat1"]
            spancat1_tok2vec["Tok2Vec"]
        end
    end
    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        subgraph component_spancat2["Spancat2"]
            spancat2_tok2vec["Tok2Vec"]
        end
    end
    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2
```
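In code, sourcing both components into one pipeline could look roughly like the sketch below. The two blank pipelines are only stand-ins so that the snippet runs on its own; in practice they would be the trained pipelines loaded with `spacy.load()` from wherever they were saved. The names `spancat1` and `spancat2` follow the diagram.

```python
import spacy

# Stand-ins for the two separately trained pipelines (assumption: each one
# contains a single component called "spancat" with its own embedded tok2vec).
nlp_a = spacy.blank("en")
nlp_a.add_pipe("spancat")
nlp_b = spacy.blank("en")
nlp_b.add_pipe("spancat")

# Source both components into one pipeline under distinct names.
combined = spacy.blank("en")
combined.add_pipe("spancat", source=nlp_a, name="spancat1")
combined.add_pipe("spancat", source=nlp_b, name="spancat2")

print(combined.pipe_names)
```

Because each sourced component carries its own Tok2Vec layer, the combined pipeline does not need a shared `tok2vec` component.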
Turning point 4: use separate keys

The next lesson we learned is that if we use multiple SpanCat models in one pipeline, the later model will just overwrite all the predictions from the previous model. As this seems quite unnecessary, we were quite surprised, and we found no obvious way in either documentation or code to configure this behavior. 😮

Therefore we started to use separate spans keys for each model (`sc_spancat1` and `sc_spancat2` in the diagram below) and merge them into `sc` in a custom component:

```mermaid
flowchart LR
    text["Text"] ---> spancat1
    custom_component --> entities["Entities"]
    subgraph pipeline["Pipeline"]
        direction LR
        spancat1["SpanCat1"] --> spancat2["SpanCat2"]
        spancat2 --> custom_component
        subgraph custom_component["Custom Component"]
            sc
        end
    end
    subgraph pipeline_spancat1["Pipeline SpanCat1"]
        subgraph component_spancat1["Spancat1"]
            direction LR
            spancat1_tok2vec["Tok2Vec"]
            sc_spancat1
        end
    end
    subgraph pipeline_spancat2["Pipeline SpanCat2"]
        subgraph component_spancat2["Spancat2"]
            direction LR
            spancat2_tok2vec["Tok2Vec"]
            sc_spancat2
        end
    end
    sc_spancat1 -.merge-.-> sc
    sc_spancat2 -.merge-.-> sc
    component_spancat1 -.source-.-> spancat1
    component_spancat2 -.source-.-> spancat2
```
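A minimal sketch of such a merging component is shown below. The component name `merge_span_groups` is made up for illustration, the key names follow the diagram, and the hand-made spans merely stand in for real model predictions.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span

# Key names follow the diagram above; adjust them to your own pipeline.
SOURCE_KEYS = ("sc_spancat1", "sc_spancat2")
TARGET_KEY = "sc"


@Language.component("merge_span_groups")  # hypothetical component name
def merge_span_groups(doc: Doc) -> Doc:
    """Collect the spans from all source keys into one span group."""
    merged = []
    seen = set()
    for key in SOURCE_KEYS:
        if key not in doc.spans:
            continue
        for span in doc.spans[key]:
            ident = (span.start, span.end, span.label_)
            if ident not in seen:  # skip exact duplicates
                seen.add(ident)
                merged.append(span)
    doc.spans[TARGET_KEY] = merged
    return doc


# Usage with hand-made spans instead of real predictions.
nlp = spacy.blank("en")
nlp.add_pipe("merge_span_groups")
doc = Doc(nlp.vocab, words=["Dr.", "Jane", "Doe"])
doc.spans["sc_spancat1"] = [
    Span(doc, 1, 2, label="FIRST_NAME"),
    Span(doc, 2, 3, label="LAST_NAME"),
]
doc.spans["sc_spancat2"] = [Span(doc, 0, 1, label="TITLE")]
for _name, component in nlp.pipeline:
    doc = component(doc)
print([(span.text, span.label_) for span in doc.spans[TARGET_KEY]])
```

This sketch only copies the spans themselves; anything stored on the original span groups (for example in their `attrs`) is not carried over and would need extra handling.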
Turning point 5: how to train SpanCat models with custom keys

When using custom keys for our SpanCat models, … We tried setting the … There is no obvious way to tell either … My current workaround is to set a custom …

But I am starting to wonder whether I took a wrong turn or missed some configuration option, as this seems like quite a bit of overhead just to use two SpanCat models in one pipeline.
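For reference, a sketch of how a custom key can be set on the component itself, using the `spans_key` setting of the `spancat` factory. The key and component names are the ones from the diagram; the blank pipeline is only there to make the snippet self-contained, and this does not cover the rest of the training config.

```python
import spacy

nlp = spacy.blank("en")
# Each component reads its training annotations from, and writes its
# predictions to, the span group stored under its own key.
nlp.add_pipe("spancat", name="spancat1", config={"spans_key": "sc_spancat1"})
nlp.add_pipe("spancat", name="spancat2", config={"spans_key": "sc_spancat2"})

for name in nlp.pipe_names:
    print(name, nlp.get_pipe(name).key)
```

The training data then has to carry its annotations under the same key, otherwise the component finds nothing to learn from.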
Replies: 1 comment 2 replies
-
Sorry, this does sound a lot more frustrating than we'd like it to be!

We just updated the docs in #12464 to clarify that `doc.spans[spans_key]` gets overwritten by `spancat`. I think there are potentially a lot of edge cases related to merging span groups, so we'd rather leave combining spans up to the user.

It sounds like one of the issues is that prodigy always uses the default key `sc`. I only checked briefly, but it doesn't look like it's possible to configure this with `prodigy data-to-spacy`. It might make sense to add this as a feature request for prodigy.

On the `spacy train` side of things it's possible to configure a custom spans key, but getting all the config details correct is a…

I can give some suggestions depending on how you want to approach this, but there's not an extremely easy alternative that you overlooked, and you've figured out most of the details above.

If you already have two trained …

For step 5, the issue is that the saved training data is still using the default spans key, so it doesn't find the annotation. You can copy or rename the spans key before training instead. A sketch of copying the spans (this just copies and doesn't delete the original spans key):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()
# Read the existing training data, which stores its spans under "sc".
for doc in DocBin().from_disk("/tmp/dataset1-sc.spacy").get_docs(nlp.vocab):
    # Make the same spans available under the custom key "sc1".
    doc.spans["sc1"] = doc.spans["sc"]
    doc_bin.add(doc)
# Write the result to a new file; the original file is left untouched.
doc_bin.to_disk("/tmp/dataset1-sc1.spacy")
```

For …

An underlying problem is that the default configs don't support templating, so managing custom spans keys is more difficult than it should be and it's hard to get …
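The copying approach can be combined with the split by labels described in the question. The following is only a sketch under assumptions: the helper name `split_spans_by_label`, the label names, and the key `sc_rare` are invented for illustration, and the data is built in memory instead of being read from disk.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span


def split_spans_by_label(docs, labels, source_key="sc", target_key="sc1"):
    """Copy only the spans with the given labels from source_key to target_key."""
    doc_bin = DocBin()
    for doc in docs:
        source = doc.spans[source_key] if source_key in doc.spans else []
        # A plain list becomes a new span group under the target key.
        doc.spans[target_key] = [span for span in source if span.label_ in labels]
        doc_bin.add(doc)
    return doc_bin


# In-memory stand-in for annotated training data stored under "sc".
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Dr.", "Jane", "Doe"])
doc.spans["sc"] = [
    Span(doc, 0, 1, label="TITLE"),
    Span(doc, 1, 2, label="FIRST_NAME"),
    Span(doc, 2, 3, label="LAST_NAME"),
]

rare_bin = split_spans_by_label([doc], labels={"TITLE"}, target_key="sc_rare")
restored = list(rare_bin.get_docs(nlp.vocab))[0]
print([(span.text, span.label_) for span in restored.spans["sc_rare"]])
```

The returned `DocBin` can be written with `to_disk()` and used as training data for the component that is configured with the matching spans key. The original `sc` group is kept, as in the copying sketch above.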