spacy train from prodigy data-to-spacy config with en_core_web_trf yields ValueError: Cannot deserialize model: mismatched structure #11849
-
I'm training a `textcat_multilabel` model on top of `en_core_web_trf`. How to reproduce the behaviour (from a notebook):

```python
import spacy
from spacy.cli.train import train

out_path = "./model/out/path"

!prodigy data-to-spacy $out_path --textcat-multilabel dataset_name --base-model en_core_web_trf --eval-split 0.2

train(f"{out_path}/config.cfg",
      out_path,
      use_gpu=1,
      overrides={"paths.train": f"{out_path}/train.spacy",
                 "paths.dev": f"{out_path}/dev.spacy"})

nlp = spacy.load(out_path)
```
This raises:

```
ValueError: Cannot deserialize model: mismatched structure
```

On the basis of this advice, this error can be worked around in either of two ways: […]
However, neither of these is ideal. For instance, getting these models to work with spacy-report requires editing the source code of that package. It seems there's still a bug relating to the frozen components in the call to […]. Here is an example observation from […].
-
Let me first back up a step: your goal is to have a final pipeline with everything from `en_core_web_trf`, plus a `textcat_multilabel` component trained on your data?

If that's the case, then you can train the textcat model separately and "assemble" the final pipeline as the last step. It would look like this:

```
prodigy data-to-spacy out/ --textcat-multilabel dataset_name --eval-split 0.2
spacy train out/config.cfg --paths.train out/train.spacy --paths.dev out/dev.spacy -o training/
```

And then assemble. You can write a config to do this with `spacy assemble`, or do it in Python:
```python
import spacy

nlp = spacy.load("en_core_web_trf")
tcm_nlp = spacy.load("training/model-best")
tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
nlp.to_disk("/path/to/my_combined_pipeline")
```

In addition, the default `textcat_multilabel` settings in the prodigy config that you get with `data-to-spacy` look like this:

```ini
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
```

(Ideally you'd be able to generate this with […].)
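For the `spacy assemble` route, a minimal sketch of an assembly config follows; the file name `assemble.cfg` and the `training/model-best` path are assumptions here, so adjust them to your own output directories. It sources everything from the base model and adds the separately trained textcat component, replacing its listener in the same step:

```ini
# assemble.cfg (sketch, with assumed paths): source the base-model
# components, then add the separately trained textcat_multilabel.
[nlp]
lang = "en"
pipeline = ["transformer","tagger","parser","attribute_ruler","lemmatizer","ner","textcat_multilabel"]

[components]

[components.transformer]
source = "en_core_web_trf"

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.attribute_ruler]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[components.ner]
source = "en_core_web_trf"

[components.textcat_multilabel]
source = "training/model-best"
replace_listeners = ["model.tok2vec"]
```

Then `spacy assemble assemble.cfg my_combined_pipeline` would write out the combined pipeline.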
-
No, that's not what I want to do. What I was trying to do in the above code was simply reproduce the […]. Perhaps this conversion would be better suited to […].

As an aside, given what you've mentioned above, what is the difference between calls to […]?

Additionally, the code you suggest doesn't work. It fails with:
```
ValueError                                Traceback (most recent call last)
Input In [13], in <cell line: 5>()
      2 nlp = spacy.load("en_core_web_trf")
      3 tcm_nlp = spacy.load(f"./{EXP_NAME}/model/model-best")
----> 5 tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
      6 nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
      7 nlp.to_disk(f"./{EXP_NAME}/test.cfg")

File ~/anaconda3/envs/prodigy/lib/python3.9/site-packages/spacy/language.py:1969, in Language.replace_listeners(self, tok2vec_name, pipe_name, listeners)
   1962 if tok2vec_name not in self.pipe_names:
   1963     err = Errors.E889.format(
   1964         tok2vec=tok2vec_name,
   1965         name=pipe_name,
   1966         unknown=tok2vec_name,
   1967         opts=", ".join(self.pipe_names),
   1968     )
-> 1969     raise ValueError(err)
   1970 if pipe_name not in self.pipe_names:
   1971     err = Errors.E889.format(
   1972         tok2vec=tok2vec_name,
   1973         name=pipe_name,
   1974         unknown=pipe_name,
   1975         opts=", ".join(self.pipe_names),
   1976     )

ValueError: [E889] Can't replace 'tok2vec' listeners of component 'textcat_multilabel' because 'tok2vec' is not in the pipeline. Available components: textcat_multilabel. If you didn't call nlp.replace_listeners manually, this is likely a bug in spaCy.
```
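The traceback points at the cause: `Language.replace_listeners` first checks that the named upstream component exists in that same pipeline, and the pipeline saved by training here contains only `textcat_multilabel`. A simplified pure-Python illustration of that membership check (this is not spaCy's actual code, just a mirror of the condition visible in the traceback):

```python
def replace_listeners(tok2vec_name, pipe_name, pipe_names):
    # Mirror of the check in Language.replace_listeners shown in the
    # traceback: the upstream tok2vec must exist in the *same* pipeline
    # as the listening component.
    if tok2vec_name not in pipe_names:
        raise ValueError(
            f"[E889] Can't replace '{tok2vec_name}' listeners of component "
            f"'{pipe_name}' because '{tok2vec_name}' is not in the pipeline. "
            f"Available components: {', '.join(pipe_names)}"
        )
    # (The real method would go on to copy the upstream weights.)

# The trained pipeline here contains only the textcat component:
try:
    replace_listeners("tok2vec", "textcat_multilabel", ["textcat_multilabel"])
except ValueError as err:
    print(err)
```

In other words, `replace_listeners` has to be called on a pipeline that actually contains the `tok2vec` (or `transformer`) component the listener points at.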
-
I checked, and using `prodigy train --base-model en_core_web_sm` or `--base-model en_core_web_trf` doesn't affect the default generated textcat config, which is just BOW by default. (As a side note, if the performance is good enough for your use case, then I would recommend just using BOW. It's simple and fast.) `--base-model` might make sense if you're fine-tuning an existing component already in the base model and there isn't a shared tok2vec component, but not in this case with `textcat` and `en_core_web_*`. If you fine-tune an existing shared `tok2vec` or `transformer` for your `textcat` component, it's going to degrade the performance of the other components like `tagger` and `parser`.

If you want to experiment with other textcat architectures, I'd recommend exporting your data from prodigy and training the textcat model separately with spacy. You can generate a transformer-based textcat config with:

```
spacy init config -p textcat_multilabel -G -o accuracy tcm.cfg
```

Train with spacy:

```
spacy train tcm.cfg -g 0 --paths.train train.spacy --paths.dev dev.spacy -o output
```

Then use the […]. If you're using […]
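To make the "just use BOW" recommendation in this thread concrete, here is a minimal pure-Python sketch (not spaCy's actual `TextCatBOW` implementation; the weights and biases below are made up for illustration) of what a unigram bag-of-words multilabel categorizer with a per-label `threshold = 0.5` boils down to:

```python
import math

def bow_features(text):
    # ngram_size = 1: features are just the individual lowercased tokens
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def score_labels(features, weights, biases):
    # One independent linear score per label, squashed into (0, 1)
    scores = {}
    for label, w in weights.items():
        z = biases[label] + sum(w.get(tok, 0.0) * n for tok, n in features.items())
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    return scores

def assign(scores, threshold=0.5):
    # Multilabel: every label clearing the threshold is assigned, so a
    # text can receive zero, one, or several labels.
    return [label for label, s in scores.items() if s >= threshold]

# Hypothetical toy weights, purely for illustration:
weights = {"SPORTS": {"match": 2.0, "goal": 1.5}, "POLITICS": {"vote": 2.0}}
biases = {"SPORTS": -0.5, "POLITICS": -0.5}
feats = bow_features("A late goal decided the match")
print(assign(score_labels(feats, weights, biases)))  # → ['SPORTS']
```

The point of the sketch is that each label is scored independently and compared against the threshold on its own, which is what distinguishes `textcat_multilabel` from the mutually exclusive `textcat`.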