spacy v3 pretrain / static vectors ndim mismatch #7474
adamkgoldfarb started this conversation in Help: Best practices
Replies: 1 comment · 9 replies
-
Hi Adam, sorry to hear you're running into trouble! I could be mistaken, but I don't think this is due to the change in attrs. Could you share the full config file, and the exact command(s) you ran that led to this error? It might also help to share the full stack trace, as that will have additional clues :-)
-
Hello!
I'm doing a bit of pretraining before loading up some GloVe and fastText vectors for training on CPU. I'm hitting a dimension mismatch when training:
Attempt to change dimension 'nM' for model 'static_vectors' from 300 to 0
I haven't changed the config between pretraining and training, but I did change the attrs from ["ORTH","SHAPE"] to ["NORM","PREFIX","SUFFIX","SHAPE"] to align with the recommendations in the documentation. I'm not sure whether that would have any effect, but since I'm not changing the config I assumed it would carry over fine from pretraining to training.
Going to move ahead without the pretraining bit, but I'm wondering if anything springs to mind for you!
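In case it's useful, here's the quick sanity check I can run to confirm that the vectors package I'm training against actually exposes 300-dimensional vectors (just a rough sketch; "vectors_model" below is a placeholder, not my real path):
import spacy
# Load the vectors package used for training ("vectors_model" is a placeholder path).
nlp = spacy.load("vectors_model")
# Expect something like (n_keys, 300); a shape of (0, 0) would mean no static
# vectors are loaded, which is presumably where an nM of 0 could come from.
print(nlp.vocab.vectors.shape)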
Here are the relevant elements of the config, as far as I can tell (let me know if you need more):
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true
[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"
[pretraining.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
get_length = null
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001
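If it's relevant: my understanding is that the static_vectors layer picks up its width from whatever vectors are loaded at initialization, so the other piece of config that presumably matters is the [initialize] block, roughly along these lines (the path here is just a placeholder, not my actual value):
[paths]
vectors = "vectors_model"
[initialize]
vectors = ${paths.vectors}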