Shape mismatch #12555

Gitclop · 2023-04-20T10:39:32Z

Gitclop
Apr 20, 2023

i am sorry i have to open another topic. After i managed to train a ner model with the use of my own wordvectors, i get the following error when calling

nlp(string)

ValueError: Shape mismatch for blis.gemm: (12, 96), (256, 64)

I've used this config-file:

#python -m spacy train NER\full_config.cfg --output ./spacy_pack/training --paths.train NER\ner_train.spacy --paths.dev NER\ner_train.spacy
[paths]
train = null
dev = null
vectors = "C:/spacy_pack/Spacy-Model/"
init_tok2vec = "C:/spacy_pack/Spacy-Model/"

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Answered by rmitsch

Apr 24, 2023

Thanks for the context. I'm afraid we only have two options here (1) retrain all the embedding-using components you want to have in your pipeline (tagger, dependency parser NER) using your converted Gensim embeddings or (2) use two pipelines.

While i can see this is doable with two pipelines, it seems overly complicated.

I can see that it's annoying having to use two pipelines, but I don't think it'll be complicated to implement or use (less so than option (1) for sure). You'd load your tagger/parser pipeline and your NER pipeline and process your docs as nlp_tagger_parser(nlp_ner(text)).

View full answer

rmitsch · 2023-04-20T12:43:01Z

rmitsch
Apr 20, 2023

Hi @Gitclop! I'd recommend setting paths.init_tok2vec = null. Let me know if that works.

0 replies

Gitclop · 2023-04-20T13:05:28Z

Gitclop
Apr 20, 2023
Author

hey @rmitsch!
Same Error: ValueError: Shape mismatch for blis.gemm: (12, 96), (256, 64)

I belive it hase something to do with how i use my wordvectors.
I trained them with gensim word2vec and added them to a vocab like this:

    def addWordVectors(self, f):
        path = self.path
        Vectorfile = f
        gensim_model = Word2Vec.load(self.path_to_w2v+Vectorfile)
        vocab = Vocab()
        # Hinzufügen von Wörtern und Vektoren aus dem gensim-Modell
        for word in gensim_model.wv.index_to_key:
            vector = gensim_model.wv.get_vector(word)
            vocab.set_vector(word, vector)
        return vocab

    def setup(self, f):

        nlp = spacy.load('de_core_news_lg', exclude=['senter', 'ner'])
        senter_source = spacy.load(self.path+'Senter/Senterv1')
        ner_source = spacy.load(self.path+'training/model-best')
        
        nlp.add_pipe('senter', source=senter_source, before='parser')
        nlp.vocab.vectors = self.addWordVectors(f).vectors
        nlp.add_pipe('ner', source=ner_source)

        path = self.path+'Spacy-Model'
        nlp.to_disk(path=path)

        return nlp

I didnt touch the tok2vec component

1 reply

rmitsch Apr 20, 2023

Just swapping out the word vectors won't work. Use spacy init vectors to convert your vectors and then retrain your model.

Gitclop · 2023-04-21T09:48:36Z

Gitclop
Apr 21, 2023
Author

I exported the gensim vectors as .txt file and ran

python -m spacy init vectors de spacy_pack/w2v-model-v6_Line_cbow_mean0_epoch10_neg-5_window20.txt spacy_pack/Spacy-Model --verbose

It updates my model

I train it again, with the same config, load my model with

nlp = spacy.load(path)

and again i get the error: ValueError: Shape mismatch for blis.gemm: (12, 96), (256, 64)

0 replies

Gitclop · 2023-04-21T10:38:53Z

Gitclop
Apr 21, 2023
Author

I was wrong. It is not working.

python -m spacy init vectors de spacy_pack/w2v-model-v6_Line_cbow_mean0_epoch10_neg-5_window20.txt spacy_pack/Spacy-Vectors --verbose
generates a Model with just a vocab folder and some configs.

In the config for the ner-training i use those vectors:

[paths]
train = null
dev = null
vectors = "spacy_pack/Spacy-Vectors/"
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"


[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

i initialize and save my pipeline


    def setup(self):

        nlp = spacy.load('de_core_news_lg', exclude=['senter', 'ner'])

        vector_source = self.path + 'Spacy-Vectors/vocab'
        senter_source = spacy.load(self.path+'Spacy-Senter')
        ner_source = spacy.load(self.path + 'Spacy-Ner')

        nlp.vocab.vectors.from_disk(vector_source)
        nlp.add_pipe('senter', source=senter_source, before='parser')
        nlp.add_pipe('ner', source=ner_source)

        path = self.path+'Spacy-Model'
        nlp.to_disk(path=path)

        return nlp

and i get:
ValueError: Shape mismatch for blis.gemm: (12, 96), (256, 64)

If i build my pipeline without the ner model, the wordvectors are loaded and it works just fine.
If i train my ner model with
vectors = null
i get the same shape msimatch error

I've also tried setting the tok2vec component with

tok2vec_source = spacy.load(self.path + 'Spacy-Ner')
nlp.add_pipe('tok2vec', source=tok2vec_source)

which gives me an errror when initilaizing the pipeline
ValueError: Attempt to change dimension 'nO' for model 'tok2vec-listener' from 256 to 96

Sorry for the confusion, but i am out of ideas :(

1 reply

rmitsch Apr 24, 2023

A pipeline can only have one set of vectors - all components in a pipeline, if they use vectors, have to be configured/trained with this set of vectors.

So I recommend training a new NER model from scratch (sourcing from e. g. de_core_news_lg won't work, as it was trained and configured with a different set of vectors) after you ran spacy init vectors to convert the Gensim vectors. Swapping out the vectors in an existing model won't work.

Gitclop · 2023-04-24T09:39:23Z

Gitclop
Apr 24, 2023
Author

I think thats exactly what i am doing:

spacy init vectors generates my vectors in /Spacy-Vectors. I use those vectors in the full_config for the ner training
vectors = "spacy_pack/Spacy-Vectors/"

Then i build my pipeline with de_core_news_lg as the base and add the vectors to the vocab and my trained ner-modell to the pipeline

def setup(self):

        nlp = spacy.load('de_core_news_lg', exclude=['senter', 'ner'])

        #tok2vec_source = spacy.load(self.path + 'Spacy-Ner')
        vector_source = self.path + 'Spacy-Vectors/vocab'
        senter_source = spacy.load(self.path+'Spacy-Senter')
        ner_source = spacy.load(self.path + 'Spacy-Ner')

        nlp.vocab.vectors.from_disk(vector_source)
        #nlp.add_pipe('tok2vec', source=tok2vec_source)
        nlp.add_pipe('senter', source=senter_source, before='parser')
        nlp.add_pipe('ner', source=ner_source)

        path = self.path+'Spacy-Model'
        nlp.to_disk(path=path)

        return nlp

all components in a pipeline, if they use vectors, have to be configured/trained with this set of vectors.

I use the de_core_news_lg model because i want to use its tagger and dependency-parser how would i train this pipeline with my wordvectors? I don't have any annotated training data, just the corpus for the word2vec algorithm. (And of course annotations for the ner-training)

Or is it just the tok2vec componets that needs to be trained?

(The tok2vec model-file in trained ner-model is 33mb large, compared to 6mb in the Spacy-Model folder)

1 reply

rmitsch Apr 24, 2023

In setup() you're loading a trained model and swapping out (1) the word vectors and (2) adding components from a different (your trained NER) pipeline. I'm recommending training your own pipeline from scratch using the config you posted, adjusted for using the converted Gensim embeddings. I. e. no loading de_core_news_lg and swapping out components.

I use the de_core_news_lg model because i want to use its tagger and dependency-parser how would i train this pipeline with my wordvectors? I don't have any annotated training data, just the corpus for the word2vec algorithm. (And of course annotations for the ner-training)

The related datasets are available for research use, but it's a bit tedious to convert them to the right format. Alternative suggestion: use two pipelines - one with the pretrained tagger and dependency parser components and one with the NER component you trained, based on the Gensim vectors. This simplifies the logistics at train time and makes inference IMO only slightly more inconvenient.

Gitclop · 2023-04-24T12:42:27Z

Gitclop
Apr 24, 2023
Author

My goal ist to extract segments of text around my entities. For example:
"Last friday i started a project on extracting information about my_entity" -> "extracting information about my_entity"
While i can see this is doable with two pipelines, it seems overly complicated.

1 reply

rmitsch Apr 24, 2023

To take a step back - why do you want to use Gensim's embeddings?

Gitclop · 2023-04-24T13:40:57Z

Gitclop
Apr 24, 2023
Author

I work in tech-support in a special domain. The end-goal is to find similar, or duplicate support-tickets. I allready use word embeddings to calculate document similarity. Because of the special domain, there were no meaningfull embeddings within spacy for my corpus. The trained gensim vectors give pretty acurate results now.
Right now i load those vectors into my spacy pipeline, parse the text to filter out stopwords, extract just 'NOUN', 'PROPN' and 'VERB' and generate a document-vector from this "trimmed" text.
One problem left however is the noise within the tickets. Users tend to ramble a lot, so i want to use the NER-Tagger to extract the "essence" of the problem. Again, because of the special domain i can't use the tagger that comes with spacy

3 replies

rmitsch Apr 24, 2023

Thanks for the context. I'm afraid we only have two options here (1) retrain all the embedding-using components you want to have in your pipeline (tagger, dependency parser NER) using your converted Gensim embeddings or (2) use two pipelines.

While i can see this is doable with two pipelines, it seems overly complicated.

I can see that it's annoying having to use two pipelines, but I don't think it'll be complicated to implement or use (less so than option (1) for sure). You'd load your tagger/parser pipeline and your NER pipeline and process your docs as nlp_tagger_parser(nlp_ner(text)).

Answer selected by Gitclop

Gitclop Apr 24, 2023
Author

Thank you for your patience with me, your help is greatly appreciated!
I didn't know i could pass the nlp object to another pipeline and combine the information, thats definitely easier then training the components. And since speed is not critical for the task i can live with 2 pipelines.
Again, thank you very much!

rmitsch Apr 24, 2023

No worries, thanks for your feedback! To be precise, you don't pass the nlp object to another pipeline, but its output - which is always a Doc instance. The second pipeline just updates the Doc instance then.

Uh oh!

Shape mismatch #12555

Uh oh!

Replies: 7 comments · 7 replies

Uh oh!

Uh oh!

Uh oh!

Gitclop Apr 20, 2023 Author

Uh oh!

Uh oh!

Gitclop Apr 21, 2023 Author

Uh oh!

Uh oh!

Gitclop Apr 21, 2023 Author

Uh oh!

Uh oh!

Uh oh!

Gitclop Apr 24, 2023 Author

Uh oh!

Uh oh!

Gitclop Apr 24, 2023 Author

Uh oh!

Uh oh!

Gitclop Apr 24, 2023 Author

Uh oh!

Uh oh!

Gitclop Apr 24, 2023 Author

Uh oh!

Replies: 7 comments 7 replies

Gitclop
Apr 20, 2023
Author

Gitclop
Apr 21, 2023
Author

Gitclop
Apr 21, 2023
Author

Gitclop
Apr 24, 2023
Author

Gitclop
Apr 24, 2023
Author

Gitclop
Apr 24, 2023
Author

Gitclop Apr 24, 2023
Author