config file - [initialize] vectors = null meaning #11575

Jason-B-Jiang · 2022-10-03T17:46:56Z

Hi,

I am trying to train new NER models from scratch, for a number of existing spaCy pipelines (en_core_web_lg, and en_core_sci_lg + en_core_sci_scibert from scispaCy, if you need to know).

To generate the config files for training new NER models in these pipelines, I used the config generation script from Explosion's ner_demo_replace project (https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_replace/scripts/create_config.py). I noticed that in either the [initialize] or the [paths] section of my config files, vectors is given the value 'null'.

Does this mean spaCy is also training the word vectors for these pipelines from scratch? I want to take advantage of the pre-trained embeddings for each model, so this would not be ideal.

Below I am copying and pasting one such config file that I generated using this script.

[paths]
vectors = "output/en_core_sci_lg_vectors"
init_tok2vec = null
parser_tagger_path = "output/en_core_sci_lg_parser_tagger/model-best"
vocab_path = "project_data/vocab_lg.jsonl"
train = null
dev = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","attribute_ruler","lemmatizer","parser","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000

[components]

[components.attribute_ruler]
source = "en_core_sci_lg"

[components.lemmatizer]
source = "en_core_sci_lg"

[components.ner]
source = "en_core_sci_lg"
replace_listeners = ["model.tok2vec"]

[components.parser]
source = "en_core_sci_lg"

[components.tagger]
source = "en_core_sci_lg"

[components.tok2vec]
source = "en_core_sci_lg"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 7
max_steps = 0
eval_frequency = 500
frozen_components = ["tok2vec","tagger","attribute_ruler","lemmatizer","parser"]
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 1
stop = 32
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = null
lemma_acc = 0.5
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
ents_f = 0.5
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_lg"
vocab = "en_core_sci_lg"

[initialize.components]

[initialize.tokenizer]

[vars]
include_static_vectors = "True"

Thank you in advance - cheers! :)

Answered by polm

polm · 2022-10-04T05:39:43Z

> Does this mean spaCy is also training the word vectors for these pipelines from scratch? I want to take advantage of the pre-trained embeddings for each model, so this would not be ideal.

There are two kinds of vectors in spaCy:

- word vectors, where one exists for each different word (or with subword features like in floret), which are used as input to a tok2vec
- tok2vec vectors, which are generated by a CNN tok2vec or transformer layer and used as input for statistical components

The vectors entry in the config, as well as include_static_vectors, refers to word vectors. If vectors is null, spaCy will simply not use word vectors at all, and will use other features of tokens as input to the tok2vec/transformer. spaCy has no built-in features for learning word vectors, though we maintain floret. See this part of the docs or this FAQ.

Note that in generated configs, usually vectors will have a value if you are not using a GPU and choose "accuracy".

In this case it sounds like you want to use whatever you can from the existing pipelines, so I would recommend using the word vectors by simply writing the pipeline name in vectors. Re-using the tok2vec is also possible in general, but that won't work here, since you're sourcing components that need the unmodified tok2vec. Also note that when sourcing components, the source pipeline and the current pipeline need to have the same vectors.
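As a rough illustration of writing the pipeline name in vectors, a config fragment might look like the one below. This is a sketch, not a complete config. It assumes en_core_sci_lg is installed, and the [components.ner.model.tok2vec.embed] block assumes the NER component embeds tokens itself with spacy.MultiHashEmbed.v2. The width, attrs and rows values are illustrative and are not taken from the config in the question.

```ini
[paths]
# Name of an installed pipeline package (or a path) that provides the word vectors
vectors = "en_core_sci_lg"

[initialize]
vectors = ${paths.vectors}

# Assumption: the NER model has its own embedding layer.
# include_static_vectors = true is what makes that layer read the word vectors.
[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true
```

With vectors set but include_static_vectors left at false, the vectors would be loaded into the vocab but the model would not use them as features, so the two settings usually need to be changed together.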

About your config: you're sourcing and then freezing many components, but if you want to train new NER models and add them to that pipeline, I would recommend you train the NER components by themselves, one at a time, with no sourced components, just using the word vectors. Then you can assemble your NER components and the original pipeline into one pipeline; this example project may be helpful.
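One way the assembly step could look is sketched below as a config for spacy assemble. The path training/ner/model-best is a hypothetical output directory for the separately trained NER pipeline, and the [initialize.before_init] block is copied from the config in the question. Treat it as a starting point to check against the example project, not as a tested config.

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","attribute_ruler","lemmatizer","parser","ner"]

[components]

[components.tok2vec]
source = "en_core_sci_lg"

[components.tagger]
source = "en_core_sci_lg"

[components.attribute_ruler]
source = "en_core_sci_lg"

[components.lemmatizer]
source = "en_core_sci_lg"

[components.parser]
source = "en_core_sci_lg"

# Hypothetical path: the NER component trained on its own in a separate run
[components.ner]
source = "training/ner/model-best"

[initialize]

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_lg"
vocab = "en_core_sci_lg"
```

A config like this would be passed to python -m spacy assemble together with an output directory. It only makes sense if the NER component was trained with the same word vectors as en_core_sci_lg, in line with the note above about matching vectors when sourcing.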

Also note that when sourcing, you should replace listeners on any statistical components, which in this case would include the parser, tagger, and (if you want it) existing NER.
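In config terms, this would mean adding the same replace_listeners line that the [components.ner] block in the question's config already has. A sketch for the parser and tagger, assuming both listen to the shared tok2vec under model.tok2vec as the ner does:

```ini
[components.parser]
source = "en_core_sci_lg"
# Give the parser its own copy of the tok2vec it was listening to
replace_listeners = ["model.tok2vec"]

[components.tagger]
source = "en_core_sci_lg"
# Same for the tagger
replace_listeners = ["model.tok2vec"]
```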

1 reply

adrianeboyd Oct 4, 2022

To clarify, in this particular config the vectors are copied as part of the vocab in spacy.copy_from_base_model.v1 in [initialize.before_init] rather than in [initialize.vectors]. The spacy.copy_from_base_model.v1 option also copies some tokenizer settings (and maybe some lookups?) that you wouldn't get automatically with a new config from spacy init config. It's fine to train with this config because the remaining components are frozen.

In particular with the scispacy models, the tokenizer has some custom settings that you want to be sure that you are using in your NER training, since otherwise your component won't work as expected when you add it to the existing pipeline.

I don't know how you ended up with the [vars] section; that may be a mistake or a bug. The include_static_vectors setting will be sourced from the components in the original pipeline.
