Add pretrained CuratedTransformer
to pipeline for inference
#13152
-
I would like to add a pretrained CuratedTransformer component to a pipeline so that I can use it for inference. The config file I have tried is below. When I run it, I get:

Error validating initialization settings in [initialize.components]
transformer -> piece_loader extra fields not permitted
{'encoder_loader': <function build_hf_transformer_encoder_loader_v1.<locals>.load at 0x11b7a4940>, 'piece_loader': <function build_hf_piece_encoder_loader_v1.<locals>.load at 0x11b7a5ab0>}

but I thought I needed the piece loader to initialize the word-piece encoder from the Hub. I guess I've got a misunderstanding somewhere of pipelines and the config system, and I could try doing this another way. Generally, this may seem a strange thing to want to do: I want to investigate the representations (a bit like in minicons) aligned with intuitive tokens from spaCy, alongside other token annotations (lemma, POS), using the nice familiar spaCy API.
config.cfg: attempt at a config file to load "smallbenchnlp/bert-small" from the Hugging Face Hub as the only component in a pipeline, for inference:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["transformer"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.transformer]
factory = "curated_transformer"
all_layer_outputs = true
frozen = true
[components.transformer.model]
@architectures = "spacy-curated-transformers.BertTransformer.v1"
attention_probs_dropout_prob = 0.1
hidden_act = "relu"
hidden_dropout_prob = 0.1
hidden_width = 768
intermediate_width = 3072
layer_norm_eps = 0.0
max_position_embeddings = 512
model_max_length = 2147483647
num_attention_heads = 12
num_hidden_layers = 6
padding_idx = 0
type_vocab_size = 2
vocab_size = 30522
piece_encoder = {"@architectures":"spacy-curated-transformers.BertWordpieceEncoder.v1"}
torchscript = false
mixed_precision = false
wrapped_listener = null
[components.transformer.model.grad_scaler_config]
[components.transformer.model.with_spans]
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"
stride = 96
window = 128
batch_size = 384
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.transformer]
[initialize.components.transformer.encoder_loader]
@model_loaders = "spacy-curated-transformers.HFTransformerEncoderLoader.v1"
name = "smallbenchnlp/bert-small"
revision = "main"
[initialize.components.transformer.piece_loader]
@model_loaders = "spacy-curated-transformers.HFPieceEncoderLoader.v1"
name = "smallbenchnlp/bert-small"
revision = "main"
[initialize.tokenizer]
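For context, the [initialize.components] settings in a config like this are only resolved and validated when the pipeline is initialized. Below is a minimal sketch of doing that from Python, assuming the config above is saved as config.cfg; the original post does not say exactly which command was run, so this is just one way to reach the validation step that produces the error.

from spacy import util

# Load the full config and build an (uninitialized) pipeline from it.
config = util.load_config("config.cfg")
nlp = util.load_model_from_config(config, auto_fill=True)

# nlp.initialize() is where the [initialize] block is resolved, so the
# "Error validating initialization settings in [initialize.components]"
# message is raised at this step; with a corrected config this is also
# where the Hugging Face weights referenced by the loaders are fetched.
nlp.initialize()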
-
The error is because the argument should be piecer_loader rather than piece_loader, i.e. the section should be [initialize.components.transformer.piecer_loader]. I would raise an issue in the spacy-curated-transformers repo, since that naming looks like a bug.
-
We're happy to deal with it here. Oof, good catch: that's definitely a bug. I wonder how this got past the unit tests; we should look into that. We'd appreciate a PR! At first glance it looks like the piece loader argument name is the culprit.
-
To make a pipeline with a CuratedTransformer bert-base-uncased component, with weights from the Hugging Face Hub, using spacy==3.7.2 and spacy-curated-transformers==0.2.1:

1. spacy init config -p curated_transformer config1.cfg
2. Fill config1.cfg appropriately for the model: spacy init fill-curated-transformer --model-name bert-base-uncased config1.cfg config2.cfg
3. spacy init fill-config config2.cfg config.cfg to get the working config.cfg below.

Thanks to explosion/spacy-curated-transformers#22 (comment)

config.cfg:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["curated_transformer"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.curated_transformer]
factory = "curated_transformer"
all_layer_outputs = false
frozen = false
[components.curated_transformer.model]
@architectures = "spacy-curated-transformers.BertTransformer.v1"
piece_encoder = {"@architectures":"spacy-curated-transformers.BertWordpieceEncoder.v1"}
vocab_size = 30522
attention_probs_dropout_prob = 0.1
hidden_act = "gelu"
hidden_dropout_prob = 0.1
hidden_width = 768
intermediate_width = 3072
layer_norm_eps = 0.0
max_position_embeddings = 512
model_max_length = 512
num_attention_heads = 12
num_hidden_layers = 12
padding_idx = 0
type_vocab_size = 2
torchscript = false
mixed_precision = false
wrapped_listener = null
[components.curated_transformer.model.grad_scaler_config]
[components.curated_transformer.model.with_spans]
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"
stride = 96
window = 128
batch_size = 384
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.curated_transformer]
[initialize.components.curated_transformer.encoder_loader]
@model_loaders = "spacy-curated-transformers.HFTransformerEncoderLoader.v1"
name = "bert-base-uncased"
revision = "main"
[initialize.components.curated_transformer.piecer_loader]
@model_loaders = "spacy-curated-transformers.HFPieceEncoderLoader.v1"
name = "bert-base-uncased"
revision = "main"
[initialize.tokenizer]
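Once a working config like the one above has been assembled into a pipeline directory (for example with spacy assemble config.cfg ./bert_pipeline), the transformer output can be inspected alongside the spaCy tokens. Below is a rough sketch of that: the directory name is hypothetical, and the attribute names (Doc._.trf_data, all_outputs, dataXd, lengths) are as I understand them from the spacy-curated-transformers and Thinc docs, so double-check them against the installed version.

import spacy

# Hypothetical output directory produced by: spacy assemble config.cfg ./bert_pipeline
nlp = spacy.load("./bert_pipeline")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# The curated_transformer component stores its output on the Doc.
# all_outputs holds one Ragged array per returned layer (only the last
# layer is kept unless all_layer_outputs = true in the component settings).
output = doc._.trf_data
last_layer = output.all_outputs[-1]

# The Ragged data is a (number of word pieces, hidden_width) array of piece
# representations; the lengths describe how pieces group onto spaCy tokens,
# which is what makes it possible to line them up with token annotations.
print(last_layer.dataXd.shape)
print(last_layer.lengths)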