spacy-transformers not loading tokenizer #10637

@robertsonwang

Description

How to reproduce the behaviour

The tokenizer does not load properly in the TransformerModel object when loading an arbitrary BERT-flavored model from the Hugging Face model repository. My apologies if this is not the right place for this issue; I cannot open an issue on the spacy-transformers repository.

Python example:

from spacy.lang.en import English
from spacy_transformers import Transformer, TransformerModel
from spacy_transformers.annotation_setters import null_annotation_setter
from spacy_transformers.span_getters import get_doc_spans

nlp = English()
model = TransformerModel(
    name="roberta-base",
    get_spans=get_doc_spans,
    tokenizer_config={"use_fast": True},
    transformer_config={},
)
trf = Transformer(
    nlp.vocab,
    model,
    set_extra_annotations=null_annotation_setter,
    max_batch_items=4096,
)
nlp.add_pipe("transformer")

# Cannot properly tokenize text
doc = nlp("Hello")

This yields the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-ebf1789e7412> in <module>
     20 
     21 # Cannot properly tokenize text
---> 22 doc = nlp("Hello")

/usr/local/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
   1020                 raise ValueError(Errors.E109.format(name=name)) from e
   1021             except Exception as e:
-> 1022                 error_handler(name, proc, [doc], e)
   1023             if doc is None:
   1024                 raise ValueError(Errors.E005.format(name=name))

/usr/local/lib/python3.7/site-packages/spacy/util.py in raise_error(proc_name, proc, docs, e)
   1615 
   1616 def raise_error(proc_name, proc, docs, e):
-> 1617     raise e
   1618 
   1619 

/usr/local/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
   1015                 error_handler = proc.get_error_handler()
   1016             try:
-> 1017                 doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
   1018             except KeyError as e:
   1019                 # This typically happens if a component is not initialized

/usr/local/lib/python3.7/site-packages/spacy_transformers/pipeline_component.py in __call__(self, doc)
    190         """
    191         install_extensions()
--> 192         outputs = self.predict([doc])
    193         self.set_annotations([doc], outputs)
    194         return doc

/usr/local/lib/python3.7/site-packages/spacy_transformers/pipeline_component.py in predict(self, docs)
    226             activations = FullTransformerBatch.empty(len(docs))
    227         else:
--> 228             activations = self.model.predict(docs)
    229         batch_id = TransformerListener.get_batch_id(docs)
    230         for listener in self.listeners:

/usr/local/lib/python3.7/site-packages/thinc/model.py in predict(self, X)
    313         only the output, instead of the `(output, callback)` tuple.
    314         """
--> 315         return self._func(self, X, is_train=False)[0]
    316 
    317     def finish_update(self, optimizer: Optimizer) -> None:

/usr/local/lib/python3.7/site-packages/spacy_transformers/layers/transformer_model.py in forward(model, docs, is_train)
    175     if "logger" in model.attrs:
    176         log_gpu_memory(model.attrs["logger"], "begin forward")
--> 177     batch_encoding = huggingface_tokenize(tokenizer, [span.text for span in flat_spans])
    178     wordpieces = WordpieceBatch.from_batch_encoding(batch_encoding)
    179     if "logger" in model.attrs:

/usr/local/lib/python3.7/site-packages/spacy_transformers/layers/transformer_model.py in huggingface_tokenize(tokenizer, texts)
    276         return_tensors="np",
    277         return_token_type_ids=None,  # Sets to model default
--> 278         padding="longest",
    279     )
    280     token_data["input_texts"] = []

TypeError: 'NoneType' object is not callable
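
The TypeError is consistent with the tokenizer attribute never being set: by the time huggingface_tokenize runs, the tokenizer it receives is still None, and calling None produces exactly this error. A minimal, dependency-free sketch of that failure mode (FakeTransformerModel is a hypothetical stand-in for illustration, not the real spacy-transformers class):

```python
# Hypothetical stand-in for the TransformerModel internals: the tokenizer
# attribute stays None because the Hugging Face objects are never loaded,
# so calling it reproduces the same TypeError as in the traceback above.
class FakeTransformerModel:
    def __init__(self):
        self.tokenizer = None  # never replaced with a real tokenizer


model = FakeTransformerModel()
try:
    # huggingface_tokenize() effectively does tokenizer(texts, ...)
    model.tokenizer(["Hello"])
except TypeError as err:
    print(err)  # 'NoneType' object is not callable
```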

The same error occurs when loading a config with equivalent settings. I believe the core issue is that the init call in the TransformerModel object does not properly initialize the transformer model and tokenizer from the model config. The fix would be to replace this line with the following:

hf_model = huggingface_from_pretrained(name, tokenizer_config, transformer_config)

Since huggingface_from_pretrained returns an HFObjects data class, I don't think there are unintended consequences, but I may be missing context. Thanks for your help! Please let me know if I should post this somewhere else.

Your Environment

  • Operating System: macOS
  • Python Version Used: Python 3.7.10
  • spaCy Version Used: spacy v3.2.4, spacy-transformers v1.1.5
  • Environment Information: Darwin-21.4.0-x86_64-i386-64bit