
Fix loading of legacy HuggingFace BERT checkpoints. #10631

Open
drivanov wants to merge 22 commits into pyg-team:master from drivanov:tokenizer

Conversation

@drivanov
Contributor

@drivanov drivanov commented Mar 6, 2026

Some legacy HuggingFace checkpoints, such as prajjwal1/bert-tiny, do not ship a config.json containing the model_type field that recent versions of Transformers require.
As a result, loading them through the Transformers auto classes fails; for example, the AutoTokenizer.from_pretrained() call in TAGDataset raises:

  File "/workspace/examples/llm/glem.py", line 461, in <module>
    main(args)
  File "/workspace/examples/llm/glem.py", line 88, in main
    tag_dataset = TAGDataset(root, dataset, hf_model,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/tag_dataset.py", line 89, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 773, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_tokenizers.py", line 341, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.

This PR adds a small fallback for such models by directly using BertForSequenceClassification, while keeping the default AutoModelForSequenceClassification path for all other models.
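The fallback logic can be sketched as follows. This is an illustrative sketch only, not the actual diff: the choose_loader helper and the inspection of config.json are stand-ins for how the PR dispatches between AutoModelForSequenceClassification and the model-specific BertForSequenceClassification class (which does not need model_type) inside glem.py:

```python
import json

def choose_loader(config_json: str) -> str:
    """Decide which Transformers class to use for a checkpoint,
    based on whether its config.json carries the `model_type` field.

    Illustrative only: real code would call
    AutoModelForSequenceClassification.from_pretrained() or
    BertForSequenceClassification.from_pretrained() instead of
    returning the class name as a string.
    """
    config = json.loads(config_json)
    if "model_type" in config:
        # Modern checkpoints: the auto class can resolve the model type.
        return "AutoModelForSequenceClassification"
    # Legacy checkpoints such as prajjwal1/bert-tiny omit `model_type`;
    # fall back to the BERT-specific class, which does not need it.
    return "BertForSequenceClassification"

modern = '{"model_type": "bert", "hidden_size": 768}'
legacy = '{"hidden_size": 128, "num_hidden_layers": 2}'
print(choose_loader(modern))  # AutoModelForSequenceClassification
print(choose_loader(legacy))  # BertForSequenceClassification
```

The key design point is that model-specific classes like BertForSequenceClassification hard-code the architecture, so they can load a checkpoint whose config.json lacks the model_type field that the auto classes rely on for dispatch.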

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.18%. Comparing base (c211214) to head (990e373).
⚠️ Report is 185 commits behind head on master.

Files with missing lines               | Patch % | Lines
torch_geometric/llm/models/glem.py     | 0.00%   | 6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10631      +/-   ##
==========================================
- Coverage   86.11%   84.18%   -1.93%     
==========================================
  Files         496      510      +14     
  Lines       33655    36017    +2362     
==========================================
+ Hits        28981    30321    +1340     
- Misses       4674     5696    +1022     

☔ View full report in Codecov by Sentry.

@drivanov
Contributor Author

@akihironitta: I see exactly the same issues with the code coverage check as in PR #10623. Otherwise, in my opinion, this PR is ready to merge.
