
Fix loading of legacy HuggingFace BERT checkpoints. #10631

Open
drivanov wants to merge 22 commits into pyg-team:master from drivanov:tokenizer

Conversation

@drivanov
Contributor

@drivanov drivanov commented Mar 6, 2026

Some legacy HuggingFace checkpoints, such as prajjwal1/bert-tiny, do not ship a config.json containing the model_type field that recent versions of Transformers require.
As a result, loading them through the Transformers auto classes fails; for example, the AutoTokenizer.from_pretrained() call in TAGDataset raises:

  File "/workspace/examples/llm/glem.py", line 461, in <module>
    main(args)
  File "/workspace/examples/llm/glem.py", line 88, in main
    tag_dataset = TAGDataset(root, dataset, hf_model,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/tag_dataset.py", line 89, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 773, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_tokenizers.py", line 341, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.

This PR adds a small fallback for such models by directly using BertForSequenceClassification, while keeping the default AutoModelForSequenceClassification path for all other models.
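The fallback logic can be sketched as follows. This is an illustrative sketch only, not the actual diff: the choose_loader helper and the inspection of config.json are stand-ins for how the PR dispatches between AutoModelForSequenceClassification and the model-specific BertForSequenceClassification class (which does not need model_type) inside glem.py:

```python
import json

def choose_loader(config_json: str) -> str:
    """Decide which Transformers class to use for a checkpoint,
    based on whether its config.json carries the `model_type` field.

    Illustrative only: real code would call
    AutoModelForSequenceClassification.from_pretrained() or
    BertForSequenceClassification.from_pretrained() instead of
    returning the class name as a string.
    """
    config = json.loads(config_json)
    if "model_type" in config:
        # Modern checkpoints: the auto class can resolve the model type.
        return "AutoModelForSequenceClassification"
    # Legacy checkpoints such as prajjwal1/bert-tiny omit `model_type`;
    # fall back to the BERT-specific class, which does not need it.
    return "BertForSequenceClassification"

modern = '{"model_type": "bert", "hidden_size": 768}'
legacy = '{"hidden_size": 128, "num_hidden_layers": 2}'
print(choose_loader(modern))  # AutoModelForSequenceClassification
print(choose_loader(legacy))  # BertForSequenceClassification
```

The key design point is that model-specific classes like BertForSequenceClassification hard-code the architecture, so they can load a checkpoint whose config.json lacks the model_type field that the auto classes rely on for dispatch.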

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.18%. Comparing base (c211214) to head (990e373).
⚠️ Report is 185 commits behind head on master.

Files with missing lines               | Patch % | Lines
torch_geometric/llm/models/glem.py     | 0.00%   | 6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10631      +/-   ##
==========================================
- Coverage   86.11%   84.18%   -1.93%     
==========================================
  Files         496      510      +14     
  Lines       33655    36017    +2362     
==========================================
+ Hits        28981    30321    +1340     
- Misses       4674     5696    +1022     

☔ View full report in Codecov by Sentry.

@drivanov
Contributor Author

@akihironitta: I see exactly the same issues with the code coverage check as in PR #10623. Otherwise, in my opinion, this PR is ready to merge.
