Error when converting gpt-oss-20b from Megatron to Hugging Face format #1819

@slpmee

Describe the bug

Exporting a pretrained gpt-oss-20b checkpoint from Megatron to Hugging Face format fails with a KeyError (full traceback below), so the model cannot be exported.
I suspect a mismatch between the Megatron checkpoint structure and the expected Hugging Face model architecture.

Steps/Code to reproduce bug

After pretraining with megatron-bridge, converting the generated distcp checkpoint files fails.
The same error occurs even when running the default conversion script without any modifications (import, then export):

uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py import --hf-model /path/OpenAI/openai_gpt-oss-20b/ --megatron-path ./checkpoints/megatron-checkpoint

uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export --hf-model /path/OpenAI/openai_gpt-oss-20b/ --megatron-path ./checkpoints/megatron-checkpoint --hf-path ./checkpoints/hf-checkpoint

[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 273, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 257, in main
[rank0]: export_megatron_to_hf(
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 183, in export_megatron_to_hf
[rank0]: bridge.export_ckpt(
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 693, in export_ckpt
[rank0]: self.save_hf_pretrained(
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 421, in save_hf_pretrained
[rank0]: self.save_hf_weights(model, path, show_progress, strict)
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 475, in save_hf_weights
[rank0]: self.hf_pretrained.state.source.save_generator(generator, path, strict=strict)
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/hf_pretrained/state.py", line 733, in save_generator
[rank0]: raise KeyError(
[rank0]: KeyError: "Tensor 'model.layers.0.mlp.experts.gate_up_proj' from generator not found in the original model structure. To ignore, set strict=False."
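
The KeyError says the exporter produced a tensor name that the reference Hugging Face checkpoint does not contain. One way to narrow this down is to list which names the reference checkpoint actually stores near the missing one. The sketch below is only a diagnostic aid, not part of Megatron-Bridge: it assumes the directory passed as --hf-model holds a sharded safetensors checkpoint with the usual model.safetensors.index.json file, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_weight_map(hf_dir):
    """Read the tensor-name -> shard-file map of a sharded HF checkpoint.

    Assumption: hf_dir contains model.safetensors.index.json, the index
    file that sharded safetensors checkpoints normally ship with.
    """
    index_path = Path(hf_dir) / "model.safetensors.index.json"
    with index_path.open(encoding="utf-8") as f:
        return json.load(f)["weight_map"]


def find_similar_tensor_names(weight_map, missing_name):
    """Return (exact_match, near_matches) for a tensor name.

    near_matches are stored names that start with missing_name but are
    not equal to it, i.e. the same weight saved under a longer name.
    """
    exact = missing_name in weight_map
    near = sorted(
        name
        for name in weight_map
        if name != missing_name and name.startswith(missing_name)
    )
    return exact, near
```

For this report, the call would be find_similar_tensor_names(load_weight_map("/path/OpenAI/openai_gpt-oss-20b/"), "model.layers.0.mlp.experts.gate_up_proj"). If the exact match is False but near matches are returned, the reference checkpoint stores the expert weights under differently suffixed names than the ones the exporter emits, which would be consistent with the suspected structure mismatch.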

Environment: nvcr.io/nvidia/nemo:25.11.00

Expected behavior
