Description
Describe the bug
Exporting a Megatron checkpoint to Hugging Face format fails with a KeyError, so the model cannot be exported.
I suspect a mismatch between the Megatron checkpoint structure and the expected Hugging Face model architecture.
Steps/Code to reproduce bug
After pretraining with megatron-bridge, converting the generated distcp files fails with an error.
The same error occurs when running the default conversion script without any modifications (import, then export):
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py import --hf-model /path/OpenAI/openai_gpt-oss-20b/ --megatron-path ./checkpoints/megatron-checkpoint
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export --hf-model /path/OpenAI/openai_gpt-oss-20b/ --megatron-path ./checkpoints/megatron-checkpoint --hf-path ./checkpoints/hf-checkpoint
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 273, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 257, in main
[rank0]: export_megatron_to_hf(
[rank0]: File "/opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py", line 183, in export_megatron_to_hf
[rank0]: bridge.export_ckpt(
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 693, in export_ckpt
[rank0]: self.save_hf_pretrained(
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 421, in save_hf_pretrained
[rank0]: self.save_hf_weights(model, path, show_progress, strict)
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 475, in save_hf_weights
[rank0]: self.hf_pretrained.state.source.save_generator(generator, path, strict=strict)
[rank0]: File "/opt/Megatron-Bridge/src/megatron/bridge/models/hf_pretrained/state.py", line 733, in save_generator
[rank0]: raise KeyError(
[rank0]: KeyError: "Tensor 'model.layers.0.mlp.experts.gate_up_proj' from generator not found in the original model structure. To ignore, set strict=False."
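One way to narrow this down is to check which tensor names the original Hugging Face checkpoint actually declares. The sketch below is only a diagnostic and does not use any Megatron-Bridge API. It assumes the checkpoint directory contains a `model.safetensors.index.json` file with a `weight_map` entry, which is the usual layout for sharded safetensors checkpoints. The helper name `find_related_tensor_names` is hypothetical.

```python
import json
from pathlib import Path


def find_related_tensor_names(index_path, tensor_name):
    """Return (exact_match, related_names) for tensor_name in a safetensors index.

    related_names lists every declared tensor whose name starts with tensor_name,
    so differently suffixed variants of the same weight become visible.
    """
    with open(index_path, encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    related = sorted(name for name in weight_map if name.startswith(tensor_name))
    return tensor_name in weight_map, related


if __name__ == "__main__":
    # Directory taken from the commands above; the index filename is an assumption.
    index = Path("/path/OpenAI/openai_gpt-oss-20b/model.safetensors.index.json")
    if index.exists():
        exact, related = find_related_tensor_names(
            index, "model.layers.0.mlp.experts.gate_up_proj"
        )
        print("exact match:", exact)
        for name in related:
            print("  ", name)
```

If the exact name is missing but names with the same prefix are listed, the exporter is probably emitting the expert weights under a different layout than the one stored in the original checkpoint, which would be consistent with the KeyError above.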
Environment: nvcr.io/nvidia/nemo:25.11.00
Expected behavior