
DeepSeek-V2(-Lite) checkpoint converted by megatron-bridge cannot be loaded by Megatron mcore (r0.15): attention LN parameter schema mismatch #1821

Description

@yubin1991

Hi,

I encountered a checkpoint compatibility issue: converting DeepSeek-V2-Lite from HuggingFace format to Megatron mcore format with megatron-bridge succeeds, but loading the result for continued pretraining (CPT) with Megatron-LM core r0.15 fails.

The issue reproduces even with tp=1, pp=1, ep=4.

Convert script:

```bash
python3 -m torch.distributed.run --nproc_per_node=4 \
    examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
    --hf-model-id deepseek-ai/DeepSeek-V2-Lite --tp 1 --pp 1 --ep 4 \
    --megatron-save-path /data/DeepSeek-V2-Lite-tp1-pp1-ep4
```
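
For reference, the parameter names the converter actually wrote can be listed from the saved checkpoint's metadata. Megatron's torch_dist format is a PyTorch distributed checkpoint, so this minimal sketch uses the public torch.distributed.checkpoint API; the iteration subdirectory name below is an assumption and may differ in your setup:

```python
# List the attention parameter names stored in the converted checkpoint.
# The .metadata file of a torch_dist checkpoint can be read offline with
# FileSystemReader, no model build or GPU required.
from torch.distributed.checkpoint import FileSystemReader

ckpt_dir = "/data/DeepSeek-V2-Lite-tp1-pp1-ep4/iter_0000000"  # hypothetical subdir
metadata = FileSystemReader(ckpt_dir).read_metadata()
for key in sorted(metadata.state_dict_metadata):
    if ".self_attention." in key:
        print(key)
```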

When I load the mcore checkpoint with megatron-core 0.15.0, I get the following error:

```
[rank3]: Traceback (most recent call last): (RANK 3)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 192, in reduce_scatter
[rank3]: local_data = map_fun()
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank3]: result = func(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_loader.py", line 266, in local_step
[rank3]: local_plan = planner.create_local_plan()
[rank3]: File "/Megatron-LM-core_r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 647, in create_local_plan
[rank3]: self._validate_global_shapes(self.metadata, self.shapes_validation_sharded_tensors)
[rank3]: File "//Megatron-LM-core_r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 600, in _validate_global_shapes
[rank3]: raise KeyError(
[rank3]: KeyError: "decoder.layers.0.self_attention.linear_qkv.layer_norm_weight from model not in state dict: ['decoder.final_layernorm._extra_state/shard_0_1', 'decoder.final_layernorm.weight', 'decoder.layers.0.input_layernorm._extra_state/shard_0_1', 'decoder.layers.0.input_layernorm.weight', 'decoder.layers.0.mlp.linear_fc1._extra_state/shard_0_1', 'decoder.layers.0.mlp.linear_fc1.layer_norm_weight', 'decoder.layers.0.mlp.linear_fc1.weight', 'decoder.layers.0.mlp.linear_fc2._extra_state/shard_0_1', 'decoder.layers.0.mlp.linear_fc2.weight', 'decoder.layers.0.self_attention.linear_kv_down_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_kv_down_proj.weight', 'decoder.layers.0.self_attention.linear_kv_up_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_kv_up_proj.layer_norm_weight', 'decoder.layers.0.self_attention.linear_kv_up_proj.weight', 'decoder.layers.0.self_attention.linear_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_proj.weight', 'decoder.layers.0.self_attention.linear_q_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_q_proj.weight', 'decoder.layers.1.input_layernorm._extra_state/shard_0_1', 'decoder.layers.1.input_layernorm.weight', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_0_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_10_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_11_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_12_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_13_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_14_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_15_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_16_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_17_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_18_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_19_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_1_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_20_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_21_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_22_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_23_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_24_64',
```

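For what it's worth, the key list in the error reads like a layer-spec mismatch rather than a corrupted checkpoint: the loading model asks for the fused-QKV parameter decoder.layers.0.self_attention.linear_qkv.layer_norm_weight, while the checkpoint stores the multi-latent attention (MLA) parameter set (linear_q_proj, linear_kv_down_proj, linear_kv_up_proj.layer_norm_weight). Below is a hedged sketch comparing the two GPT layer specs in mcore, assuming the multi_latent_attention and qk_layernorm arguments exist under these names in r0.15:

```python
# Compare the attention submodules produced by the two GPT layer specs.
# The MLA spec should name linear_q_proj / linear_kv_down_proj /
# linear_kv_up_proj (what the checkpoint contains), while the default TE
# spec fuses the input layernorm into linear_qkv, which is where the
# expected linear_qkv.layer_norm_weight comes from.
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)

mla_spec = get_gpt_layer_with_transformer_engine_spec(
    multi_latent_attention=True, qk_layernorm=True  # assumed kwargs
)
default_spec = get_gpt_layer_with_transformer_engine_spec()

print(mla_spec.submodules.self_attention.submodules)
print(default_spec.submodules.self_attention.submodules)
```

If that is indeed the cause, the training side presumably needs to be launched with MLA enabled (e.g. Megatron-LM's --multi-latent-attention argument, if I am reading the r0.15 arguments correctly) so that the model it builds matches the converted checkpoint.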