
DeepSeek-V2(-Lite) checkpoint converted by megatron-bridge cannot be loaded by Megatron mcore (r0.15): attention LN parameter schema mismatch #1821

Description

@yubin1991

Hi,

I encountered a checkpoint compatibility issue: converting DeepSeek-V2-Lite from HuggingFace format to Megatron mcore format with megatron-bridge succeeds, but loading the result for continued pretraining (CPT) with Megatron-LM core r0.15 fails.

The issue reproduces even with tp=1, pp=1, ep=4.

Convert script:

```bash
python3 -m torch.distributed.run --nproc_per_node=4 \
    examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
    --hf-model-id deepseek-ai/DeepSeek-V2-Lite --tp 1 --pp 1 --ep 4 \
    --megatron-save-path /data/DeepSeek-V2-Lite-tp1-pp1-ep4
```
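
For reference, the parameter names the converter actually wrote can be listed from the saved checkpoint's metadata. Megatron's torch_dist format is a PyTorch distributed checkpoint, so this minimal sketch uses the public torch.distributed.checkpoint API; the iteration subdirectory name below is an assumption and may differ in your setup:

```python
# List the attention parameter names stored in the converted checkpoint.
# The .metadata file of a torch_dist checkpoint can be read offline with
# FileSystemReader, no model build or GPU required.
from torch.distributed.checkpoint import FileSystemReader

ckpt_dir = "/data/DeepSeek-V2-Lite-tp1-pp1-ep4/iter_0000000"  # hypothetical subdir
metadata = FileSystemReader(ckpt_dir).read_metadata()
for key in sorted(metadata.state_dict_metadata):
    if ".self_attention." in key:
        print(key)
```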

When I load the mcore checkpoint with megatron-core 0.15.0, I get the following error:

```
[rank3]: Traceback (most recent call last): (RANK 3)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 192, in reduce_scatter
[rank3]: local_data = map_fun()
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank3]: result = func(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_loader.py", line 266, in local_step
[rank3]: local_plan = planner.create_local_plan()
[rank3]: File "/Megatron-LM-core_r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 647, in create_local_plan
[rank3]: self._validate_global_shapes(self.metadata, self.shapes_validation_sharded_tensors)
[rank3]: File "//Megatron-LM-core_r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 600, in _validate_global_shapes
[rank3]: raise KeyError(
[rank3]: KeyError: "decoder.layers.0.self_attention.linear_qkv.layer_norm_weight from model not in state dict: ['decoder.final_layernorm._extra_state/shard_0_1', 'decoder.final_layernorm.weight', 'decoder.layers.0.input_layernorm._extra_state/shard_0_1', 'decoder.layers.0.input_layernorm.weight', 'decoder.layers.0.mlp.linear_fc1._extra_state/shard_0_1', 'decoder.layers.0.mlp.linear_fc1.layer_norm_weight', 'decoder.layers.0.mlp.linear_fc1.weight', 'decoder.layers.0.mlp.linear_fc2._extra_state/shard_0_1', 'decoder.layers.0.mlp.linear_fc2.weight', 'decoder.layers.0.self_attention.linear_kv_down_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_kv_down_proj.weight', 'decoder.layers.0.self_attention.linear_kv_up_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_kv_up_proj.layer_norm_weight', 'decoder.layers.0.self_attention.linear_kv_up_proj.weight', 'decoder.layers.0.self_attention.linear_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_proj.weight', 'decoder.layers.0.self_attention.linear_q_proj._extra_state/shard_0_1', 'decoder.layers.0.self_attention.linear_q_proj.weight', 'decoder.layers.1.input_layernorm._extra_state/shard_0_1', 'decoder.layers.1.input_layernorm.weight', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_0_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_10_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_11_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_12_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_13_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_14_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_15_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_16_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_17_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_18_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_19_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_1_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_20_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_21_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_22_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_23_64', 'decoder.layers.1.mlp.experts.experts.linear_fc1._extra_state/shard_24_64',
```

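For what it's worth, the key list in the error reads like a layer-spec mismatch rather than a corrupted checkpoint: the loading model asks for the fused-QKV parameter decoder.layers.0.self_attention.linear_qkv.layer_norm_weight, while the checkpoint stores the multi-latent attention (MLA) parameter set (linear_q_proj, linear_kv_down_proj, linear_kv_up_proj.layer_norm_weight). Below is a hedged sketch comparing the two GPT layer specs in mcore, assuming the multi_latent_attention and qk_layernorm arguments exist under these names in r0.15:

```python
# Compare the attention submodules produced by the two GPT layer specs.
# The MLA spec should name linear_q_proj / linear_kv_down_proj /
# linear_kv_up_proj (what the checkpoint contains), while the default TE
# spec fuses the input layernorm into linear_qkv, which is where the
# expected linear_qkv.layer_norm_weight comes from.
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)

mla_spec = get_gpt_layer_with_transformer_engine_spec(
    multi_latent_attention=True, qk_layernorm=True  # assumed kwargs
)
default_spec = get_gpt_layer_with_transformer_engine_spec()

print(mla_spec.submodules.self_attention.submodules)
print(default_spec.submodules.self_attention.submodules)
```

If that is indeed the cause, the training side presumably needs to be launched with MLA enabled (e.g. Megatron-LM's --multi-latent-attention argument, if I am reading the r0.15 arguments correctly) so that the model it builds matches the converted checkpoint.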