[model merger] bug: verl/model_merger is not merging LoRA weights #4553

@YongchengYAO

Description

System Info

----------Python Info----------
Version : 3.12.0
Compiler : GCC 11.2.0
Build : ('main', 'Oct 2 2023 17:29:18')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.3
vllm : 0.11.0
sglang : 0.5.2
ray : 2.52.1
torch : 2.8.0
----------verl Info-----------
Version : 0.7.0.dev
Directory : /mnt/vincent-pvc-rwm/verl/verl
Commit Hash : 707d46c782dc2e862b33062cf71c6e83c60932be
----------Platform Info----------
Platform : Linux-6.8.0-55-generic-x86_64-with-glibc2.35
system : Linux
release : 6.8.0-55-generic
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

verl/model_merger (doc is here) is not working as expected.

When converting FSDP checkpoints to HF model format, the core functions are:

  1. https://github.com/volcengine/verl/blob/7cb647d792582c117f2f37ca1c1a56128f9d90ba/verl/model_merger/fsdp_model_merger.py#L206C4-L225C45
def merge_and_save(self):
        world_size = self._get_world_size()
        rank_zero_state_dict = self._load_rank_zero_state_dict(world_size)

        mesh, mesh_dim_names = self._extract_device_mesh_info(rank_zero_state_dict, world_size)
        print(f"Got device mesh {mesh}, mesh_dim_names {mesh_dim_names}")

        total_shards, mesh_shape = self._calculate_shard_configuration(mesh, mesh_dim_names)
        print(f"Processing model shards with {total_shards} {mesh_shape} in total")

        merged_state_dict = self._load_and_merge_state_dicts(world_size, total_shards, mesh_shape, mesh_dim_names)

        if self.config.operation == "test":
            if not self.config.test_hf_dir:
                raise ValueError("test_hf_dir must be provided for test operation")
            self._validate_state_dict(merged_state_dict)
        elif self.config.operation == "merge":
            self.save_hf_model_and_tokenizer(merged_state_dict)
            if self.config.hf_upload:
                self.upload_to_huggingface()
        else:
            raise ValueError(f"Unknown operation: {self.config.operation}")
  2. https://github.com/volcengine/verl/blob/7cb647d792582c117f2f37ca1c1a56128f9d90ba/verl/model_merger/base_model_merger.py#L292C5-L317C62
def save_hf_model_and_tokenizer(self, state_dict: dict[str, torch.Tensor]):
        auto_model_class = self.get_transformers_auto_model_class()
        with init_empty_weights():
            model = auto_model_class.from_config(
                self.model_config, torch_dtype=torch.bfloat16, trust_remote_code=self.config.trust_remote_code
            )
        model.to_empty(device="cpu")
        model = self.patch_model_generation_config(model)

        lora_path = self.save_lora_adapter(state_dict)
        if lora_path:
            print(f"Saving lora adapter to {lora_path}")

        print(f"Saving model to {self.config.target_dir}")
        model.save_pretrained(self.config.target_dir, state_dict=state_dict)
        del state_dict
        del model

        processor = hf_processor(self.hf_model_config_path, trust_remote_code=self.config.trust_remote_code)
        tokenizer = hf_tokenizer(self.hf_model_config_path, trust_remote_code=self.config.trust_remote_code)
        if processor is not None:
            print(f"Saving processor to {self.config.target_dir}")
            processor.save_pretrained(self.config.target_dir)
        if tokenizer is not None:
            print(f"Saving tokenizer to {self.config.target_dir}")
            tokenizer.save_pretrained(self.config.target_dir)

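I cannot see any merging operation in save_hf_model_and_tokenizer(). It calls save_lora_adapter(), prints the adapter path, and then passes the state dict to model.save_pretrained(); nowhere are the LoRA weights added to the base weights. ⚠️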

Only the base model without LoRA is saved.
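
One way to check what actually ends up in a state dict is to bucket its parameter names. This is a minimal sketch, not part of verl: the helper name `summarize_state_dict_keys` is hypothetical, and the `.lora_A.` / `.lora_B.` / `.base_layer.` naming is an assumption based on the usual PEFT convention rather than something verified against verl's checkpoints.

```python
def summarize_state_dict_keys(keys):
    """Bucket parameter names into LoRA adapter, wrapped base-layer, and other.

    Assumes PEFT-style naming (an assumption, not verified against verl):
    adapter weights contain ".lora_A." or ".lora_B.", and the wrapped
    original weights contain ".base_layer.".
    """
    summary = {"lora": [], "base_layer": [], "other": []}
    for key in keys:
        if ".lora_A." in key or ".lora_B." in key:
            summary["lora"].append(key)
        elif ".base_layer." in key:
            summary["base_layer"].append(key)
        else:
            summary["other"].append(key)
    return summary
```

Running this on `merged_state_dict.keys()` before saving, and again on the key names of the saved checkpoint, would show whether adapter weights were present and whether they were dropped. Note that an absence of LoRA keys in the output does not by itself prove the weights were merged; that needs a comparison against the base model's tensors.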

Expected behavior

Merge LoRA to the base model and save the merged model in HF .safetensors format.
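
To make the expected behavior concrete, here is a rough sketch of folding LoRA weights into the base weights of a state dict before it is saved, using W' = W + (lora_alpha / r) * B @ A. It is an illustration under assumptions, not verl's implementation: the function name `merge_lora_into_state_dict` is hypothetical, the key layout (`<module>.base_layer.weight`, `<module>.lora_A.<adapter>.weight`, `<module>.lora_B.<adapter>.weight`) is assumed from the usual PEFT convention, and only plain linear layers with standard (non-rsLoRA) scaling are handled.

```python
import torch


def merge_lora_into_state_dict(state_dict, lora_alpha, rank, adapter_name="default"):
    """Return a new state dict with LoRA deltas folded into the base weights.

    Assumed key layout (PEFT convention, not verified against verl):
      <module>.base_layer.weight              shape (out_features, in_features)
      <module>.lora_A.<adapter_name>.weight   shape (rank, in_features)
      <module>.lora_B.<adapter_name>.weight   shape (out_features, rank)
    """
    scaling = lora_alpha / rank
    base_suffix = ".base_layer.weight"
    merged = {}
    for key, tensor in state_dict.items():
        if ".lora_A." in key or ".lora_B." in key:
            continue  # adapter weights are folded into the base weight below
        if key.endswith(base_suffix):
            module = key[: -len(base_suffix)]
            a_key = f"{module}.lora_A.{adapter_name}.weight"
            b_key = f"{module}.lora_B.{adapter_name}.weight"
            weight = tensor
            if a_key in state_dict and b_key in state_dict:
                # Compute the low-rank update in float32, then cast back.
                delta = state_dict[b_key].float() @ state_dict[a_key].float()
                weight = (tensor.float() + scaling * delta).to(tensor.dtype)
            merged[module + ".weight"] = weight
        else:
            # Drop the wrapper segment from other wrapped params (e.g. bias).
            merged[key.replace(".base_layer.", ".")] = tensor
    return merged
```

Under these assumptions, the returned dict could be passed to `model.save_pretrained(..., state_dict=...)` in place of the unmerged one. Any extra key prefix added by the adapter wrapper would still need to be stripped separately, and `lora_alpha` and `rank` would have to come from the adapter config.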

Labels: bug