Description
System Info
----------Python Info----------
Version : 3.12.0
Compiler : GCC 11.2.0
Build : ('main', 'Oct 2 2023 17:29:18')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.3
vllm : 0.11.0
sglang : 0.5.2
ray : 2.52.1
torch : 2.8.0
----------verl Info-----------
Version : 0.7.0.dev
Directory : /mnt/vincent-pvc-rwm/verl/verl
Commit Hash : 707d46c782dc2e862b33062cf71c6e83c60932be
----------Platform Info----------
Platform : Linux-6.8.0-55-generic-x86_64-with-glibc2.35
system : Linux
release : 6.8.0-55-generic
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
verl/model_merger (doc is here) does not work as expected when the checkpoint contains a LoRA adapter.
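For context, I invoke the merger along the lines of the documented usage; the checkpoint path below is a placeholder for my actual run:

python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir /path/to/checkpoints/global_step_N/actor \
    --target_dir /path/to/merged_hf_model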
When converting FSDP checkpoints to the HF model format, the core function is:
def merge_and_save(self):
    world_size = self._get_world_size()
    rank_zero_state_dict = self._load_rank_zero_state_dict(world_size)

    mesh, mesh_dim_names = self._extract_device_mesh_info(rank_zero_state_dict, world_size)
    print(f"Got device mesh {mesh}, mesh_dim_names {mesh_dim_names}")

    total_shards, mesh_shape = self._calculate_shard_configuration(mesh, mesh_dim_names)
    print(f"Processing model shards with {total_shards} {mesh_shape} in total")

    merged_state_dict = self._load_and_merge_state_dicts(world_size, total_shards, mesh_shape, mesh_dim_names)

    if self.config.operation == "test":
        if not self.config.test_hf_dir:
            raise ValueError("test_hf_dir must be provided for test operation")
        self._validate_state_dict(merged_state_dict)
    elif self.config.operation == "merge":
        self.save_hf_model_and_tokenizer(merged_state_dict)
        if self.config.hf_upload:
            self.upload_to_huggingface()
    else:
        raise ValueError(f"Unknown operation: {self.config.operation}")

def save_hf_model_and_tokenizer(self, state_dict: dict[str, torch.Tensor]):
    auto_model_class = self.get_transformers_auto_model_class()
    with init_empty_weights():
        model = auto_model_class.from_config(
            self.model_config, torch_dtype=torch.bfloat16, trust_remote_code=self.config.trust_remote_code
        )
    model.to_empty(device="cpu")
    model = self.patch_model_generation_config(model)

    lora_path = self.save_lora_adapter(state_dict)
    if lora_path:
        print(f"Saving lora adapter to {lora_path}")

    print(f"Saving model to {self.config.target_dir}")
    model.save_pretrained(self.config.target_dir, state_dict=state_dict)
    del state_dict
    del model

    processor = hf_processor(self.hf_model_config_path, trust_remote_code=self.config.trust_remote_code)
    tokenizer = hf_tokenizer(self.hf_model_config_path, trust_remote_code=self.config.trust_remote_code)
    if processor is not None:
        print(f"Saving processor to {self.config.target_dir}")
        processor.save_pretrained(self.config.target_dir)
    if tokenizer is not None:
        print(f"Saving tokenizer to {self.config.target_dir}")
        tokenizer.save_pretrained(self.config.target_dir)

I cannot see any merging operation anywhere in save_hf_model_and_tokenizer()! save_lora_adapter() appears to write the adapter weights to a separate directory, and save_pretrained() then saves the state_dict as-is, so only the base model without the LoRA weights is saved.
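For reference, here is a minimal sketch of the state-dict-level merge I would expect to happen before model.save_pretrained() is called. The merge_lora_weights() helper, the PEFT-style .lora_A.weight / .lora_B.weight key naming, and the lora_alpha / r scaling rule are my assumptions for illustration, not verl's actual API:

import torch

def merge_lora_weights(state_dict: dict[str, torch.Tensor], lora_alpha: float, r: int) -> dict[str, torch.Tensor]:
    """Hypothetical helper: fold LoRA deltas (W += scaling * B @ A) into the
    base weights and drop the adapter tensors, assuming PEFT-style key names."""
    scaling = lora_alpha / r
    merged = {}
    for key, weight in state_dict.items():
        if ".lora_A." in key or ".lora_B." in key:
            continue  # adapter tensors are consumed below, never saved directly
        lora_a_key = key.replace(".weight", ".lora_A.weight")
        lora_b_key = key.replace(".weight", ".lora_B.weight")
        if lora_a_key in state_dict and lora_b_key in state_dict:
            lora_a = state_dict[lora_a_key].to(torch.float32)  # (r, in_features)
            lora_b = state_dict[lora_b_key].to(torch.float32)  # (out_features, r)
            weight = weight.to(torch.float32) + scaling * (lora_b @ lora_a)
            weight = weight.to(torch.bfloat16)  # match the bf16 save dtype above
        merged[key] = weight
    return merged

With something like this, merge_lora_weights(state_dict, lora_alpha, r) could be passed to save_pretrained() instead of the raw state_dict.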
Expected behavior
Merge the LoRA adapter into the base model and save the merged model in HF .safetensors format.
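Until this is fixed, a possible workaround is to do the merge manually with PEFT after running the merger, assuming the directory written by save_lora_adapter() is a PEFT-loadable adapter; all paths below are placeholders:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: the (unmerged) model_merger output and the adapter
# directory it reports via "Saving lora adapter to ...".
base_path = "/path/to/merged_hf_model"
adapter_path = "/path/to/merged_hf_model/lora_adapter"
out_path = "/path/to/fully_merged_model"

base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()  # folds W += scaling * B @ A into the base weights
merged.save_pretrained(out_path, safe_serialization=True)  # writes .safetensors
AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)

This relies only on standard PEFT/Transformers APIs (PeftModel.from_pretrained, merge_and_unload), not on anything verl-specific.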