Added Convert Deepspeed to Huggingface Safetensors script #619
base: main
Changes from 2 commits
New file: scripts/convert_deepspeed_to_hf.py (@@ -0,0 +1,164 @@)

```python
| """ | ||||||||||
| Systematic converter: DeepSpeed ZeRO checkpoint → Hugging Face safetensors model. | ||||||||||
|
|
||||||||||
| Assumptions: | ||||||||||
| - You have a structure like: | ||||||||||
| data.pt | ||||||||||
| trainer_state.pt | ||||||||||
| policy/ | ||||||||||
| ├── global_step_x/ | ||||||||||
| │ ├── zero_pp_rank_0_mp_rank_00_model_states.pt | ||||||||||
| │ └── zero_pp_rank_0_mp_rank_00_optim_states.pt | ||||||||||
| ├── huggingface/ | ||||||||||
| │ ├── config.json, tokenizer.json, etc. | ||||||||||
| └── zero_to_fp32.py | ||||||||||
| └── latest | ||||||||||
|
|
||||||||||
|
|
||||||||||
| Output: | ||||||||||
| policy/huggingface_converted/model.safetensors (+ copied config/tokenizer) | ||||||||||
|
|
||||||||||
| For Deepspeed model shards, the output directory will be created with the following structure: | ||||||||||
| . | ||||||||||
| ├── added_tokens.json | ||||||||||
| ├── chat_template.jinja (optional: this file is for chat specific tasks) | ||||||||||
| ├── config.json | ||||||||||
| ├── generation_config.json (optional: default decoding parameters) | ||||||||||
| ├── merges.txt | ||||||||||
| ├── model.safetensors | ||||||||||
| ├── special_tokens_map.json | ||||||||||
| ├── tokenizer.json | ||||||||||
| ├── tokenizer_config.json | ||||||||||
| └── vocab.json | ||||||||||
|
|
||||||||||
| Example usage: | ||||||||||
| uv run --isolated --frozen --extra vllm scripts/convert_deepspeed_to_hf.py --ckpt-dir [local_checkpoint] --out-dir [output_directory] | ||||||||||
| """ | ||||||||||
|
|
||||||||||
| import shutil | ||||||||||
| import os | ||||||||||
| import subprocess | ||||||||||
| import argparse | ||||||||||
| import torch | ||||||||||
| from pathlib import Path | ||||||||||
| from safetensors.torch import save_model | ||||||||||
| from transformers import AutoModelForCausalLM, AutoConfig, AutoModelForSeq2SeqLM, AutoModel | ||||||||||
|
|
||||||||||
|
|
||||||||||
| # === Directories === | ||||||||||
| def main(deepspeed_model_path: Path, out_dir: Path = None) -> Path: | ||||||||||
| ROOT = deepspeed_model_path | ||||||||||
| POLICY_DIR = ROOT / "policy" | ||||||||||
| HF_BASE = POLICY_DIR / "huggingface" | ||||||||||
| OUT_DIR = POLICY_DIR / "huggingface_converted" if not out_dir else out_dir | ||||||||||
| MERGED_FP32 = OUT_DIR / "merged_model" # directory that will store the ultimate pytorch weights. | ||||||||||
|
Contributor comment on lines +50 to +54:

According to PEP 8, constants are named in all capital letters with underscores. These variables (`ROOT`, `POLICY_DIR`, `HF_BASE`, `OUT_DIR`, `MERGED_FP32`) are function-local and reassignable rather than module-level constants, so lowercase names would be more appropriate here.
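A minimal sketch of that rename, assuming the reviewer means plain lowercase locals (behavior unchanged):

```python
from pathlib import Path

def main(deepspeed_model_path: Path, out_dir: Path | None = None) -> Path:
    root = deepspeed_model_path
    policy_dir = root / "policy"
    hf_base = policy_dir / "huggingface"
    out_dir = out_dir if out_dir else policy_dir / "huggingface_converted"
    merged_fp32 = out_dir / "merged_model"
    # ... rest of main() unchanged
```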
```python
    OUT_DIR.mkdir(exist_ok=True, parents=True)

    # === 1. Merge ZeRO shards into single FP32 checkpoint ===
    zero2fp32_script = POLICY_DIR / "zero_to_fp32.py"
    if not MERGED_FP32.exists():
        print(f"[1/5] Merging ZeRO shards from {POLICY_DIR} ...")
        # Pass the command as an argv list; a single string would require shell=True.
        cmd = ["python", str(zero2fp32_script), str(POLICY_DIR), str(MERGED_FP32)]
        result = subprocess.run(cmd)
        if result.returncode != 0:
            raise RuntimeError("zero_to_fp32.py merge failed.")
    else:
        print(f"[1/5] Merged model already exists → {MERGED_FP32}")
```
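For reference, DeepSpeed also exposes this merge as a Python API, which avoids the subprocess round trip; a sketch, assuming a reasonably recent DeepSpeed release is importable in this environment:

```python
# Sketch: in-process alternative to shelling out to zero_to_fp32.py.
# The checkpoint tag is read from the `latest` file when not given explicitly.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state = get_fp32_state_dict_from_zero_checkpoint(str(POLICY_DIR))
```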
```python
    # === 2. Load merged state dict ===
    print("[2/5] Loading merged model ...")
    state = torch.load(MERGED_FP32 / "pytorch_model.bin", map_location="cpu")

    # Handle possible wrapper keys
    if isinstance(state, dict):
        for key in ["module", "model_state_dict", "state_dict"]:
            if key in state:
                state = state[key]
                break

    merged_bin = MERGED_FP32 / "pytorch_model.bin"
    hf_model_bin = HF_BASE / "pytorch_model.bin"
    shutil.copy2(merged_bin, hf_model_bin)
    print(f"  Copied to: {hf_model_bin}")
```
Contributor comment on lines +84 to +87:

Copying the merged model binary (`pytorch_model.bin`) into the original `huggingface/` checkpoint directory modifies the input checkpoint in place rather than only writing to the output directory.
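One way to avoid touching the original checkpoint (a sketch, not necessarily the reviewer's intended fix) is to build the model from the config alone and load the merged weights directly:

```python
# Sketch: initialize from config only, then load the merged state dict,
# so nothing is written back into HF_BASE. Assumes `state` from step 2.
cfg = AutoConfig.from_pretrained(HF_BASE)
model = AutoModelForCausalLM.from_config(cfg, torch_dtype=torch.float16)
model.load_state_dict(state)  # tensors are cast fp32 -> fp16 on copy
```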
```python
    # === 3. Load HF config and initialize model ===
    print("[3/5] Initializing Hugging Face model ...")
    model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
```
Suggested change:

```diff
-    model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
+    cfg = AutoConfig.from_pretrained(HF_BASE, trust_remote_code=True)
+    HFClass = guess_hf_class(cfg)
+    model = HFClass.from_pretrained(HF_BASE, torch_dtype=torch.float16, trust_remote_code=True)
```
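`guess_hf_class` isn't defined in the visible part of the diff; a minimal sketch of what such a helper might look like, assuming it dispatches on the config's declared architectures:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

def guess_hf_class(cfg):
    """Hypothetical helper: choose an Auto* class from the model config."""
    archs = getattr(cfg, "architectures", None) or []
    if any("ForCausalLM" in a for a in archs):
        return AutoModelForCausalLM
    if getattr(cfg, "is_encoder_decoder", False):
        return AutoModelForSeq2SeqLM
    return AutoModel  # bare backbone as a fallback
```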
Reply:

A bit excessive; we only support training decoder-only AutoModelForCausalLM archs right now.
Reply:

When I ran with the DeepSpeed backend, my checkpoint dir didn't have a `policy` folder. I'd honestly prefer to remove that and assume a structure like:

Reply:

Unless I just ran it wrong and the policy dir is expected.
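For completeness, the diff shown above stops at step 3 of 5. Judging from the module docstring and the as-yet-unused `save_model` import, the remaining steps presumably export the weights as safetensors and copy the tokenizer files; a sketch under those assumptions, not the author's actual code:

```python
# Hypothetical continuation of main(), inferred from the docstring's output
# layout; the real steps [4/5] and [5/5] may differ.
save_model(model, str(OUT_DIR / "model.safetensors"))  # [4/5] write safetensors

# [5/5] Copy config/tokenizer files next to the converted weights.
for name in ["config.json", "generation_config.json", "tokenizer.json",
             "tokenizer_config.json", "special_tokens_map.json",
             "added_tokens.json", "vocab.json", "merges.txt",
             "chat_template.jinja"]:
    src = HF_BASE / name
    if src.exists():
        shutil.copy2(src, OUT_DIR / name)
```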