Conversation

@zhenga1 (Contributor) commented Nov 2, 2025

Wrote a script for converting a DeepSpeed model shard into Hugging Face .safetensors format.

Filepath:
~/skyrl-train/scripts/convert_deepspeed_to_hf.py

Tested on Qwen 0.5B and other models trained with run_gsm8k.sh using the DeepSpeed backend.
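
For context, a minimal sketch of the end-to-end flow, assuming DeepSpeed's stock zero_to_fp32 helpers and hypothetical paths (not the script's actual CLI):

from transformers import AutoModelForCausalLM
# Merges the sharded ZeRO checkpoint into a single fp32 state dict in memory.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "path/to/deepspeed_checkpoint"  # hypothetical checkpoint root
hf_base = "path/to/hf_base"                # hypothetical dir with the base HF export
out_dir = "path/to/hf_output"              # hypothetical output dir

# 1. Merge the ZeRO shards into one fp32 state dict, no intermediate file needed.
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

# 2. Load the base model and overwrite its weights with the merged ones.
model = AutoModelForCausalLM.from_pretrained(hf_base)
model.load_state_dict(state_dict, strict=False)

# 3. safe_serialization=True writes .safetensors shards.
model.save_pretrained(out_dir, safe_serialization=True)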

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds a useful script for converting DeepSpeed checkpoints to the Hugging Face Safetensors format. The script is well-structured and includes helpful features like validation. I've found a critical bug in argument parsing that would cause a crash, a security vulnerability with subprocess, and several areas for improvement regarding robustness, code clarity, and adherence to best practices. My comments include suggestions to fix these issues.

Comment on lines +50 to +54
ROOT = deepspeed_model_path
POLICY_DIR = ROOT / "policy"
HF_BASE = POLICY_DIR / "huggingface"
OUT_DIR = POLICY_DIR / "huggingface_converted" if not out_dir else out_dir
MERGED_FP32 = OUT_DIR / "merged_model" # directory that will store the ultimate pytorch weights.
@gemini-code-assist (bot), severity medium:

According to PEP 8, constants are named in all capital letters with underscores. These variables (ROOT, POLICY_DIR, etc.) are not true constants as their values are derived from function arguments. They should be named using snake_case (e.g., root, policy_dir) for better readability and to follow standard Python conventions. This would require updating their usage throughout the function.
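
For illustration, the renamed block might read (same logic as the snippet above, with a trailing ellipsis for the rest of the function):

from pathlib import Path

def main(deepspeed_model_path: Path, out_dir: Path = None) -> Path:
    root = deepspeed_model_path
    policy_dir = root / "policy"
    hf_base = policy_dir / "huggingface"
    out_dir = out_dir or policy_dir / "huggingface_converted"
    merged_fp32 = out_dir / "merged_model"  # directory that will store the merged fp32 weights
    ...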

Comment on lines +81 to +84
merged_bin = MERGED_FP32 / "pytorch_model.bin"
hf_model_bin = HF_BASE / "pytorch_model.bin"
shutil.copy2(merged_bin, hf_model_bin)
print(f" Copied to: {hf_model_bin}")
@gemini-code-assist (bot), severity medium:

Copying the merged model binary (pytorch_model.bin) into the HF_BASE directory is problematic. It modifies an input directory, which is a side effect that should be avoided. This copy is also redundant because you are already loading the state dictionary from merged_bin and then explicitly loading it into the model with model.load_state_dict(state, strict=False). The from_pretrained call will initialize a model from the config (with random weights if no checkpoint is found), and load_state_dict will then correctly populate its weights.
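
A minimal sketch of the copy-free flow, with HF_BASE and merged_bin standing in for the script's paths (from_config is used so nothing needs to be read from or written to HF_BASE beyond the config):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

HF_BASE = "path/to/policy/huggingface"                 # as in the script
merged_bin = "path/to/merged_model/pytorch_model.bin"  # as in the script

# Build the model from the config alone; the input directory is never modified.
cfg = AutoConfig.from_pretrained(HF_BASE)
model = AutoModelForCausalLM.from_config(cfg)

# Populate the weights directly from the merged fp32 checkpoint.
state = torch.load(merged_bin, map_location="cpu")
model.load_state_dict(state, strict=False)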


# === 3. Load HF config and initialize model ===
print("[3/5] Initializing Hugging Face model ...")
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
@gemini-code-assist (bot), severity medium:

The script hardcodes AutoModelForCausalLM when initializing the model. You have a helper function guess_hf_class that can determine the correct model class from the configuration, which would make the script more robust and applicable to a wider range of models (e.g., sequence-to-sequence). This should be used here, and also in the final summary message on line 107. I've also added trust_remote_code=True for consistency with the validate_load function, as it's often required for custom models.

Suggested change
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
cfg = AutoConfig.from_pretrained(HF_BASE, trust_remote_code=True)
HFClass = guess_hf_class(cfg)
model = HFClass.from_pretrained(HF_BASE, torch_dtype=torch.float16, trust_remote_code=True)

return Path(OUT_DIR)


def guess_hf_class(cfg: AutoConfig):
Member:

A bit excessive; we only support training decoder-only AutoModelForCausalLM archs right now.

# === Directories ===
def main(deepspeed_model_path: Path, out_dir: Path = None) -> Path:
ROOT = deepspeed_model_path
POLICY_DIR = ROOT / "policy"
Contributor:

When I ran with the DeepSpeed backend, my checkpoint dir didn't have a policy folder. I'd honestly prefer to remove that and assume a structure like:

some_deepspeed_checkpoint/
├── latest
├── global_step123/
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│   └── ...
├── global_step124/
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│   └── ...
└── zero_to_fp32.py
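
With that layout, resolving the newest step directory is straightforward, since DeepSpeed records the current tag in the latest file; a sketch with a hypothetical path:

from pathlib import Path

ckpt_root = Path("some_deepspeed_checkpoint")      # hypothetical checkpoint root
tag = (ckpt_root / "latest").read_text().strip()   # e.g. "global_step124"
step_dir = ckpt_root / tag                         # directory holding the ZeRO shards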

Contributor:

Unless I just ran it wrong and the policy dir is expected


# === 3. Load HF config and initialize model ===
print("[3/5] Initializing Hugging Face model ...")
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.bfloat16)
Contributor:

nit: torch_dtype is deprecated in favor of dtype
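
e.g., on recent transformers releases:

import torch
from transformers import AutoModelForCausalLM

HF_BASE = "path/to/policy/huggingface"  # as in the script

# dtype replaces the deprecated torch_dtype kwarg.
model = AutoModelForCausalLM.from_pretrained(HF_BASE, dtype=torch.bfloat16)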

@zhenga1 (Contributor, Author) commented Nov 10, 2025

Thank you so much @pbokc for your comments. I will rerun DeepSpeed, take a look, and get back to you.

@pbokc (Contributor) commented Jan 6, 2026

@zhenga1 we can close since the DeepSpeed backend is being deprecated.
