Conversation

@zhenga1 (Contributor) commented Nov 2, 2025

Wrote a script for converting a DeepSpeed model shard into Hugging Face .safetensors format.

Filepath:
~/skyrl-train/scripts/convert_deepspeed_to_hf.py

Tested on Qwen 0.5B and other models trained with run_gsm8k.sh using the DeepSpeed backend.
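
For context, a minimal sketch of the end-to-end flow, assuming DeepSpeed's stock zero_to_fp32 helpers and hypothetical paths (not the script's actual CLI):

from transformers import AutoModelForCausalLM
# Merges the sharded ZeRO checkpoint into a single fp32 state dict in memory.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "path/to/deepspeed_checkpoint"  # hypothetical checkpoint root
hf_base = "path/to/hf_base"                # hypothetical dir with the base HF export
out_dir = "path/to/hf_output"              # hypothetical output dir

# 1. Merge the ZeRO shards into one fp32 state dict, no intermediate file needed.
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

# 2. Load the base model and overwrite its weights with the merged ones.
model = AutoModelForCausalLM.from_pretrained(hf_base)
model.load_state_dict(state_dict, strict=False)

# 3. safe_serialization=True writes .safetensors shards.
model.save_pretrained(out_dir, safe_serialization=True)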

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds a useful script for converting DeepSpeed checkpoints to the Hugging Face Safetensors format. The script is well-structured and includes helpful features like validation. I've found a critical bug in argument parsing that would cause a crash, a security vulnerability with subprocess, and several areas for improvement regarding robustness, code clarity, and adherence to best practices. My comments include suggestions to fix these issues.

Comment on lines +50 to +54
ROOT = deepspeed_model_path
POLICY_DIR = ROOT / "policy"
HF_BASE = POLICY_DIR / "huggingface"
OUT_DIR = POLICY_DIR / "huggingface_converted" if not out_dir else out_dir
MERGED_FP32 = OUT_DIR / "merged_model" # directory that will store the ultimate pytorch weights.
@gemini-code-assist (bot), severity medium:

According to PEP 8, constants are named in all capital letters with underscores. These variables (ROOT, POLICY_DIR, etc.) are not true constants as their values are derived from function arguments. They should be named using snake_case (e.g., root, policy_dir) for better readability and to follow standard Python conventions. This would require updating their usage throughout the function.
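
For illustration, the renamed block might read (same logic as the snippet above, with a trailing ellipsis for the rest of the function):

from pathlib import Path

def main(deepspeed_model_path: Path, out_dir: Path = None) -> Path:
    root = deepspeed_model_path
    policy_dir = root / "policy"
    hf_base = policy_dir / "huggingface"
    out_dir = out_dir or policy_dir / "huggingface_converted"
    merged_fp32 = out_dir / "merged_model"  # directory that will store the merged fp32 weights
    ...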

Comment on lines +81 to +84
merged_bin = MERGED_FP32 / "pytorch_model.bin"
hf_model_bin = HF_BASE / "pytorch_model.bin"
shutil.copy2(merged_bin, hf_model_bin)
print(f" Copied to: {hf_model_bin}")
@gemini-code-assist (bot), severity medium:

Copying the merged model binary (pytorch_model.bin) into the HF_BASE directory is problematic. It modifies an input directory, which is a side effect that should be avoided. This copy is also redundant because you are already loading the state dictionary from merged_bin and then explicitly loading it into the model with model.load_state_dict(state, strict=False). The from_pretrained call will initialize a model from the config (with random weights if no checkpoint is found), and load_state_dict will then correctly populate its weights.
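
A minimal sketch of the copy-free flow, with HF_BASE and merged_bin standing in for the script's paths (from_config is used so nothing needs to be read from or written to HF_BASE beyond the config):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

HF_BASE = "path/to/policy/huggingface"                 # as in the script
merged_bin = "path/to/merged_model/pytorch_model.bin"  # as in the script

# Build the model from the config alone; the input directory is never modified.
cfg = AutoConfig.from_pretrained(HF_BASE)
model = AutoModelForCausalLM.from_config(cfg)

# Populate the weights directly from the merged fp32 checkpoint.
state = torch.load(merged_bin, map_location="cpu")
model.load_state_dict(state, strict=False)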


# === 3. Load HF config and initialize model ===
print("[3/5] Initializing Hugging Face model ...")
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
@gemini-code-assist (bot), severity medium:

The script hardcodes AutoModelForCausalLM when initializing the model. You have a helper function guess_hf_class that can determine the correct model class from the configuration, which would make the script more robust and applicable to a wider range of models (e.g., sequence-to-sequence). This should be used here, and also in the final summary message on line 107. I've also added trust_remote_code=True for consistency with the validate_load function, as it's often required for custom models.

Suggested change
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.float16)
cfg = AutoConfig.from_pretrained(HF_BASE, trust_remote_code=True)
HFClass = guess_hf_class(cfg)
model = HFClass.from_pretrained(HF_BASE, torch_dtype=torch.float16, trust_remote_code=True)

return Path(OUT_DIR)


def guess_hf_class(cfg: AutoConfig):
Member:

A bit excessive; we only support training decoder-only AutoModelForCausalLM archs right now.

# === Directories ===
def main(deepspeed_model_path: Path, out_dir: Path = None) -> Path:
ROOT = deepspeed_model_path
POLICY_DIR = ROOT / "policy"
Contributor:

When I ran with the DeepSpeed backend, my checkpoint dir didn't have a policy folder. I'd honestly prefer to remove that and assume a structure like:

some_deepspeed_checkpoint/
├── latest
├── global_step123/
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│   └── ...
├── global_step124/
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│   └── ...
└── zero_to_fp32.py
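
With that layout, resolving the newest step directory is straightforward, since DeepSpeed records the current tag in the latest file; a sketch with a hypothetical path:

from pathlib import Path

ckpt_root = Path("some_deepspeed_checkpoint")      # hypothetical checkpoint root
tag = (ckpt_root / "latest").read_text().strip()   # e.g. "global_step124"
step_dir = ckpt_root / tag                         # directory holding the ZeRO shards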

Contributor:

Unless I just ran it wrong and the policy dir is expected


# === 3. Load HF config and initialize model ===
print("[3/5] Initializing Hugging Face model ...")
model = AutoModelForCausalLM.from_pretrained(HF_BASE, torch_dtype=torch.bfloat16)
Contributor:

nit: torch_dtype is deprecated in favor of dtype
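
e.g., on recent transformers releases:

import torch
from transformers import AutoModelForCausalLM

HF_BASE = "path/to/policy/huggingface"  # as in the script

# dtype replaces the deprecated torch_dtype kwarg.
model = AutoModelForCausalLM.from_pretrained(HF_BASE, dtype=torch.bfloat16)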

@zhenga1 (Contributor, Author) commented Nov 10, 2025

Thank you so much @pbokc for your comments. I will rerun DeepSpeed, take a look, and get back to you.

@pbokc (Contributor) commented Jan 6, 2026

@zhenga1 we can close since the DeepSpeed backend is being deprecated.
