Is parallel (multi-process) checkpoint conversion supported? #1984

@liyandong001

Description

Is your feature request related to a problem? Please describe.
I’m converting a Megatron checkpoint to HuggingFace safetensors using Megatron-Bridge.
To speed up conversion for a large model (e.g., 80B with TP=2, PP=4), I modified the conversion/export script to run in parallel (multi-process) by splitting output files and writing shards concurrently.

After enabling parallel conversion, the generated HuggingFace checkpoint appears to be incorrect:
• Shapes/dtypes look mostly consistent, but some parameters’ values differ from the expected “known-good” HF safetensors (or the model produces incorrect outputs).
• The issue only happens with the parallel version; the original sequential conversion produces correct weights.

I suspect a non-deterministic or non-process-safe step in the conversion pipeline, e.g., parameter collection/ordering, TP/PP gather logic, key-to-filename mapping, or write ordering.
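One way to rule out ordering races when parallelizing a conversion like this is to make the key-to-shard plan a pure function of the parameter names and sizes, so every worker derives the identical plan independently and writes only its own shard file. The sketch below is illustrative only, assuming a dict of parameter names to element counts; `plan_shards` is a hypothetical helper, not a Megatron-Bridge or HuggingFace API.

```python
def plan_shards(param_sizes, num_shards):
    """Greedy size-balanced assignment over a deterministically sorted key list.

    Sorting keys (largest first, ties broken by name) makes the plan identical
    in every process, regardless of the iteration order in which parameters
    were collected — removing one common source of parallel-conversion skew.
    """
    shards = [[] for _ in range(num_shards)]
    loads = [0] * num_shards
    for name, size in sorted(param_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        i = loads.index(min(loads))  # place on the least-loaded shard
        shards[i].append(name)
        loads[i] += size
    return shards

# Toy example: four tensors split across two shard files.
sizes = {"embed.weight": 1000, "lm_head.weight": 1000,
         "layers.0.attn.qkv": 300, "layers.0.mlp.fc1": 400}
plan = plan_shards(sizes, num_shards=2)
print(plan)
# → [['embed.weight', 'layers.0.mlp.fc1'], ['lm_head.weight', 'layers.0.attn.qkv']]
```

With a fixed plan like this, each worker process can open and write exactly one output file, so no cross-process write ordering exists to race on; the shard index (key → filename mapping) can be emitted once from the same plan.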


Metadata

Labels

    community-request, feature (New capabilities, enhancements, or enablement work)
