Description
Is your feature request related to a problem? Please describe.
I’m converting a Megatron checkpoint to HuggingFace safetensors using Megatron-Bridge.
To speed up conversion for a large model (e.g., an 80B model with TP=2, PP=4), I modified the conversion/export script to run in parallel (multi-process), splitting the output files and writing shards concurrently, roughly as sketched below.
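The parallel path looks approximately like the following sketch. It is heavily simplified: `write_shard`, the round-robin key-to-shard assignment, and the `torch.load` of an already-gathered state dict are placeholders I made up for this issue, standing in for the actual Megatron-Bridge export/collection calls.

```python
# Simplified sketch of the parallel writer; not the real Megatron-Bridge code.
import json
import multiprocessing as mp

import torch
from safetensors.torch import save_file


def write_shard(args):
    shard_idx, keys, state_dict_path, out_dir = args
    # Each worker reloads the (already gathered) state dict and writes only
    # the tensors assigned to its shard.
    state_dict = torch.load(state_dict_path, map_location="cpu")
    tensors = {k: state_dict[k].contiguous() for k in keys}
    save_file(tensors, f"{out_dir}/model-{shard_idx:05d}.safetensors")
    return shard_idx, keys


def parallel_export(state_dict_path, out_dir, num_shards=8, num_workers=4):
    state_dict = torch.load(state_dict_path, map_location="cpu")
    all_keys = sorted(state_dict.keys())
    # Round-robin key-to-shard assignment; this mapping (and the write
    # ordering below) is one of the places I suspect could diverge from the
    # sequential script.
    shards = [all_keys[i::num_shards] for i in range(num_shards)]
    jobs = [(i, keys, state_dict_path, out_dir) for i, keys in enumerate(shards)]
    with mp.Pool(num_workers) as pool:
        results = pool.map(write_shard, jobs)
    # Build the index file mapping each parameter name to its shard file.
    weight_map = {k: f"model-{i:05d}.safetensors" for i, keys in results for k in keys}
    with open(f"{out_dir}/model.safetensors.index.json", "w") as f:
        json.dump({"weight_map": weight_map}, f, indent=2)
```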
After enabling parallel conversion, the generated HuggingFace checkpoint appears to be incorrect:
• Shapes/dtypes look mostly consistent, but some parameters’ values differ from the expected “known-good” HF safetensors (or the model produces incorrect outputs).
• The issue only happens with the parallel version; the original sequential conversion produces correct weights.
I suspect there may be a non-deterministic / non-process-safe step in the conversion pipeline (e.g., parameter collection / ordering, TP/PP gather logic, key-to-filename mapping, or write ordering).
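For reference, a minimal comparison script along these lines (directory names are placeholders) can pinpoint which tensors differ between the sequential and parallel exports:

```python
# Compare every tensor in the known-good (sequential) export against the
# parallel export and report shape/dtype/value mismatches.
from pathlib import Path

import torch
from safetensors import safe_open


def load_all(ckpt_dir):
    tensors = {}
    for shard in sorted(Path(ckpt_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)
    return tensors


def compare(good_dir, parallel_dir):
    good = load_all(good_dir)
    bad = load_all(parallel_dir)
    assert good.keys() == bad.keys(), "key sets differ between the two exports"
    for key in sorted(good.keys()):
        a, b = good[key], bad[key]
        if a.shape != b.shape or a.dtype != b.dtype:
            print(f"{key}: shape/dtype mismatch {a.shape}/{a.dtype} vs {b.shape}/{b.dtype}")
        elif not torch.equal(a, b):
            diff = (a.float() - b.float()).abs().max().item()
            print(f"{key}: values differ, max abs diff = {diff}")


if __name__ == "__main__":
    compare("hf_ckpt_sequential", "hf_ckpt_parallel")
```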