Description
Is your feature request related to a problem? Please describe.
I’m converting a Megatron checkpoint to HuggingFace safetensors using Megatron-Bridge.
To speed up conversion for a large model (e.g., an 80B model with TP=2, PP=4), I modified the conversion/export script to run in parallel (multi-process), splitting the output files and writing shards concurrently, roughly as sketched below.
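The parallel path looks approximately like the following sketch. It is heavily simplified: `write_shard`, the round-robin key-to-shard assignment, and the `torch.load` of an already-gathered state dict are placeholders I made up for this issue, standing in for the actual Megatron-Bridge export/collection calls.

```python
# Simplified sketch of the parallel writer; not the real Megatron-Bridge code.
import json
import multiprocessing as mp

import torch
from safetensors.torch import save_file


def write_shard(args):
    shard_idx, keys, state_dict_path, out_dir = args
    # Each worker reloads the (already gathered) state dict and writes only
    # the tensors assigned to its shard.
    state_dict = torch.load(state_dict_path, map_location="cpu")
    tensors = {k: state_dict[k].contiguous() for k in keys}
    save_file(tensors, f"{out_dir}/model-{shard_idx:05d}.safetensors")
    return shard_idx, keys


def parallel_export(state_dict_path, out_dir, num_shards=8, num_workers=4):
    state_dict = torch.load(state_dict_path, map_location="cpu")
    all_keys = sorted(state_dict.keys())
    # Round-robin key-to-shard assignment; this mapping (and the write
    # ordering below) is one of the places I suspect could diverge from the
    # sequential script.
    shards = [all_keys[i::num_shards] for i in range(num_shards)]
    jobs = [(i, keys, state_dict_path, out_dir) for i, keys in enumerate(shards)]
    with mp.Pool(num_workers) as pool:
        results = pool.map(write_shard, jobs)
    # Build the index file mapping each parameter name to its shard file.
    weight_map = {k: f"model-{i:05d}.safetensors" for i, keys in results for k in keys}
    with open(f"{out_dir}/model.safetensors.index.json", "w") as f:
        json.dump({"weight_map": weight_map}, f, indent=2)
```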
After enabling parallel conversion, the generated HuggingFace checkpoint appears to be incorrect:
• Shapes/dtypes look mostly consistent, but some parameters’ values differ from the expected “known-good” HF safetensors (or the model produces incorrect outputs).
• The issue only happens with the parallel version; the original sequential conversion produces correct weights.
I suspect there may be a non-deterministic / non-process-safe step in the conversion pipeline (e.g., parameter collection / ordering, TP/PP gather logic, key-to-filename mapping, or write ordering).
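For reference, a minimal comparison script along these lines (directory names are placeholders) can pinpoint which tensors differ between the sequential and parallel exports:

```python
# Compare every tensor in the known-good (sequential) export against the
# parallel export and report shape/dtype/value mismatches.
from pathlib import Path

import torch
from safetensors import safe_open


def load_all(ckpt_dir):
    tensors = {}
    for shard in sorted(Path(ckpt_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)
    return tensors


def compare(good_dir, parallel_dir):
    good = load_all(good_dir)
    bad = load_all(parallel_dir)
    assert good.keys() == bad.keys(), "key sets differ between the two exports"
    for key in sorted(good.keys()):
        a, b = good[key], bad[key]
        if a.shape != b.shape or a.dtype != b.dtype:
            print(f"{key}: shape/dtype mismatch {a.shape}/{a.dtype} vs {b.shape}/{b.dtype}")
        elif not torch.equal(a, b):
            diff = (a.float() - b.float()).abs().max().item()
            print(f"{key}: values differ, max abs diff = {diff}")


if __name__ == "__main__":
    compare("hf_ckpt_sequential", "hf_ckpt_parallel")
```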