Release v0.2.0 · MoonshotAI/checkpoint-engine

Feature

See #25. We sped up the P2P implementation so that it now runs at nearly the same speed as broadcast. We also bind each GPU to its corresponding NUMA node to keep host-to-device (H2D) transfer speeds stable, which shortens the update duration. The updated test results are below.

| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
| --- | --- | --- | --- | --- |
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
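
For readers unfamiliar with the NUMA binding mentioned above, here is a minimal sketch of one way a process could pin itself to the CPUs local to a GPU on Linux. It is not the checkpoint-engine implementation: the helper names (`parse_cpulist`, `numa_cpus_for_pci_device`, `bind_current_process`) are hypothetical, and it assumes the GPU's PCI address is already known in sysfs form (for example `0000:3b:00.0`). It only sets CPU affinity and relies on the kernel's default local memory allocation, rather than setting an explicit memory policy.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_cpus_for_pci_device(pci_bus_id: str, sysfs: Path = Path("/sys")) -> set[int]:
    """Return the CPUs local to a PCI device's NUMA node (empty set if unknown)."""
    # sysfs exposes the NUMA node of each PCI device in a `numa_node` file.
    node_file = sysfs / "bus/pci/devices" / pci_bus_id.lower() / "numa_node"
    node = int(node_file.read_text().strip())
    if node < 0:  # the kernel reports -1 when no NUMA affinity is known
        return set()
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


def bind_current_process(pci_bus_id: str) -> bool:
    """Pin the calling process to the CPUs near the given device (Linux only)."""
    cpus = numa_cpus_for_pci_device(pci_bus_id)
    if not cpus:
        return False  # leave affinity untouched when the topology is unknown
    os.sched_setaffinity(0, cpus)
    return True
```

A worker that drives one GPU would call `bind_current_process(...)` once at startup, before allocating the pinned host buffers used for H2D copies, so that those buffers tend to land on the GPU's local node.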

What's Changed

fix: use correct type for _current_global_parameter_metas by @weixiao-huang in #33
A more reasonable way to obtain RDMA devices by @specture724 in #36
optimize _update_per_bucket_p2p logic by @specture724 in #28

Full Changelog: v0.1.3...v0.2.0
