Releases: MoonshotAI/checkpoint-engine
v0.2.2
What's Changed
- fix: propagate remote exception traceback to parameter server by @SongXiaoXi in #59
- misc: update README with environment variable instructions and vLLM version specified by @specture724 in #61
- feat: reuse pin_memory when registering checkpoint by @specture724 in #56
- feat: inplace pin memory for safetensors in /dev/shm/ by @specture724 in #58
- feat: force unregister shared pin memory buffer supported by @specture724 in #62
- feat: docs for force unregister by @specture724 in #63
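The pin-memory reuse in #56 can be sketched, very loosely, as a size-keyed cache of pinned host staging buffers that survives across checkpoint updates instead of re-allocating (and re-pinning) them each time. This is an illustration of the idea only, not the project's actual implementation; the class and method names here are made up:

```python
from typing import Callable, Dict, Any


class PinnedBufferCache:
    """Reuse host staging buffers across checkpoint updates.

    Pinning (page-locking) host memory is expensive, so a buffer of a
    given size is allocated once and handed back on later requests.
    The allocator is injected; in practice it might be something like
    ``lambda n: torch.empty(n, dtype=torch.uint8, pin_memory=True)``.
    """

    def __init__(self, allocate: Callable[[int], Any]):
        self._allocate = allocate
        self._buffers: Dict[int, Any] = {}

    def get(self, size: int) -> Any:
        # Return a cached buffer of this size, allocating on first use.
        buf = self._buffers.get(size)
        if buf is None:
            buf = self._allocate(size)
            self._buffers[size] = buf
        return buf
```

With a plain `bytearray` allocator, two requests for the same size return the identical object, which is the whole point of the reuse.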
New Contributors
- @SongXiaoXi made their first contribution in #59
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- [Hardware] broadcast support for Huawei Ascend NPU by @kip-cxj in #39
- [Doc] add sglang usage document by @stmatengss in #45
- Fix ValueError handling and add device type check by @HubertZhang in #47
- Fix wrong device_type, refine documents in worker.py by @HubertZhang in #52
- Expose uds in UpdateRequest by @HubertZhang in #49
- fix: test_update.py failed because _get_physical_gpu_id doesn't get 'device_manager' argument by @specture724 in #53
- fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure by @specture724 in #54
- fix: force ps to quit when error occur during updating by @specture724 in #43
- [Hardware] p2p support for Huawei Ascend NPU by @kip-cxj in #46
- bugfix: reset global meta when gather meta by @HubertZhang in #57
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Feature
See #25. We sped up the P2P implementation, bringing it close to the speed of broadcast. We also bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds, which shortens the overall update duration. The test results are updated below.
| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
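The GPU-to-NUMA binding mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes CPUs are numbered contiguously per NUMA node (common, but not guaranteed), and the sysfs lookup and all function names here are illustrative:

```python
import os


def cpus_of_numa_node(node: int, cpus_per_node: int) -> set[int]:
    """CPU ids belonging to one NUMA node, assuming CPUs are numbered
    contiguously per node (a simplification; real topologies can differ)."""
    return set(range(node * cpus_per_node, (node + 1) * cpus_per_node))


def numa_node_of_pci_device(pci_addr: str) -> int:
    """Read a PCI device's NUMA node from sysfs (e.g. a GPU bus address
    like '0000:3b:00.0'). The kernel reports -1 when no affinity is known."""
    with open(f"/sys/bus/pci/devices/{pci_addr}/numa_node") as f:
        return int(f.read().strip())


def bind_process_to_gpu_node(pci_addr: str, cpus_per_node: int) -> None:
    """Pin the current process to the CPUs local to the GPU, so that
    host-to-device copies read from NUMA-local pinned host memory."""
    node = numa_node_of_pci_device(pci_addr)
    if node >= 0:
        os.sched_setaffinity(0, cpus_of_numa_node(node, cpus_per_node))
```

Keeping the staging buffers and the copying threads on the GPU's local NUMA node avoids cross-socket memory traffic, which is what makes H2D transfer speeds stable.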
What's Changed
- fix: use correct type for _current_global_parameter_metas by @weixiao-huang in #33
- A more reasonable way to obtain RDMA devices by @specture724 in #36
- optimize _update_per_bucket_p2p logic by @specture724 in #28
Full Changelog: v0.1.3...v0.2.0
v0.1.3
What's Changed
- fix register_files fastapi parameter parse error by @ruizhang1230 in #27
- fix destroy process group error when using p2p update by @ruizhang1230 in #30
- feat: support configurable gpu count and memory fraction by @zxpdemonio in #29
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- feat: use zmq_addr_counter to make zmq_handle non-repeat for each update by @weixiao-huang in #4
- feat: add pre-commit as lint config by @weixiao-huang in #5
- feat: add pre-commit CI workflow by @specture724 in #10
- feat: make `ParameterMeta` JSON serializable by @weixiao-huang in #9
- feat: rename `save_metas_file` -> `load_metas_file` in `join` method by @weixiao-huang in #11
- chore: set `mooncake-transfer-engine>=0.3.5` by @weixiao-huang in #13
- feat: support uds and use httpx instead of requests by @weixiao-huang in #18
- feat: add rank and world_size args in ParameterServer by @weixiao-huang in #20
- feat: use ibv_get_device_list to get rdma devices instead of getting from file by @weixiao-huang in #19
- feat: use torch.cuda.get_device_properties() to get device_uuid instead of nvidia-smi -L by @weixiao-huang in #21
- hotfix: use correct hca selector by @weixiao-huang in #22
- feat: add _TorchTensor type for pydantic type validator by @weixiao-huang in #24
Full Changelog: https://github.com/MoonshotAI/checkpoint-engine/commits/v0.1.2