Releases: MoonshotAI/checkpoint-engine
v0.2.2
What's Changed
- fix: propagate remote exception traceback to parameter server by @SongXiaoXi in #59
- misc: update README with environment variable instructions and vLLM version specified by @specture724 in #61
- feat: reuse pin_memory when registering checkpoint by @specture724 in #56
- feat: inplace pin memory for safetensors in /dev/shm/ by @specture724 in #58
- feat: force unregister shared pin memory buffer supported by @specture724 in #62
- feat: docs for force unregister by @specture724 in #63
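The pin-memory reuse in #56 can be sketched, very loosely, as a size-keyed cache of pinned host staging buffers that survives across checkpoint updates instead of re-allocating (and re-pinning) them each time. This is an illustration of the idea only, not the project's actual implementation; the class and method names here are made up:

```python
from typing import Callable, Dict, Any


class PinnedBufferCache:
    """Reuse host staging buffers across checkpoint updates.

    Pinning (page-locking) host memory is expensive, so a buffer of a
    given size is allocated once and handed back on later requests.
    The allocator is injected; in practice it might be something like
    ``lambda n: torch.empty(n, dtype=torch.uint8, pin_memory=True)``.
    """

    def __init__(self, allocate: Callable[[int], Any]):
        self._allocate = allocate
        self._buffers: Dict[int, Any] = {}

    def get(self, size: int) -> Any:
        # Return a cached buffer of this size, allocating on first use.
        buf = self._buffers.get(size)
        if buf is None:
            buf = self._allocate(size)
            self._buffers[size] = buf
        return buf
```

With a plain `bytearray` allocator, two requests for the same size return the identical object, which is the whole point of the reuse.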
New Contributors
- @SongXiaoXi made their first contribution in #59
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- [Hardware] broadcast support for Huawei Ascend NPU by @kip-cxj in #39
- [Doc] add sglang usage document by @stmatengss in #45
- Fix ValueError handling and add device type check by @HubertZhang in #47
- Fix wrong device_type, refine documents in worker.py by @HubertZhang in #52
- Expose uds in UpdateRequest by @HubertZhang in #49
- fix: test_update.py failed because _get_physical_gpu_id doesn't get 'device_manager' argument by @specture724 in #53
- fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure by @specture724 in #54
- fix: force ps to quit when error occur during updating by @specture724 in #43
- [Hardware] p2p support for Huawei Ascend NPU by @kip-cxj in #46
- bugfix: reset global meta when gather meta by @HubertZhang in #57
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Feature
See #25. We sped up the P2P implementation, bringing it close to the speed of broadcast. We also bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds, which shortens the overall update duration. The test results are updated below.
| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
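The GPU-to-NUMA binding mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes CPUs are numbered contiguously per NUMA node (common, but not guaranteed), and the sysfs lookup and all function names here are illustrative:

```python
import os


def cpus_of_numa_node(node: int, cpus_per_node: int) -> set[int]:
    """CPU ids belonging to one NUMA node, assuming CPUs are numbered
    contiguously per node (a simplification; real topologies can differ)."""
    return set(range(node * cpus_per_node, (node + 1) * cpus_per_node))


def numa_node_of_pci_device(pci_addr: str) -> int:
    """Read a PCI device's NUMA node from sysfs (e.g. a GPU bus address
    like '0000:3b:00.0'). The kernel reports -1 when no affinity is known."""
    with open(f"/sys/bus/pci/devices/{pci_addr}/numa_node") as f:
        return int(f.read().strip())


def bind_process_to_gpu_node(pci_addr: str, cpus_per_node: int) -> None:
    """Pin the current process to the CPUs local to the GPU, so that
    host-to-device copies read from NUMA-local pinned host memory."""
    node = numa_node_of_pci_device(pci_addr)
    if node >= 0:
        os.sched_setaffinity(0, cpus_of_numa_node(node, cpus_per_node))
```

Keeping the staging buffers and the copying threads on the GPU's local NUMA node avoids cross-socket memory traffic, which is what makes H2D transfer speeds stable.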
What's Changed
- fix: use correct type for _current_global_parameter_metas by @weixiao-huang in #33
- A more reasonable way to obtain RDMA devices by @specture724 in #36
- optimize _update_per_bucket_p2p logic by @specture724 in #28
Full Changelog: v0.1.3...v0.2.0
v0.1.3
What's Changed
- fix register_files fastapi parameter parse error by @ruizhang1230 in #27
- fix destroy process group error when using p2p update by @ruizhang1230 in #30
- feat: support configurable gpu count and memory fraction by @zxpdemonio in #29
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- feat: use zmq_addr_counter to make zmq_handle non-repeat for each update by @weixiao-huang in #4
- feat: add pre-commit as lint config by @weixiao-huang in #5
- feat: add pre-commit CI workflow by @specture724 in #10
- feat: make `ParameterMeta` JSON serializable by @weixiao-huang in #9
- feat: rename `save_metas_file` -> `load_metas_file` in `join` method by @weixiao-huang in #11
- chore: set `mooncake-transfer-engine>=0.3.5` by @weixiao-huang in #13
- feat: support uds and use httpx instead of requests by @weixiao-huang in #18
- feat: add rank and world_size args in ParameterServer by @weixiao-huang in #20
- feat: use ibv_get_device_list to get rdma devices instead of getting from file by @weixiao-huang in #19
- feat: use torch.cuda.get_device_properties() to get device_uuid instead of nvidia-smi -L by @weixiao-huang in #21
- hotfix: use correct hca selector by @weixiao-huang in #22
- feat: add _TorchTensor type for pydantic type validator by @weixiao-huang in #24
Full Changelog: https://github.com/MoonshotAI/checkpoint-engine/commits/v0.1.2