Releases: MoonshotAI/checkpoint-engine
v0.3.4
v0.3.3
What's Changed
- fix: npu free host cache by @kip-cxj in #78
- bugfix: skip empty safetensors file in inplace_pin_memory by @HubertZhang in #79
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- fix p2p update error when disable_h2d_buffer is true by @ruizhang1230 in #76
- fix: set current CUDA device in _inplace_pin_memory function by @SongXiaoXi in #77
Full Changelog: v0.3.1...v0.3.2
v0.3.1
v0.3.1-rc0
What's Changed
- Update use of environment variable in ps.py by @HubertZhang in #73
- misc: split ps.py file into multiple files by @specture724 in #64
- feat: cache device uuid in VllmWorkerExtension by @kip-cxj in #74
Full Changelog: v0.3.0-rc1...v0.3.1-rc0
v0.3.0
What's Changed
- feat: docs added for auto_pg, and auto_pg default set to True by @specture724 in #65
- hotfix: add a switch to disable inplace pinning of tensors by @specture724 in #68
- hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered by @specture724 in #69
- fix: CUDA OOM encountered with store based barrier by @specture724 in #70
Full Changelog: v0.3.0-rc0...v0.3.0-rc1
v0.2.3
v0.3.0-rc0
What's Changed
- fix: use tcp store_based_barrier to control p2p update synchronization by @specture724 in #51
Full Changelog: v0.2.2...v0.3.0-rc0
v0.2.2
What's Changed
- fix: propagate remote exception traceback to parameter server by @SongXiaoXi in #59
- misc: update README with environment variable instructions and vLLM version specified by @specture724 in #61
- feat: reuse pin_memory when registering checkpoint by @specture724 in #56
- feat: inplace pin memory for safetensors in /dev/shm/ by @specture724 in #58
- feat: force unregister shared pin memory buffer supported by @specture724 in #62
- feat: docs for force unregister by @specture724 in #63
New Contributors
- @SongXiaoXi made their first contribution in #59
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- [Hardware] broadcast support for Huawei Ascend NPU by @kip-cxj in #39
- [Doc] add sglang usage document by @stmatengss in #45
- Fix ValueError handling and add device type check by @HubertZhang in #47
- Fix wrong device_type, refine documents in worker.py by @HubertZhang in #52
- Expose uds in UpdateRequest by @HubertZhang in #49
- fix: test_update.py failed because _get_physical_gpu_id doesn't get 'device_manager' argument by @specture724 in #53
- fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure by @specture724 in #54
- fix: force ps to quit when error occur during updating by @specture724 in #43
- [Hardware] p2p support for Huawei Ascend NPU by @kip-cxj in #46
- bugfix: reset global meta when gather meta by @HubertZhang in #57
Full Changelog: v0.2.0...v0.2.1