Releases · MoonshotAI/checkpoint-engine · GitHub

28 Jan 12:34

HubertZhang

v0.3.4 Latest

feat: support mtp in vllm, update vllm's drafter model when update_we…

20 Jan 11:52

blahgeek

v0.3.3

What's Changed

fix: npu free host cache by @kip-cxj in #78
bugfix: skip empty safetensors file in inplace_pin_memory by @HubertZhang in #79

Full Changelog: v0.3.2...v0.3.3

Contributors

HubertZhang and kip-cxj

09 Jan 12:24

blahgeek

v0.3.2

What's Changed

fix p2p update error when disable_h2d_buffer is true by @ruizhang1230 in #76
fix: set current CUDA device in _inplace_pin_memory function by @SongXiaoXi in #77

Full Changelog: v0.3.1...v0.3.2

Contributors

SongXiaoXi and ruizhang1230

05 Jan 09:00

blahgeek

v0.3.1

Same as v0.3.1-rc0

04 Jan 07:52

blahgeek

v0.3.1-rc0 Pre-release

What's Changed

Update use of environment variable in ps.py by @HubertZhang in #73
misc: split ps.py file into multiple files by @specture724 in #64
feat: cache device uuid in VllmWorkerExtension by @kip-cxj in #74

Full Changelog: v0.3.0-rc1...v0.3.1-rc0

Contributors

HubertZhang, specture724, and kip-cxj

23 Dec 11:21

blahgeek

v0.3.0

What's Changed

feat: docs added for auto_pg, and auto_pg default set to True by @specture724 in #65
hotfix: add a switch to disable inplace pinning of tensors by @specture724 in #68
hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered by @specture724 in #69
fix: CUDA OOM encountered with store based barrier by @specture724 in #70

Full Changelog: v0.3.0-rc0...v0.3.0-rc1

Contributors

specture724

18 Dec 07:18

blahgeek

v0.2.3

Disables the "inplace pin memory" feature (added in v0.2.2) by default, as it may cause issues.

11 Dec 09:48

blahgeek

v0.3.0-rc0 Pre-release

What's Changed

fix: use tcp store_based_barrier to control p2p update synchronization by @specture724 in #51

Full Changelog: v0.2.2...v0.3.0-rc0

Contributors

specture724

11 Dec 09:33

blahgeek

v0.2.2

What's Changed

fix: propagate remote exception traceback to parameter server by @SongXiaoXi in #59
misc: update README with environment variable instructions and vLLM version specified by @specture724 in #61
feat: reuse pin_memory when registering checkpoint by @specture724 in #56
feat: inplace pin memory for safetensors in /dev/shm/ by @specture724 in #58
feat: force unregister shared pin memory buffer supported by @specture724 in #62
feat: docs for force unregister by @specture724 in #63

New Contributors

@SongXiaoXi made their first contribution in #59

Full Changelog: v0.2.1...v0.2.2

Contributors

SongXiaoXi and specture724

24 Nov 13:29

v0.2.1

What's Changed

[Hardware] broadcast support for Huawei Ascend NPU by @kip-cxj in #39
[Doc] add sglang usage document by @stmatengss in #45
Fix ValueError handling and add device type check by @HubertZhang in #47
Fix wrong device_type, refine documents in worker.py by @HubertZhang in #52
Expose uds in UpdateRequest by @HubertZhang in #49
fix: test_update.py failed because _get_physical_gpu_id doesn't get 'device_manager' argument by @specture724 in #53
fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure by @specture724 in #54
fix: force ps to quit when error occur during updating by @specture724 in #43
[Hardware] p2p support for Huawei Ascend NPU by @kip-cxj in #46
bugfix: reset global meta when gather meta by @HubertZhang in #57

Full Changelog: v0.2.0...v0.2.1

Contributors

HubertZhang, stmatengss, and 2 other contributors
