Releases: MoonshotAI/checkpoint-engine

v0.3.4

28 Jan 12:34
15446dd

feat: support mtp in vllm, update vllm's drafter model when update_we…

v0.3.3

20 Jan 11:52
f6910d6

What's Changed

Full Changelog: v0.3.2...v0.3.3

v0.3.2

09 Jan 12:24
4a73109

What's Changed

  • fix p2p update error when disable_h2d_buffer is true by @ruizhang1230 in #76
  • fix: set current CUDA device in _inplace_pin_memory function by @SongXiaoXi in #77

Full Changelog: v0.3.1...v0.3.2

v0.3.1

05 Jan 09:00

Same as v0.3.1-rc0

v0.3.1-rc0

04 Jan 07:52

Pre-release

What's Changed

Full Changelog: v0.3.0-rc1...v0.3.1-rc0

v0.3.0

23 Dec 11:21
88370e2

What's Changed

  • feat: docs added for auto_pg, and auto_pg default set to True by @specture724 in #65
  • hotfix: add a switch to disable inplace pinning of tensors by @specture724 in #68
  • hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered by @specture724 in #69
  • fix: CUDA OOM encountered with store based barrier by @specture724 in #70

Full Changelog: v0.3.0-rc0...v0.3.0-rc1

v0.2.3

18 Dec 07:18

  • Disable the "inplace pin memory" feature introduced in 0.2.2 by default, as it may cause issues

v0.3.0-rc0

11 Dec 09:48
baf6f61

Pre-release

What's Changed

  • fix: use tcp store_based_barrier to control p2p update synchronization by @specture724 in #51

Full Changelog: v0.2.2...v0.3.0-rc0

v0.2.2

11 Dec 09:33
089d185

What's Changed

  • fix: propagate remote exception traceback to parameter server by @SongXiaoXi in #59
  • misc: update README with environment variable instructions and vLLM version specified by @specture724 in #61
  • feat: reuse pin_memory when registering checkpoint by @specture724 in #56
  • feat: inplace pin memory for safetensors in /dev/shm/ by @specture724 in #58
  • feat: force unregister shared pin memory buffer supported by @specture724 in #62
  • feat: docs for force unregister by @specture724 in #63

Full Changelog: v0.2.1...v0.2.2

v0.2.1

24 Nov 13:29
279a908

What's Changed

  • [Hardware] broadcast support for Huawei Ascend NPU by @kip-cxj in #39
  • [Doc] add sglang usage document by @stmatengss in #45
  • Fix ValueError handling and add device type check by @HubertZhang in #47
  • Fix wrong device_type, refine documents in worker.py by @HubertZhang in #52
  • Expose uds in UpdateRequest by @HubertZhang in #49
  • fix: test_update.py failed because _get_physical_gpu_id doesn't get 'device_manager' argument by @specture724 in #53
  • fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure by @specture724 in #54
  • fix: force ps to quit when error occur during updating by @specture724 in #43
  • [Hardware] p2p support for Huawei Ascend NPU by @kip-cxj in #46
  • bugfix: reset global meta when gather meta by @HubertZhang in #57

Full Changelog: v0.2.0...v0.2.1