Highlights
- Refinement of PipelineStore Architecture and Enhancement of Core Capabilities #653 #711
- Added 3FS as a scalable and efficient storage backend #622
- Features the new GSAOnDevice sparse attention algorithm, enabling high-performance HBM utilization across both CUDA and Ascend platforms. #647 #638
- Aligned CacheBlend with the new UCM storage and sparse engine updates to support vLLM 0.9.2. #664
Known Issues
- Layerwise is not supported when using vLLM 0.11.0.
- Currently, installing with `pip install uc-manager` does not support using vLLM 0.11.0.
  - If you need to use vLLM 0.11.0+ with UCM layerwise, please refer to vllm-project/vllm#26675 for modifications.
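
The known issue above can be checked before launching a deployment. The following is a minimal sketch, not part of UCM: the helper names (`parse_version`, `layerwise_needs_patch`, `installed_vllm_version`) are hypothetical, and treating every vLLM version at or above 0.11.0 as affected is an assumption based on the "0.11.0+" wording in the note.

```python
from importlib import metadata

# Assumption: vLLM 0.11.0 and later are affected, per the "0.11.0+" wording above.
KNOWN_ISSUE_VERSION = (0, 11, 0)


def parse_version(text):
    """Parse the leading numeric X.Y.Z part of a version string into a 3-tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "+cu128" or "rc1"
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions like "0.10"
    return tuple(parts)


def layerwise_needs_patch(vllm_version):
    """Return True if this vLLM version falls under the layerwise known issue."""
    return parse_version(vllm_version) >= KNOWN_ISSUE_VERSION


def installed_vllm_version():
    """Return the installed vLLM version string, or None if vLLM is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_vllm_version()
    if version is None:
        print("vLLM is not installed")
    elif layerwise_needs_patch(version):
        print(f"vLLM {version}: layerwise needs the changes in vllm-project/vllm#26675")
    else:
        print(f"vLLM {version}: not affected by the layerwise known issue")
```

The check only compares the numeric release segment, so local or pre-release suffixes are ignored; a stricter comparison would need a full version parser.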
What's Changed
- [bugfix]cherry-pick from 0.2.0release Fix KeyError by @qyh111 in #573
- [bugfix] cherry-pick from 0.2.0release patch update by @wangwenxin0312 in #574
- [fix]cherry pick from 0.2.0-release fix monitor issue (#572) by @qyh111 in #575
- [bugfix] build hamming dist by @wangwenxin0312 in #577
- [feat]Update data file layout to adapt to garbage collection by @qyh111 in #579
- [bugfix]cherry pick from 0.2.0-release sparse patch & cmake by @wangwenxin0312 in #581
- [bugfix] kvcomp config by @wangwenxin0312 in #584
- [feat] KvCompOnDevice: per-KV-head Top-K for Qwen by @wangwenxin0312 in #589
- feature for triton rerope by @xinSky00 in #497
- [bugfix] kvcomp for qwen by @wangwenxin0312 in #594
- [bugfix] share buffer used out (cherry-picked from #592) by @mag1c-h in #598
- [fix]cherry-pick clean code and set local_rank_size to tp_size (#596) by @qyh111 in #600
- [misc] split dependency preparation logic into individual dependency files for enhanced configuration flexibility by @mag1c-h in #597
- [fix]fix clean code (#601) by @qyh111 in #602
- Modify blend and rerope docs by @xinSky00 in #593
- [docs] Modify blend introduction by @wuhuxiao in #605
- add qiongwu as codeownner by @Infinite666 in #610
- KVComp in NPU -- HBM version by @leideng in #599
- [bugfix] bugfix in PCStore, cherry-pick from release by @mag1c-h in #609
- [docs]Add doc for pipeline store by @qyh111 in #607
- [fix] remove request_succeed_dumped_blocks() in monkey patch by @xinSky00 in #613
- [fix]Sync changes from the release branch to develop. including docs、version and dockerfile by @qyh111 in #621
- [feat] Cherry-pick updates from 0.2.0-release to develop (patches and docs) by @wangwenxin0312 in #623
- [bugfix] ] Cherry-pick updates from 0.2.0-release (hamming compile) by @wangwenxin0312 in #625
- [doc]rename pipline_store to pipeline_store by @qyh111 in #626
- [bugfix] fix register_kv_caches patch by @Clarence-1103 in #629
- Unify xSA name as GSA by @leideng in #631
- [Feature] 3FS Store by @UESTC-AHao in #622
- [optimize]Optimized LLMPerf Test Cases by @Potterluo in #634
- [Doc] 3FS Document by @UESTC-AHao in #637
- [Feat] Basic scripts for deployment best practices by @sumingZero in #556
- [feature]Add LLM connection base components and OpenAI connector by @Potterluo in #636
- [Bugfix] Fix 3FS by @UESTC-AHao in #650
- [feat] PipelineStore Architecture Refresh and Capability Enhancement by @mag1c-h in #653
- [doc] Add contributing guide by @yuanzhg078 in #648
- [doc]Implement the function of a kv cache calculator html in User Guide by @Potterluo in #652
- [Opt] New gsa config by @leideng in #646
- [Feat] Support C++/Python to use same metrics singleton within a process by @flesher0813 in #654
- [feat]Add Layerwise Connector by @qyh111 in #656
- [Fix] Modify ucm_connector to adapt metrics by @flesher0813 in #658
- [doc] Update quickstart section in README_zh by @yuanzhg078 in #663
- [Feat] Update sparse method patches for vllm 0.11.0 by @AooooooA-C in #638
- [CI] add pr gate workflow by @dante159753 in #662
- [Opt] Gsa npu performance optimize by @leideng in #647
- [misc] Reduce gpu utilization to 6GB in test for 1.5B model by @dante159753 in #665
- [feat] add monkey patch for gsa on device v0.9.2 by @Clarence-1103 in #618
- [Fix] coredump if add new c++ metrics by @flesher0813 in #666
- [opt] adapt cache blend for store and sparse's new version by @wuhuxiao in #664
- [Doc] Update documents related to sparse. by @AooooooA-C in #672
- [CI] use requirements file to prepare test env by @dante159753 in #673
- [test]Evaluate model performance and accuracy with UCM by @ayaka836 in #642
- [Fix] Failed to start vLLM service using multi-node launch scripts under CUDA data parallelism by @sumingZero in #670
- [CI] remove logger, check branch up-to-date, fast fail e2e test by @dante159753 in #674
- release 0.3.0 by @flesher0813 in #677
- [bugfix] Fix compilation error due to missing atomic include by @harrisonyhq in #693
- [Bugfix] Modify worker_id set to separate different worker by @flesher0813 in #691
- [bugfix] rm unavailable lib and fix doc and update patch by @wuhuxiao in #700
- [perf] Reduce directory lock conflicts during batch dumps in PosixStore by @mag1c-h in #711
New Contributors
- @Infinite666 made their first contribution in #610
- @dante159753 made their first contribution in #662
- @ayaka836 made their first contribution in #642
Full Changelog: v0.2.0...v0.3.0