Releases: ModelEngine-Group/unified-cache-management

v0.2.0

05 Jan 12:28
39d46c7

Highlights

  • Support model window extrapolation with Rectified Rotary Position Embeddings (ReRoPE). (#497)
  • Support sparse attention algorithms in HBM on both CUDA GPUs and Ascend NPUs. Attention is sparsified by hashing KV states and selecting the Top-K entries by Hamming distance. (#559, #599)
  • Add Pipeline Store, composed of Cache Store and POSIX Store. (#553)
  • Improve KV cache transfer performance for NfsStore. (#393)
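
The hashing and Hamming-distance Top-K selection mentioned above can be illustrated with a small sketch. This is not UCM's implementation: the random-hyperplane sign hash, the function names (sign_hash, hamming, topk_by_hamming), and the sizes are all assumptions chosen for illustration only.

```python
import random

def sign_hash(vec, planes):
    """Pack the sign bits of random projections into one integer code.

    Random-hyperplane hashing is an assumption here; the release notes
    do not say which hash function is used.
    """
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def topk_by_hamming(query_code, key_codes, k):
    """Indices of the k keys whose codes are closest to the query code."""
    order = sorted(
        range(len(key_codes)),
        key=lambda i: (hamming(query_code, key_codes[i]), i),
    )
    return order[:k]

# Hypothetical sizes, for illustration only.
rng = random.Random(0)
dim, n_bits, n_keys = 8, 16, 32
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
key_codes = [sign_hash(key, planes) for key in keys]

# Reuse a cached key as the query so at least one exact code match exists.
query_code = sign_hash(keys[5], planes)
selected = topk_by_hamming(query_code, key_codes, k=4)
distances = [hamming(query_code, key_codes[i]) for i in selected]
```

Only the keys listed in selected would take part in attention. Comparing short bit codes is much cheaper than computing full dot products against every cached key, which is the general motivation for this family of methods.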

Known Issues

  • Sparse is not supported when installing via pip.
    • Installing with pip install uc-manager currently does not include Sparse support.
    • Before installing via pip, set the platform explicitly:
      export PLATFORM=xxx
    • To use Sparse, install via the Docker image or build from source.

Full Changelog: v0.1.2...v0.2.0

v0.2.0rc1

13 Dec 13:43
bad9354

Pre-release

Highlights

  • Improved Prefix Cache offload/load performance.
  • Support Cache Blend.

Core:

  • Support Cache Blend (#467)
  • Add V1 Store Interface (#510, #518)

Known Issues

  • When using the Ascend platform:
    • Broadcasting is not supported.
    • load_only_first_rank must be set to false in the configuration.
  • When compiling from source, make sure to set the PLATFORM environment variable.

Full Changelog: v0.1.2...v0.2.0rc1

v0.1.2

10 Dec 07:56
aa31619

Some small fixes in this release.

  • [Docs] Documents are now easier to read.
  • [Docs] PD disaggregation documentation update: the --enforce-eager argument is no longer passed when starting the vllm service, so graph mode is enabled by default at startup.
  • [Feat] UCconnector has been completely removed; use UCMConnector from now on.
  • [Feat] UCM supports recovery from load failure: the get_block_ids_with_load_errors interface of the KVConnectorBase_V1 class is now implemented, enabling vLLM to re-execute inference for requests whose KV cache failed to load from UCM.
  • [Build] pip install uc-manager==0.1.2 now builds from source for both vllm and vllm-ascend.
  • [Build] The Sparse module is now built and used only if the environment variable is set: export ENABLE_SPARSE=TRUE.
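
The load-failure recovery item above can be pictured with a toy connector. This is a sketch under stated assumptions, not UCM's or vLLM's code: ToyConnector, load_blocks, and the dict-backed store are hypothetical, and only the method name get_block_ids_with_load_errors comes from the release notes.

```python
class ToyConnector:
    """Hypothetical stand-in for a KV connector; not the real UCMConnector."""

    def __init__(self, store):
        self._store = store    # assumed: maps block id -> cached KV bytes
        self._failed = set()   # block ids whose load failed

    def load_blocks(self, block_ids):
        """Load what is available and remember which blocks failed."""
        loaded = {}
        for block_id in block_ids:
            try:
                loaded[block_id] = self._store[block_id]
            except KeyError:
                self._failed.add(block_id)
        return loaded

    def get_block_ids_with_load_errors(self):
        """Report failed block ids once, then reset the record."""
        failed, self._failed = self._failed, set()
        return failed


connector = ToyConnector({1: b"kv-1", 2: b"kv-2"})
loaded = connector.load_blocks([1, 2, 3])
errors = connector.get_block_ids_with_load_errors()
errors_after_reset = connector.get_block_ids_with_load_errors()
```

A scheduler that receives a non-empty set can recompute the affected requests instead of serving them from a missing or corrupt cache.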

Full Changelog: v0.1.0...v0.1.2

v0.1.0

02 Dec 08:42
5ba2684

We are excited to announce the first official release of Unified Cache Manager.

Highlights

  • Offload Prefix Cache to storage.
  • Homogeneous/heterogeneous PD disaggregation.
  • Training-free sparsity for accelerating inference (vllm==0.9.2, vllm-ascend==0.9.2rc1) in #199, #335, #190, #451

Core:

  • Garbage collection for store in #315 and #312
  • Adapt to vllm and vllm-ascend in #13, #292, #415 and #362
  • UCM supports online metrics display via Grafana and Prometheus in #414, with docs in #416

Known Issues

If using the Ascend platform, please be aware of the following:

  • Broadcast is not supported.
  • Set load_only_first_rank: false in the config.

Others

  • Update documents
  • Tools for performance tuning and hyperparameter optimization in #418

Full Changelog: v0.1.0rc4...v0.1.0

v0.1.0rc4

22 Nov 10:16
5779ce9

Pre-release

Full Changelog: v0.1.0rc2...v0.1.0rc4

v0.1.0rc2

19 Nov 08:01
16ed5da

Pre-release

Full Changelog: v0.1.0rc1...v0.1.0rc2

v0.1.0rc1

17 Nov 12:21
754f7ba

Pre-release

Supported Features

  • Prefix Cache
  • Sparse Attention
  • Sparse Attention Offload
  • PD Disaggregation
