
[Ecosystem] LMCache #44

@nijaba

Description


Contact emails

@[email protected], Yihua Cheng
@[email protected], Jiayi Yao
@[email protected], Kuntai Du
@[email protected], Nick Barcet

Project summary

An LLM serving engine extension that reduces TTFT and increases throughput.

Project description

LMCache is an LLM serving engine extension that combines with vLLM or SGLang to reduce TTFT and increase throughput, especially in long-context scenarios. By storing the KV caches of reusable texts across various locations (GPU, CPU DRAM, local disk, databases), LMCache can reuse the KV cache of any previously seen text (not necessarily a prefix) in any serving engine instance. This saves precious GPU cycles and reduces user response delay.

By combining LMCache with vLLM, developers can achieve 3-10x reductions in delay and GPU cycles in many LLM use cases, including multi-round QA and RAG.
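As a rough illustration, the sketch below shows how LMCache can be plugged into vLLM as a KV-cache connector. It assumes vLLM's `KVTransferConfig` API and the `LMCacheConnectorV1` connector name described in the LMCache documentation; exact names and parameters may differ across versions.

```python
# Minimal sketch: routing vLLM's KV cache through LMCache.
# Assumes vLLM's KVTransferConfig API and the "LMCacheConnectorV1"
# connector name from the LMCache docs; versions may differ.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # hand KV blocks to LMCache
        kv_role="kv_both",                  # both store and retrieve KV caches
    ),
)

# Prompts sharing a long context (multi-round QA, RAG) hit the cache,
# so prefill for the reused text is skipped and TTFT drops.
shared_context = "..."  # long document or conversation history
outputs = llm.generate(
    [shared_context + " Question 1", shared_context + " Question 2"],
    SamplingParams(max_tokens=64),
)
```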

Are there any other projects in the PyTorch Ecosystem similar to yours? If yes, what are they?

No

Project repo URL

https://github.com/LMCache

Additional repos in scope of the application

None

Project license

Apache 2.0

GitHub handles of the project maintainer(s)

YaoJiayi, ApostaC, KuntaiDu, Shaoting-Feng, maobaolong, sammshen, hickeyma, HuaizhengZhang

Is there a corporate or academic entity backing this project? If so, please provide the name and URL of the entity.

TensorMesh (https://tensormesh.ai/) and the University of Chicago (https://cs.uchicago.edu)

Website URL

https://lmcache.ai/

Documentation

https://docs.lmcache.ai/

How do you build and test the project today (continuous integration)? Please describe.

We are using the following to run our CI:

  • Two L4 GPU servers on GCP
  • Default GitHub runners

Our CI pipeline ensures:

  • Code quality checks, triggered on each commit to a PR, hosted on GitHub runners
  • Unit tests via Buildkite, triggered on each commit to a PR, running on the L4 GPU servers
  • End-to-end correctness and performance tests, triggered once per PR, running on the L4 GPU servers
  • Docker images and pip packages, built after each PR is merged, using GitHub Actions on default GitHub runners

We follow a two-week release cadence.

Version of PyTorch

  • We maintain compatibility with upstream vLLM through a connector API we contributed; the current supported vLLM version is 0.10.0.
  • We support PyTorch 2.2.0-2.8.0 (latest). The current release is built with PyTorch 2.7.1 (which is also the version required by vLLM).

Components of PyTorch

  • PyTorch CUDA/C++/ROCm extensions
  • Standard PyTorch Python APIs (tensor-related APIs); see the sketch below
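As a rough illustration of the tensor-level PyTorch APIs involved (not LMCache's actual internals), a hypothetical KV block can be offloaded to pinned CPU memory and later restored to the GPU:

```python
# Illustrative sketch only: standard PyTorch tensor APIs of the kind used
# for KV-cache offloading; not LMCache's real implementation.
import torch

# Hypothetical KV block layout: [2 (K/V), num_heads, block_size, head_dim]
kv_block = torch.randn(2, 8, 16, 128, device="cuda", dtype=torch.float16)

# Stage the block in pinned host memory so the device-to-host copy can
# overlap with other GPU work.
cpu_buffer = torch.empty(
    kv_block.shape, dtype=kv_block.dtype, device="cpu", pin_memory=True
)
cpu_buffer.copy_(kv_block, non_blocking=True)
torch.cuda.synchronize()  # ensure the copy has finished before reuse

# Later, bring the cached block back onto the GPU to skip recomputation.
restored = cpu_buffer.to("cuda", non_blocking=True)
```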

How long do you expect to maintain the project?

We are fully committed to maintaining this project for the foreseeable future.

Additional information

LMCache is the result of research conducted at the University of Chicago by a team of researchers under the supervision of assistant professor Junchen Jiang.

Reference papers:

The project currently has 99 contributors and 8 maintainers, who come from diverse organizations including IBM, Tencent, ByteDance, and others.

If the project is accepted as an ecosystem project, we will then propose it as a hosted project.
