Contact emails
@[email protected], Yihua Cheng
@[email protected], Jiayi Yao
@[email protected], Kuntai Du
@[email protected], Nick Barcet
Project summary
LLM serving engine extension that reduces TTFT and increases throughput
Project description
LMCache is an LLM serving engine extension that integrates with vLLM or SGLang to reduce TTFT and increase throughput, especially in long-context scenarios. By storing the KV caches of reusable text across multiple locations (GPU, CPU DRAM, local disk, and databases), LMCache reuses the KV cache of any repeated text, not necessarily a prefix, in any serving engine instance. Thus, LMCache saves precious GPU cycles and reduces user response delay.
By combining LMCache with vLLM, developers achieve 3-10x savings in response delay and GPU cycle consumption in many LLM use cases, including multi-round QA and RAG.
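For illustration only, here is a minimal sketch of how LMCache can be enabled in vLLM through the KV-connector interface. The connector name (`LMCacheConnectorV1`), the `LMCACHE_*` environment variables, and the model name are assumptions drawn from the public documentation and may differ between versions:

```python
# Illustrative sketch only -- assumes the vLLM v1 KV-connector integration
# described in the LMCache docs; connector name and env vars may vary by version.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumed LMCache settings: cache 256-token chunks and spill them to
# up to 5 GB of CPU DRAM when GPU memory runs out.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5"

# Route vLLM's KV-cache loads and stores through the LMCache connector.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

# Requests that share long context (multi-round QA, RAG chunks) can now hit
# the cache hierarchy instead of recomputing the prefill on the GPU.
prompts = [
    "<long shared context> ... question 1",
    "<long shared context> ... question 2",
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```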
Are there any other projects in the PyTorch Ecosystem similar to yours? If, yes, what are they?
No
Project repo URL
Additional repos in scope of the application
None
Project license
Apache 2.0
GitHub handles of the project maintainer(s)
YaoJiayi, ApostaC, KuntaiDu, Shaoting-Feng, maobaolong, sammshen, hickeyma, HuaizhengZhang
Is there a corporate or academic entity backing this project? If so, please provide the name and URL of the entity.
TensorMesh.ai (https://tensormesh.ai/) and the University of Chicago (https://cs.uchicago.edu)
Website URL
Documentation
How do you build and test the project today (continuous integration)? Please describe.
We are using the following to run our CI:
- 2 L4 GPU servers on GCP
- Default GitHub runners
Our CI pipeline ensures:
- Code quality checks, triggered on each commit to a PR, hosted on GitHub runners
- Unit tests via Buildkite, triggered on each commit to a PR, running on the L4 GPU servers
- End-to-end correctness and performance tests, triggered once per PR, running on the L4 GPU servers
- Docker images and pip packages, built after each PR is merged, using GitHub Actions on default GitHub runners
We have a two-week release cadence.
Version of PyTorch
- We maintain compatibility with upstream vLLM through a connector API that we contribute. Current vLLM version: 0.10.0
- We support PyTorch 2.2.0 through 2.8.0 (latest). The current release is built with PyTorch 2.7.1, which is also the version required by vLLM.
Components of PyTorch
- PyTorch CUDA/C++/ROCm extensions
- Standard PyTorch Python APIs (tensor-related APIs); see the illustrative sketch below
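As a purely illustrative example of the standard tensor APIs involved (not LMCache's actual implementation), the hypothetical sketch below offloads a KV-cache block from GPU memory to pinned CPU DRAM and later restores it:

```python
# Conceptual sketch using only standard PyTorch APIs; this is NOT LMCache's
# actual code, just an illustration of GPU <-> CPU KV-cache block movement.
# Requires a CUDA-capable GPU.
import torch

num_layers, num_heads, block_tokens, head_dim = 32, 8, 256, 128

# A hypothetical KV-cache block on the GPU: [K/V, layers, heads, tokens, head_dim].
kv_block_gpu = torch.randn(
    2, num_layers, num_heads, block_tokens, head_dim,
    dtype=torch.float16, device="cuda",
)

# Pinned CPU buffer enables fast, asynchronous DMA transfers.
kv_block_cpu = torch.empty(
    kv_block_gpu.shape, dtype=kv_block_gpu.dtype, pin_memory=True,
)

copy_stream = torch.cuda.Stream()
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    # Offload: GPU -> CPU DRAM without blocking the compute stream.
    kv_block_cpu.copy_(kv_block_gpu, non_blocking=True)
copy_stream.synchronize()

# Later, when the same text chunk is requested again, restore it to the GPU.
restored = kv_block_cpu.to("cuda", non_blocking=True)
torch.cuda.current_stream().synchronize()
assert torch.equal(restored, kv_block_gpu)
```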
How long do you expect to maintain the project?
We are fully committed to maintaining this project for the foreseeable future.
Additional information
LMCache is the result of research conducted at the University of Chicago by a team of researchers under the supervision of Assistant Professor Junchen Jiang.
Reference papers:
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion
- Do Large Language Models Need a Content Delivery Network?
The project currently has 99 contributors and 8 maintainers from diverse organizations, including IBM, Tencent, and ByteDance.
If the project is accepted as an ecosystem project, we will then propose it as a hosted project.