diff --git a/_posts/2025-10-29-cohere-coreweave-lmcache b/_posts/2025-10-29-cohere-coreweave-lmcache
new file mode 100644
index 0000000..8fb9477
--- /dev/null
+++ b/_posts/2025-10-29-cohere-coreweave-lmcache
@@ -0,0 +1,118 @@
+---
+layout: post
+title: "Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere"
+thumbnail-img: /assets/img/async.png
+share-img: /assets/img/async.png
+author: Walter Beller-Morales (Cohere), Samuel Shen (Tensormesh), Kishor Aher (CoreWeave)
+image: /assets/img/async.png
+---
+
+# **Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere**
+
+By Walter Beller-Morales (Cohere), Samuel Shen (Tensormesh), Kishor Aher (CoreWeave)
+
+### **The challenge: Scaling enterprise AI**
+
+Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their own specific data, while making sure that this information remains private. [**Cohere**](https://cohere.com), one of the leading enterprise AI companies, built its North platform to help organizations use their own internal data safely and effectively to power retrieval-augmented generation (RAG). North allows enterprises to ground model outputs in trusted, private knowledge bases, delivering accurate, contextual responses tailored to their business.
+
+With RAG, each request is prefixed with the relevant contextual data so that the model can give grounded, relevant answers. This introduces a computational hurdle: the retrieved context does not modify the model’s weights, it is only held in the KV cache, a temporary memory that is typically discarded once the query is processed, so the same large blocks of data must be re-processed every time a query arrives. The richer the context an LLM is given, the more **tokens** it must process, and every token processed adds entries to a growing [**Key and Value tensors**](https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960) **(KV) cache** that stores intermediate model states. This cache is essential for generating coherent responses, but it grows rapidly with input length, consuming vast amounts of GPU or CPU memory. This is not specific to RAG: any additional prompt content (such as tool call arguments, code, or long instructions) also increases compute cost because it must be re-encoded on every request, but RAG is the use case we focus on in this blog.
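+
+To get a sense of the scale involved, here is a rough back-of-the-envelope sketch. The model dimensions are illustrative placeholders rather than Command A’s actual configuration; the point is the linear growth of the cache with context length:
+
+```python
+# Back-of-the-envelope KV cache size for a single sequence.
+# NOTE: num_layers / num_kv_heads / head_dim are illustrative placeholders,
+# not the real configuration of any particular model.
+
+def kv_cache_bytes(seq_len: int,
+                   num_layers: int = 64,
+                   num_kv_heads: int = 8,
+                   head_dim: int = 128,
+                   bytes_per_value: int = 2) -> int:
+    # Each token stores one key and one value vector per layer,
+    # so the cache grows linearly with context length.
+    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
+    return seq_len * per_token
+
+for ctx in (4_000, 32_000, 128_000):
+    print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB of KV cache")
+```
+
+Even with grouped-query attention keeping the per-token footprint modest, a single long-context sequence can reach tens of gigabytes, and a production service holds many such sequences at once.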
+
+At scale, this creates a performance and cost bottleneck for inference, even for efficient inference engines like vLLM. Cohere’s engineering team set out to solve this problem by exploring whether **KV caches could be stored remotely**, freeing up local memory without slowing down inference.
+
+That’s where **LMCache** and **CoreWeave AI Object Storage** come in. Together, they enable high-performance **remote KV caching**, allowing large models to handle long contexts with less memory pressure and better throughput.
+
+In this blog, we’ll examine recent Cohere benchmark tests that demonstrate how these technologies come together to **power some of the world’s most complex AI workloads**, achieving remarkable efficiency and scalability in real-world inference.
+
+### **Remote KV caching**
+
+To address the growing memory demands of large-scale inference, **LMCache** reimagines how language models manage and store their context. At the heart of every transformer-decoder based LLM lies the **Key and Value tensors (KV) cache**, the hidden state data that must be preserved across tokens so the model can maintain coherence in its output. As input length increases, this cache grows rapidly: every token produces additional KV pairs that accumulate over the course of the sequence. The result is a steep rise in memory usage that quickly becomes a bottleneck, even for advanced inference platforms.
+
+LMCache solves this by implementing a remote KV cache architecture. Instead of keeping all cache data in GPU or CPU memory, LMCache serializes and stores it externally, retrieving it only when needed. This approach significantly reduces memory pressure on inference hardware, allowing for longer contexts, more simultaneous sessions, and better resource utilization, without sacrificing model performance, and in many cases even improving it.
+
+To make remote caching viable, the underlying storage must deliver high throughput without adding prohibitive latency. **CoreWeave AI Object Storage** provides exactly that foundation. Designed for AI workloads, CoreWeave AI Object Storage delivers multi-gigabyte-per-second bandwidth and resilient scalability across GPU clusters. It ensures that KV data can be offloaded, persisted, and fetched back at the speed modern inference demands.
+
+Together, LMCache and CoreWeave AI Object Storage form a tightly integrated system: LMCache handles cache serialization and coordination, while CoreWeave AI Object Storage provides the distributed performance backbone that makes external caching seamless. The result is a new, more flexible model of inference, one where context can grow without constraint and infrastructure can scale intelligently.
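+
+To make the integration more concrete, the sketch below shows one way to wire vLLM to LMCache and point it at a remote cache tier. It is a minimal illustration that assumes LMCache’s vLLM connector and environment-variable configuration; exact option names vary across LMCache and vLLM versions, the model and remote URL are placeholders, and this is not Cohere’s production setup.
+
+```python
+# Minimal sketch: serving with vLLM + LMCache and a remote KV cache tier.
+# Option names follow LMCache's documented environment-variable configuration
+# and its vLLM connector, but they differ between versions; the model and the
+# remote URL below are placeholders, NOT Cohere's production configuration.
+import os
+
+os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per serialized KV chunk
+os.environ["LMCACHE_LOCAL_CPU"] = "True"          # keep a small local CPU cache tier
+os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"   # GiB of CPU RAM for that tier
+os.environ["LMCACHE_REMOTE_SERDE"] = "naive"      # serialization format for KV chunks
+# Remote backend URL; the scheme depends on which backend you configure.
+# In the benchmarks below, the remote tier was CoreWeave AI Object Storage.
+os.environ["LMCACHE_REMOTE_URL"] = "lm://cache-server:65432"
+
+from vllm import LLM, SamplingParams
+from vllm.config import KVTransferConfig
+
+llm = LLM(
+    model="Qwen/Qwen2.5-7B-Instruct",  # stand-in model; Cohere's tests used Command A
+    kv_transfer_config=KVTransferConfig(
+        kv_connector="LMCacheConnectorV1",  # route KV blocks through LMCache
+        kv_role="kv_both",                  # this instance both stores and loads KV
+    ),
+)
+
+retrieved_context = "<long retrieved document text>"  # placeholder RAG context
+prompt = retrieved_context + "\n\nQuestion: What are the contract's renewal terms?"
+out = llm.generate([prompt], SamplingParams(max_tokens=128))
+print(out[0].outputs[0].text)
+```
+
+With a setup along these lines, a replica that has already processed a given document prefix can publish its KV chunks to the shared remote tier, and other replicas can fetch them instead of redoing the prefill.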
+
+### **Benchmark testing**
+
+To evaluate how well remote caching performs under real-world workloads, the LMCache team partnered with **Cohere** to benchmark its integration on the North platform, using CoreWeave AI Object Storage as the storage backend. The goal was to determine whether inference could remain fast and efficient even when the KV cache lives outside of GPU or CPU memory.
+
+The tests were conducted on [Cohere’s Command A model](https://cohere.com/blog/command-a), running on CoreWeave’s GPU infrastructure with the vLLM inference engine. Three configurations were compared:
+
+1. **Baseline:** Full prefill, with no KV cache reuse.
+
+2. **LMCache + CoreWeave AI Object Storage:** KV data serialized and stored on CoreWeave AI Object Storage, then retrieved on demand.
+
+3. **Alternative Object Storage (S3 Express):** Used as a comparison point, measured under both cold and hot cache conditions.
+
+The benchmarks measured two key metrics (a simple way to measure both is sketched after the list):
+
+* **Time to First Token (TTFT):** How quickly the model generates its first token, which is dominated by the prefill phase.
+
+* **Decoding Throughput:** The number of tokens generated per second once prefill is complete.
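+
+For readers who want to reproduce the shape of these measurements, the sketch below shows a generic way to estimate TTFT and decoding throughput by streaming tokens from any OpenAI-compatible endpoint (such as a vLLM server). The endpoint URL, model name, and prompt are placeholders, and this is not the harness used to produce the numbers in this post.
+
+```python
+# Generic sketch for measuring TTFT and decoding throughput via streaming
+# against an OpenAI-compatible endpoint (e.g. a vLLM server). The endpoint,
+# model name, and prompt are placeholders, not the actual benchmark harness.
+import time
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+def measure(prompt: str, model: str = "Qwen/Qwen2.5-7B-Instruct"):
+    start = time.perf_counter()
+    first_token_at = None
+    pieces = 0
+    stream = client.chat.completions.create(
+        model=model,
+        messages=[{"role": "user", "content": prompt}],
+        max_tokens=256,
+        stream=True,
+    )
+    for chunk in stream:
+        delta = chunk.choices[0].delta.content if chunk.choices else None
+        if not delta:
+            continue
+        if first_token_at is None:
+            first_token_at = time.perf_counter()  # first streamed token marks TTFT
+        pieces += 1
+    end = time.perf_counter()
+    ttft = (first_token_at or end) - start
+    # Streamed pieces are a rough proxy for tokens; use a tokenizer for exact counts.
+    decode_tps = (pieces - 1) / (end - first_token_at) if pieces > 1 else 0.0
+    return ttft, decode_tps
+
+long_prompt = "<long retrieved context>" + "\n\nSummarize the key obligations."
+cold = measure(long_prompt)   # first request: full prefill (cold cache)
+warm = measure(long_prompt)   # repeat: stored KV can be reused (hot cache)
+print(f"cold: TTFT={cold[0]:.2f}s  decode={cold[1]:.1f} tok/s")
+print(f"warm: TTFT={warm[0]:.2f}s  decode={warm[1]:.1f} tok/s")
+```
+
+Sending the same long prompt twice is a simple way to contrast cold and hot cache conditions: the second request can reuse the stored KV cache, so its TTFT should drop sharply.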