2 changes: 1 addition & 1 deletion .github/workflows/optimization_integration.yml
@@ -175,7 +175,7 @@ jobs:
IMAGE_REPO: ${{ inputs.image-repo }}
CONTAINER: "lmi"
run: |
SERVING_VERSION=${TEST_SERVING_VERSION:-"0.35.0"}
SERVING_VERSION=${TEST_SERVING_VERSION:-"0.36.0"}
SERVING_VERSION=$(echo $SERVING_VERSION | xargs) # trim whitespace

if [ -n "$OVERRIDE_TEST_CONTAINER" ]; then
4 changes: 2 additions & 2 deletions gradle/libs.versions.toml
@@ -2,8 +2,8 @@
format.version = "1.1"

[versions]
djl = "0.35.0"
serving = "0.35.0"
djl = "0.36.0"
serving = "0.36.0"
onnxruntime = "1.20.0"
commonsCli = "1.9.0"
commonsCodec = "1.18.0"
98 changes: 98 additions & 0 deletions serving/docs/lmcache_performance.md
@@ -0,0 +1,98 @@
# LMCache Performance Benefits for LMI Customers

LMCache is a KV cache offloading solution that dramatically improves inference performance for large language model serving workloads with repeated context. By offloading the KV cache from GPU memory to CPU RAM or NVMe storage, LMCache enables efficient handling of long-context scenarios while delivering substantial latency improvements. LMCache support has been integrated into LMI since v17.

## Key Performance Improvements

Based on extensive testing across model sizes and context lengths, LMCache delivers **exceptional performance gains**:

* **CPU offloading**: Up to **28x speedup in TTFT** (achieved with Qwen 2.5-7B at 2M token context length)
* **NVMe-based offloading**: Up to **16x speedup in TTFT** (achieved with Qwen 2.5-72B at 1M token context length using O_DIRECT)

Our specific benchmarking on Qwen 8B (serving 460K tokens across 46 documents) demonstrates:

* **Time to First Token (TTFT)**: Reduced from 1.161s to 0.438s with CPU offloading (**2.65x faster**)
* **Total Request Latency**: Reduced from 52.978s to 24.274s (**2.18x faster**)

## Cache Backend Comparison

### CPU RAM Offloading (Recommended for Latency-Critical Workloads)

* **Best Performance**: Up to **28x** speedup in TTFT (at 2M tokens with Qwen 2.5-7B), fastest query TTFT (0.437s in benchmark)
* **Use Case**: Latency-sensitive applications requiring immediate response
* **Limitation**: Constrained by instance RAM capacity (e.g., 1.1TB on p4de.24xlarge)

### NVMe Storage with O_DIRECT

* **Strong Performance**: Up to **16x** speedup in TTFT (at 1M tokens with Qwen 2.5-72B), query TTFT of 0.731s (approaching CPU performance)
* **Massive Capacity**: Supports TB-scale caching for extensive document collections
* **Use Case**: Large-scale deployments with substantial context requirements
* **Configuration**: Enable `use_odirect: True` for optimal performance

## When to Use LMCache

The value of LMCache depends on your model size, context length requirements, and GPU memory.

For a p4de.24xlarge:

|Model Size |Context Length |Recommendation |
|--- |--- |--- |
|~1B |< 2M |No offloading needed |
| |2M – 30M |Consider CPU offloading |
| |> 25M |Consider Disk offloading |
|~7-10B |< 250K |No offloading needed |
| |250K – 8M |Consider CPU offloading |
| |> 6M |Consider Disk offloading |
|~70B+ |< 250K |No offloading needed |
| |250K – 3M |Consider CPU offloading |
| |> 2.5M |Consider Disk offloading |

**Key Insight**: Larger models benefit from LMCache at shorter context lengths because they consume more memory per token. A 72B model requires offloading around 500K tokens, while a 1.5B model only needs it beyond 2.5M tokens.
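The per-token KV cache footprint behind this insight can be estimated directly. A minimal sketch, assuming a Llama-3-8B-like geometry (the layer/head numbers are illustrative assumptions, not figures from the table above):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: a K and a V tensor for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-style geometry: 32 layers, 8 KV heads (GQA),
# head_dim 128, fp16 (2 bytes per element)
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)                               # 131072 bytes (128 KiB) per token
print(round(250_000 * per_token / 2**30, 1))   # ~30.5 GiB at 250K tokens
```

Because bytes-per-token grows with layer count and KV head width, larger models cross the GPU-memory budget at proportionally shorter context lengths.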

## How to use LMCache

Set the following environment variables:

```
OPTION_LMCACHE_CONFIG_FILE=/opt/ml/model/lmcache_config.yaml
OPTION_KV_TRANSFER_CONFIG={"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}
```
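On SageMaker, these variables are typically passed through the container environment when the model is created. A minimal boto3 sketch (the model name, S3 path, and role ARN are hypothetical placeholders):

```python
# Sketch: passing the LMCache settings as container environment variables.
lmcache_env = {
    "OPTION_LMCACHE_CONFIG_FILE": "/opt/ml/model/lmcache_config.yaml",
    "OPTION_KV_TRANSFER_CONFIG":
        '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
}

# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="lmi-lmcache-demo",                      # hypothetical
#     PrimaryContainer={
#         "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
#                  "djl-inference:0.36.0-lmi18.0.0-cu128",
#         "ModelDataUrl": "s3://my-bucket/model.tar.gz",  # hypothetical
#         "Environment": lmcache_env,
#     },
#     ExecutionRoleArn="arn:aws:iam::123456789012:role/MyRole",  # hypothetical
# )
print(sorted(lmcache_env))
```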

The contents of lmcache_config.yaml depend on the backend. If a backend is not configured correctly, vLLM falls back to CPU offloading. vLLM does not currently support specifying advanced LMCache configuration through environment variables.

### On-Host RAM offloading

```
# lmcache_config.yaml
# 256 tokens per KV chunk
chunk_size: 256
# 5GB of pinned CPU memory
max_local_cpu_size: 5.0  # Changes with model size
```

### On-Host NVMe offloading

```
# lmcache_config.yaml
# 256 tokens per KV chunk
chunk_size: 256
# Enable the disk backend
local_disk: "file://tmp/cache/"  # Fixed for SM customers
# 5GB of disk capacity
max_local_disk_size: 5.0  # Changes with model size
# Disable the OS page cache in favor of pinned CPU memory
extra_config: {'use_odirect': True}  # Fixed for SM customers
# 5GB of pinned CPU memory
max_local_cpu_size: 5.0  # default
```
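The `max_local_cpu_size` / `max_local_disk_size` values scale with how many tokens you intend to keep cached. A back-of-the-envelope sizing helper (the 131072 bytes/token figure assumes a Llama-3-8B-like geometry in fp16 and is an assumption, not a benchmark result):

```python
def cache_size_gb(num_tokens: int, bytes_per_token: int,
                  headroom: float = 1.2) -> float:
    """Rough max_local_cpu_size / max_local_disk_size in GB, with headroom."""
    return num_tokens * bytes_per_token * headroom / 1e9

# 460K cached tokens (the 46-document benchmark above) at an assumed
# 131072 bytes/token, with 20% headroom:
print(round(cache_size_gb(460_000, 131072), 1))  # ~72.4 GB
```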

## Deployment Recommendations

1. **Configure CPU offloading** when instance RAM permits—it delivers optimal performance
2. **Use NVMe with O_DIRECT enabled** for workloads requiring larger cache capacity
3. **Implement session-based sticky routing** on SageMaker Classic to maximize cache hit rates
4. **Consider model architecture**: Models with different KV head configurations (e.g., Llama 3 8B vs Qwen 2.5-7B) will have different offloading thresholds

## Performance Validation

LMI container with LMCache demonstrates **performance parity** with open-source vLLM LMCache, ensuring enterprise customers receive the same optimization benefits with production-grade support and integration.
31 changes: 30 additions & 1 deletion serving/docs/lmi/release_notes.md
@@ -3,10 +3,39 @@
Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker.
For details on historical releases, refer to the [Github Releases page](https://github.com/deepjavalibrary/djl-serving/releases).

## LMI V17 (DJL-Serving 0.35.0)
## LMI V18 (DJL-Serving 0.36.0)

Meet your brand new image! 💿

#### LMI (vLLM) Image – 12-15-2025
```
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi18.0.0-cu128
```
* vLLM has been upgraded to `0.12.0`
* LMCache support for on-host caching of the KV cache delivers up to 20x improvements in request latency for long-context requests. Refer to [LMCache Performance Benefits for LMI Customers](../lmcache_performance.md) for more details.
* Added support for adapter-scoped custom code (e.g., model.py) that can be registered dynamically via the adapter management APIs, enabling per-adapter input/output formatting for multi-tenant LoRA deployments

##### Key Features

**Enhanced Adapter Management with Custom Code Support**
* On adapter registration, DJL Serving now checks the adapter directory for `model.py` and (if present) loads the adapter's custom formatters before registering adapter weights
* If adapter custom code loading fails, registration fails fast (adapter weights are not registered) and returns an error response (code 424)
* During inference, adapter-specific formatters override base model formatters when the inference targets an adapter
* On adapter unregistration, the adapter's custom code is unloaded/cleaned up
* Enables per-adapter input/output formatting for multi-tenant LoRA deployments

**LMCache Performance Improvements**
* Up to 28x speedup in Time to First Token (TTFT) with CPU offloading (achieved with Qwen 2.5-7B at 2M token context length)
* Up to 16x speedup in TTFT with NVMe-based offloading (achieved with Qwen 2.5-72B at 1M token context length using O_DIRECT)
* Comprehensive benchmarking suite across different storage backends (CPU RAM, NVMe, Redis, S3, EBS)

**Security & Stability**
* Enhanced security validation for adapters in Secure Mode plugin
* Improved multimodal integration test stability with vLLM 0.12.0
* Updated CI/CD pipeline to use serving version consistently across workflows

## LMI V17 (DJL-Serving 0.35.0)

#### LMI (vLLM) Image – 9-30-2025
```
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128
24 changes: 16 additions & 8 deletions tests/integration/llm/prepare.py
@@ -665,16 +665,24 @@
"option.tensor_parallel_degree": 4,
},
"llama3-8b-lmcache-s3": {
"option.model_id": "s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree": 4,
"lmcache_config_file": "lmcache_s3.yaml",
"option.kv_transfer_config": '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
"option.model_id":
"s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree":
4,
"lmcache_config_file":
"lmcache_s3.yaml",
"option.kv_transfer_config":
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
},
"llama3-8b-lmcache-redis": {
"option.model_id": "s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree": 4,
"lmcache_config_file": "lmcache_redis.yaml",
"option.kv_transfer_config": '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
"option.model_id":
"s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree":
4,
"lmcache_config_file":
"lmcache_redis.yaml",
"option.kv_transfer_config":
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
},
}

50 changes: 27 additions & 23 deletions tests/integration/lmcache_configs/djl_long_doc_qa_clean.py
@@ -29,8 +29,8 @@ class RequestStats:
successful: bool


async def send_djl_request(session, semaphore, base_url, prompt, output_len,
prompt_index, total_prompts):
async def send_djl_request(session, semaphore, base_url, prompt, output_len,
prompt_index, total_prompts):
"""Send a single async request to DJL's /invocations endpoint with streaming."""
async with semaphore:
start_time = time.time()
@@ -51,29 +51,31 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,

try:
async with session.post(
f"{base_url}/invocations",
json=payload,
headers={"Content-Type": "application/json"}
) as response:
f"{base_url}/invocations",
json=payload,
headers={"Content-Type": "application/json"}) as response:
async for line in response.content:
if not line:
continue

line_str = line.decode('utf-8').strip()

if not line_str or line_str.startswith('data:'):
continue

try:
chunk_data = json.loads(line_str)

if isinstance(chunk_data, dict) and 'choices' in chunk_data:
if isinstance(chunk_data,
dict) and 'choices' in chunk_data:
choice = chunk_data['choices'][0]
content = None

if 'delta' in choice and 'content' in choice['delta']:
if 'delta' in choice and 'content' in choice[
'delta']:
content = choice['delta']['content']
elif 'message' in choice and 'content' in choice['message']:
elif 'message' in choice and 'content' in choice[
'message']:
content = choice['message']['content']

if content:
@@ -86,7 +88,7 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,

end_time = time.time()
final_response = "".join(responses)

# Print complete request info
print(f"\n[Request {prompt_index + 1}/{total_prompts}] "
f"Completed in {end_time - start_time:.2f}s")
@@ -115,24 +117,25 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,
)


async def run_benchmark(base_url, model, prompts, output_len, max_inflight_requests):
async def run_benchmark(base_url, model, prompts, output_len,
max_inflight_requests):
"""Run benchmark with given prompts using asyncio."""
# Create semaphore to limit concurrent requests
semaphore = asyncio.Semaphore(max_inflight_requests)

# Create aiohttp session with no timeout
timeout = aiohttp.ClientTimeout(total=None)
async with aiohttp.ClientSession(timeout=timeout) as session:
# Create all tasks
tasks = [
send_djl_request(session, semaphore, base_url, prompt, output_len,
i, len(prompts))
for i, prompt in enumerate(prompts)
send_djl_request(session,
semaphore, base_url, prompt, output_len, i,
len(prompts)) for i, prompt in enumerate(prompts)
]

# Execute all tasks concurrently
request_stats = await asyncio.gather(*tasks)

# Sort by prompt_id to maintain order
request_stats = list(request_stats)
request_stats.sort(key=lambda x: x.prompt_id)
@@ -175,8 +178,8 @@ async def main(args):
pre_warmup_prompts = [
str(i) + "xx" + " ".join(["hi"] * 1000) for i in range(5)
]
await run_benchmark(base_url, args.model, pre_warmup_prompts, args.output_len,
args.max_inflight_requests)
await run_benchmark(base_url, args.model, pre_warmup_prompts,
args.output_len, args.max_inflight_requests)

# Prepare main prompts
warmup_prompts = [
@@ -187,8 +190,8 @@ async def main(args):
# Warmup round
print("\n=== Warmup round ===")
warmup_start_time = time.time()
warmup_request_stats = await run_benchmark(base_url, args.model, warmup_prompts,
args.output_len,
warmup_request_stats = await run_benchmark(base_url, args.model,
warmup_prompts, args.output_len,
args.max_inflight_requests)
warmup_end_time = time.time()

@@ -200,7 +203,8 @@ async def main(args):

benchmark_start_time = time.time()
benchmark_request_stats = await run_benchmark(base_url, args.model,
query_prompts, args.output_len,
query_prompts,
args.output_len,
args.max_inflight_requests)
benchmark_end_time = time.time()
