2 changes: 1 addition & 1 deletion .github/workflows/optimization_integration.yml
@@ -175,7 +175,7 @@ jobs:
IMAGE_REPO: ${{ inputs.image-repo }}
CONTAINER: "lmi"
run: |
SERVING_VERSION=${TEST_SERVING_VERSION:-"0.35.0"}
SERVING_VERSION=${TEST_SERVING_VERSION:-"0.36.0"}
SERVING_VERSION=$(echo $SERVING_VERSION | xargs) # trim whitespace

if [ -n "$OVERRIDE_TEST_CONTAINER" ]; then
4 changes: 2 additions & 2 deletions gradle/libs.versions.toml
@@ -2,8 +2,8 @@
format.version = "1.1"

[versions]
djl = "0.35.0"
serving = "0.35.0"
djl = "0.36.0"
serving = "0.36.0"
onnxruntime = "1.20.0"
commonsCli = "1.9.0"
commonsCodec = "1.18.0"
98 changes: 98 additions & 0 deletions serving/docs/lmcache_performance.md
@@ -0,0 +1,98 @@
# LMCache Performance Benefits for LMI Customers

LMCache is a KV cache offloading solution that dramatically improves inference performance for large language model serving workloads with repeated context. By offloading the KV cache from GPU memory to CPU RAM or NVMe storage, LMCache enables efficient handling of long-context scenarios while delivering substantial latency improvements. LMCache support has been integrated into LMI since v17.

## Key Performance Improvements

Based on extensive testing across model sizes and context lengths, LMCache delivers **exceptional performance gains**:

* **CPU offloading**: Up to **28x speedup in TTFT** (achieved with Qwen 2.5-7B at 2M token context length)
* **NVMe-based offloading**: Up to **16x speedup in TTFT** (achieved with Qwen 2.5-72B at 1M token context length using O_DIRECT)

Our specific benchmarking on Qwen 8B (serving 460K tokens across 46 documents) demonstrates:

* **Time to First Token (TTFT)**: Reduced from 1.161s to 0.438s with CPU offloading (**2.65x faster**)
* **Total Request Latency**: Reduced from 52.978s to 24.274s (**2.18x faster**)

## Cache Backend Comparison

### CPU RAM Offloading (Recommended for Latency-Critical Workloads)

* **Best Performance**: Up to **28x** speedup in TTFT (at 2M tokens with Qwen 2.5-7B), fastest query TTFT (0.437s in benchmark)
* **Use Case**: Latency-sensitive applications requiring immediate response
* **Limitation**: Constrained by instance RAM capacity (e.g., 1.1TB on p4de.24xlarge)

### NVMe Storage with O_DIRECT

* **Strong Performance**: Up to **16x** speedup in TTFT (at 1M tokens with Qwen 2.5-72B), query TTFT of 0.731s (approaching CPU performance)
* **Massive Capacity**: Supports TB-scale caching for extensive document collections
* **Use Case**: Large-scale deployments with substantial context requirements
* **Configuration**: Enable `use_odirect: True` for optimal performance

## When to Use LMCache

The value of LMCache depends on your model size, context length requirements, and GPU memory.

For a p4de.24xlarge:

|Model Size |Context Length |Recommendation |
|--- |--- |--- |
|~1B |< 2M |No offloading needed |
| |2M – 30M |Consider CPU offloading |
| |> 25M |Consider Disk offloading |
|~7-10B |< 250K |No offloading needed |
| |250K – 8M |Consider CPU offloading |
| |> 6M |Consider Disk offloading |
|~70B+ |< 250K |No offloading needed |
| |250K – 3M |Consider CPU offloading |
| |> 2.5M |Consider Disk offloading |

**Key Insight**: Larger models benefit from LMCache at shorter context lengths because they consume more memory per token. A 72B model requires offloading around 500K tokens, while a 1.5B model only needs it beyond 2.5M tokens.
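The per-token KV cache footprint behind this insight can be estimated directly. A minimal sketch, assuming a Llama-3-8B-like geometry (the layer/head numbers are illustrative assumptions, not figures from the table above):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: a K and a V tensor for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-style geometry: 32 layers, 8 KV heads (GQA),
# head_dim 128, fp16 (2 bytes per element)
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)                               # 131072 bytes (128 KiB) per token
print(round(250_000 * per_token / 2**30, 1))   # ~30.5 GiB at 250K tokens
```

Because bytes-per-token grows with layer count and KV head width, larger models cross the GPU-memory budget at proportionally shorter context lengths.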

## How to use LMCache

Set the following environment variables:

```
OPTION_LMCACHE_CONFIG_FILE=/opt/ml/model/lmcache_config.yaml
OPTION_KV_TRANSFER_CONFIG={"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}
```
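On SageMaker, these variables are typically passed through the container environment when the model is created. A minimal boto3 sketch (the model name, S3 path, and role ARN are hypothetical placeholders):

```python
# Sketch: passing the LMCache settings as container environment variables.
lmcache_env = {
    "OPTION_LMCACHE_CONFIG_FILE": "/opt/ml/model/lmcache_config.yaml",
    "OPTION_KV_TRANSFER_CONFIG":
        '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
}

# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="lmi-lmcache-demo",                      # hypothetical
#     PrimaryContainer={
#         "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
#                  "djl-inference:0.36.0-lmi18.0.0-cu128",
#         "ModelDataUrl": "s3://my-bucket/model.tar.gz",  # hypothetical
#         "Environment": lmcache_env,
#     },
#     ExecutionRoleArn="arn:aws:iam::123456789012:role/MyRole",  # hypothetical
# )
print(sorted(lmcache_env))
```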

The contents of lmcache_config.yaml depend on the backend. If a backend is not configured correctly, vLLM falls back to CPU offloading. vLLM does not currently support specifying advanced LMCache configuration through environment variables.

### On-Host RAM offloading

```
# lmcache_config.yaml
# 256 tokens per KV chunk
chunk_size: 256
# 5GB of pinned CPU memory
max_local_cpu_size: 5.0  # Changes with model size
```

### On-Host NVMe offloading

```
# lmcache_config.yaml
# 256 tokens per KV chunk
chunk_size: 256
# Enable the disk backend
local_disk: "file://tmp/cache/"  # Fixed for SM customers
# 5GB of disk capacity
max_local_disk_size: 5.0  # Changes with model size
# Disable the OS page cache in favor of pinned CPU memory
extra_config: {'use_odirect': True}  # Fixed for SM customers
# 5GB of pinned CPU memory
max_local_cpu_size: 5.0  # default
```
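The `max_local_cpu_size` / `max_local_disk_size` values scale with how many tokens you intend to keep cached. A back-of-the-envelope sizing helper (the 131072 bytes/token figure assumes a Llama-3-8B-like geometry in fp16 and is an assumption, not a benchmark result):

```python
def cache_size_gb(num_tokens: int, bytes_per_token: int,
                  headroom: float = 1.2) -> float:
    """Rough max_local_cpu_size / max_local_disk_size in GB, with headroom."""
    return num_tokens * bytes_per_token * headroom / 1e9

# 460K cached tokens (the 46-document benchmark above) at an assumed
# 131072 bytes/token, with 20% headroom:
print(round(cache_size_gb(460_000, 131072), 1))  # ~72.4 GB
```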

## Deployment Recommendations

1. **Configure CPU offloading** when instance RAM permits—it delivers optimal performance
2. **Use NVMe with O_DIRECT enabled** for workloads requiring larger cache capacity
3. **Implement session-based sticky routing** on SageMaker Classic to maximize cache hit rates
4. **Consider model architecture**: Models with different KV head configurations (e.g., Llama 3 8B vs Qwen 2.5-7B) will have different offloading thresholds

## Performance Validation

LMI container with LMCache demonstrates **performance parity** with open-source vLLM LMCache, ensuring enterprise customers receive the same optimization benefits with production-grade support and integration.
31 changes: 30 additions & 1 deletion serving/docs/lmi/release_notes.md
@@ -3,10 +3,39 @@
Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker.
For details on historical releases, refer to the [Github Releases page](https://github.com/deepjavalibrary/djl-serving/releases).

## LMI V17 (DJL-Serving 0.35.0)
## LMI V18 (DJL-Serving 0.36.0)

Meet your brand new image! 💿

#### LMI (vLLM) Image – 12-15-2025
```
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi18.0.0-cu128
```
* vLLM has been upgraded to `0.12.0`
* LMCache support for on-host caching of the KV cache delivers up to 20x improvements in request latency for long-context requests. Refer to [LMCache Performance Benefits for LMI Customers](../lmcache_performance.md) for more details.
* Added support for adapter-scoped custom code (e.g., model.py) that can be registered dynamically via the adapter management APIs, enabling per-adapter input/output formatting for multi-tenant LoRA deployments

##### Key Features

**Enhanced Adapter Management with Custom Code Support**
* On adapter registration, DJL Serving now checks the adapter directory for `model.py` and (if present) loads the adapter's custom formatters before registering adapter weights
* If adapter custom code loading fails, registration fails fast (adapter weights are not registered) and returns an error response (code 424)
* During inference, adapter-specific formatters override base model formatters when the inference targets an adapter
* On adapter unregistration, the adapter's custom code is unloaded/cleaned up
* Enables per-adapter input/output formatting for multi-tenant LoRA deployments

**LMCache Performance Improvements**
* Up to 28x speedup in Time to First Token (TTFT) with CPU offloading (achieved with Qwen 2.5-7B at 2M token context length)
* Up to 16x speedup in TTFT with NVMe-based offloading (achieved with Qwen 2.5-72B at 1M token context length using O_DIRECT)
* Comprehensive benchmarking suite across different storage backends (CPU RAM, NVMe, Redis, S3, EBS)

**Security & Stability**
* Enhanced security validation for adapters in Secure Mode plugin
* Improved multimodal integration test stability with vLLM 0.12.0
* Updated CI/CD pipeline to use serving version consistently across workflows

## LMI V17 (DJL-Serving 0.35.0)

#### LMI (vLLM) Image – 9-30-2025
```
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128
24 changes: 16 additions & 8 deletions tests/integration/llm/prepare.py
@@ -665,16 +665,24 @@
"option.tensor_parallel_degree": 4,
},
"llama3-8b-lmcache-s3": {
"option.model_id": "s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree": 4,
"lmcache_config_file": "lmcache_s3.yaml",
"option.kv_transfer_config": '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
"option.model_id":
"s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree":
4,
"lmcache_config_file":
"lmcache_s3.yaml",
"option.kv_transfer_config":
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
},
"llama3-8b-lmcache-redis": {
"option.model_id": "s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree": 4,
"lmcache_config_file": "lmcache_redis.yaml",
"option.kv_transfer_config": '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
"option.model_id":
"s3://djl-llm/llama-3-8b-instruct-hf/",
"option.tensor_parallel_degree":
4,
"lmcache_config_file":
"lmcache_redis.yaml",
"option.kv_transfer_config":
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}',
},
}

50 changes: 27 additions & 23 deletions tests/integration/lmcache_configs/djl_long_doc_qa_clean.py
@@ -29,8 +29,8 @@ class RequestStats:
successful: bool


async def send_djl_request(session, semaphore, base_url, prompt, output_len,
prompt_index, total_prompts):
async def send_djl_request(session, semaphore, base_url, prompt, output_len,
prompt_index, total_prompts):
"""Send a single async request to DJL's /invocations endpoint with streaming."""
async with semaphore:
start_time = time.time()
@@ -51,29 +51,31 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,

try:
async with session.post(
f"{base_url}/invocations",
json=payload,
headers={"Content-Type": "application/json"}
) as response:
f"{base_url}/invocations",
json=payload,
headers={"Content-Type": "application/json"}) as response:
async for line in response.content:
if not line:
continue

line_str = line.decode('utf-8').strip()

if not line_str or line_str.startswith('data:'):
continue

try:
chunk_data = json.loads(line_str)

if isinstance(chunk_data, dict) and 'choices' in chunk_data:
if isinstance(chunk_data,
dict) and 'choices' in chunk_data:
choice = chunk_data['choices'][0]
content = None

if 'delta' in choice and 'content' in choice['delta']:
if 'delta' in choice and 'content' in choice[
'delta']:
content = choice['delta']['content']
elif 'message' in choice and 'content' in choice['message']:
elif 'message' in choice and 'content' in choice[
'message']:
content = choice['message']['content']

if content:
@@ -86,7 +88,7 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,

end_time = time.time()
final_response = "".join(responses)

# Print complete request info
print(f"\n[Request {prompt_index + 1}/{total_prompts}] "
f"Completed in {end_time - start_time:.2f}s")
@@ -115,24 +117,25 @@ async def send_djl_request(session, semaphore, base_url, prompt, output_len,
)


async def run_benchmark(base_url, model, prompts, output_len, max_inflight_requests):
async def run_benchmark(base_url, model, prompts, output_len,
max_inflight_requests):
"""Run benchmark with given prompts using asyncio."""
# Create semaphore to limit concurrent requests
semaphore = asyncio.Semaphore(max_inflight_requests)

# Create aiohttp session with no timeout
timeout = aiohttp.ClientTimeout(total=None)
async with aiohttp.ClientSession(timeout=timeout) as session:
# Create all tasks
tasks = [
send_djl_request(session, semaphore, base_url, prompt, output_len,
i, len(prompts))
for i, prompt in enumerate(prompts)
send_djl_request(session,
semaphore, base_url, prompt, output_len, i,
len(prompts)) for i, prompt in enumerate(prompts)
]

# Execute all tasks concurrently
request_stats = await asyncio.gather(*tasks)

# Sort by prompt_id to maintain order
request_stats = list(request_stats)
request_stats.sort(key=lambda x: x.prompt_id)
@@ -175,8 +178,8 @@ async def main(args):
pre_warmup_prompts = [
str(i) + "xx" + " ".join(["hi"] * 1000) for i in range(5)
]
await run_benchmark(base_url, args.model, pre_warmup_prompts, args.output_len,
args.max_inflight_requests)
await run_benchmark(base_url, args.model, pre_warmup_prompts,
args.output_len, args.max_inflight_requests)

# Prepare main prompts
warmup_prompts = [
@@ -187,8 +190,8 @@ async def main(args):
# Warmup round
print("\n=== Warmup round ===")
warmup_start_time = time.time()
warmup_request_stats = await run_benchmark(base_url, args.model, warmup_prompts,
args.output_len,
warmup_request_stats = await run_benchmark(base_url, args.model,
warmup_prompts, args.output_len,
args.max_inflight_requests)
warmup_end_time = time.time()

@@ -200,7 +203,8 @@ async def main(args):

benchmark_start_time = time.time()
benchmark_request_stats = await run_benchmark(base_url, args.model,
query_prompts, args.output_len,
query_prompts,
args.output_len,
args.max_inflight_requests)
benchmark_end_time = time.time()
