
Commit 0ee276d

Author: hazem
feat: Replace legacy spillover logic with Waterfall LRU architecture
This is a major architectural upgrade to the core benchmark logic, replacing the original "Spillover" memory management strategy with the new "Waterfall LRU" implementation to accurately simulate enterprise storage hierarchies.

Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe). New data now correctly lands in the fastest available tier, pushing cold data down, rather than the old behavior where new data skipped directly to NVMe if RAM was full.
- Static Buffer Optimization: Replaced the CPU-bound np.random generation with a pre-allocated static noise buffer. This removes the CPU bottleneck that was masking true storage latency, allowing us to fully saturate high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits (max_concurrent_allocs) and atomic memory reservations to prevent OOM crashes under heavy load.
- Storage Metrics: Added explicit tracking for nvme_tokens_processed to calculate true storage throughput separate from system throughput.
- Stress Test Validation: Verified that this new architecture correctly exposes storage latency limits (e.g., pushing P95 write latency >1000ms) where the old script artificially throttled the load.
1 parent 0c2561d commit 0ee276d

File tree

3 files changed: +1677 -1119 lines

kv_cache_benchmark/MLperf v3 KV cache proposal.md

Lines changed: 205 additions & 8 deletions
@@ -529,6 +529,75 @@ The benchmark copies that pattern with three simple pieces:

In the summary you will see both numbers. A high reuse count with few hits simply says the prompt was detected but the stored copy had already been evicted, which is exactly what operators watch for in production.

### J. ShareGPT Replay: Realistic Workload Simulation

While synthetic workloads (using random token counts within a range) are excellent for controlled stress testing, they may not fully capture the nuances of human-AI interaction. The **ShareGPT Replay** feature addresses this by loading real conversation trees from the ShareGPT dataset.

**How it works:**

1. **Ingestion:** The `ShareGPTDatasetLoader` parses a JSON dataset of real conversations. It uses a tokenizer to calculate the exact `context_tokens` (user prompt) and `generate_tokens` (model response) for every turn (a minimal sketch follows this list).
2. **Replay:** Instead of generating random requests, the benchmark feeds these real token counts into the `InferenceRequest` queue.
3. **Structure Preservation:** Crucially, it preserves the multi-turn structure of the data: request 2 is guaranteed to be a follow-up to request 1, testing the `MultiTierCache`'s ability to handle real conversational locality.
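
For reference, here is a minimal sketch of the ingestion step under common assumptions: the standard ShareGPT JSON layout (a list of records whose `conversations` array alternates `human`/`gpt` turns) and a generic `tokenizer` object with an `encode` method standing in for whatever tokenizer the real `ShareGPTDatasetLoader` wires in. Everything beyond the names quoted above is illustrative, not the script's actual code.

```python
# Illustrative sketch only -- the real loader lives in kv-cache_sharegpt_replay.py.
import json
from dataclasses import dataclass
from typing import List

@dataclass
class ConversationTurn:
    context_tokens: int    # tokens in the user prompt
    generate_tokens: int   # tokens in the model response

def load_sharegpt_turns(path: str, tokenizer, max_conversations: int = 1000) -> List[List[ConversationTurn]]:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)

    conversations = []
    for record in records[:max_conversations]:
        turns, msgs = [], record.get("conversations", [])
        # Walk the messages as (human, gpt) pairs to keep the multi-turn structure intact.
        for user_msg, model_msg in zip(msgs[0::2], msgs[1::2]):
            if user_msg.get("from") != "human" or model_msg.get("from") != "gpt":
                continue
            turns.append(ConversationTurn(
                context_tokens=len(tokenizer.encode(user_msg["value"])),
                generate_tokens=len(tokenizer.encode(model_msg["value"])),
            ))
        if turns:
            conversations.append(turns)
    return conversations
```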

**Case Study: Analyzing ShareGPT Results**

Running a replay with the `llama3.1-70b-instruct` model on a memory-constrained system (2GB CPU RAM) reveals bottlenecks often hidden by uniform random distributions.

* **High Cache Hit Rate (97.2%):** Real conversations exhibit high locality. Users ask follow-up questions, allowing the system to reuse the KV cache effectively.
* **NVMe Read Latency Spikes (291ms P95):** Unlike synthetic tests, which cluster around a configured mean, real user inputs vary wildly. A single request with a 16k-token context can saturate the read bandwidth, pushing the P95 latency above the 200ms target and resulting in a "FAIL" assessment for storage even if throughput is high.

**Sample Output Summary:**

```text
### STORAGE PERFORMANCE ASSESSMENT: FAIL ✗ ###
Criteria Passed: 3/4
✓ NVMe Write P95 < 500ms: 54.50ms
✗ NVMe Read P95 < 200ms: 291.11ms (Target: 200ms)
✓ Cache Hit Rate > 30%: 97.2%

### CACHE TIER DISTRIBUTION ###
GPU Entries: 0 (0.00 GB)
CPU Entries: 156 (1.60 GB)
NVMe Entries: 1772 (92% of cache on slow storage)
```

### K. The Importance of Realism: A Comparative Case Study

To illustrate why workload realism matters, we compared two runs of the benchmark on identical hardware (50 users, 70B model, NVMe-only cache).

**Run A: Real Workload (ShareGPT)**
This run uses the actual conversation data, reflecting human usage patterns.
```bash
python3 kv-cache_sharegpt_replay.py \
--model llama3.1-70b-instruct \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--gpu-mem-gb 0 --cpu-mem-gb 2 --cache-dir /mnt/nvme \
--num-users 50 --duration 300 --generation-mode none
```

**Run B: Synthetic Workload (Random)**
This run omits the dataset, causing the benchmark to fall back to generating random, full-length contexts. This represents a "worst-case" scenario (e.g., massive document processing) rather than a chat workload.
```bash
python3 kv-cache_sharegpt_replay.py \
--model llama3.1-70b-instruct \
--gpu-mem-gb 0 --cpu-mem-gb 2 --cache-dir /mnt/nvme \
--num-users 50 --duration 300 --generation-mode none
```

The results were dramatically different:

| Metric | Run A: ShareGPT (Real) | Run B: Synthetic (Random) | Difference |
| :--- | :--- | :--- | :--- |
| **Workload Type** | Human Conversations | Random Large Contexts | |
| **Mean Context Size** | **133 tokens** (~41 MB) | **2,676 tokens** (~836 MB) | **20x Larger Data** |
| **Throughput** | **2,610 tok/sec** | **362 tok/sec** | **7.2x Slower** |
| **NVMe Read P95** | **291 ms** | **6,752 ms** (6.7s) | **23x Slower** |
| **End-to-End P50** | 93 ms | 121,158 ms (2 min) | **System Collapse** |

**Key Findings:**

1. **Context Size Explosion:** Real human queries are concise (avg 133 tokens). The synthetic generator, aiming for coverage, produced contexts averaging 2,676 tokens. This forced the storage system to read/write **20x more data per request** in the synthetic run.
2. **System Collapse:** In the synthetic run, the P50 end-to-end latency ballooned to **2 minutes**, while the storage latency was only ~4 seconds. This indicates the system was in a state of **thrashing**, where requests spent 95% of their time waiting in the queue because the storage was saturated handling massive files.
3. **Cache Efficiency:** Real conversations have high locality (85.9% multi-turn hit rate) because users ask follow-up questions. The synthetic run had a much lower hit rate (60.1%), further stressing the storage.

**Conclusion:** Run A represents a realistic chatbot application, where the NVMe drive is nearly sufficient. Run B represents a worst-case scenario, proving that for such heavy workloads, the current hardware configuration is inadequate.

---

## 6. Current Work: Validating Simulation Accuracy with vLLM

@@ -644,16 +713,16 @@ Two primary scenarios should be submitted to give a comprehensive view of storag

#### Standard Submission: `llama3.1-8b`

- This workload provides a baseline for storage performance under typical conditions. A fixed seed is required to ensure the workload is identical for all submissions, enabling fair and reproducible comparisons.
+ This workload provides a baseline for storage performance under typical conditions. **Note:** We set `cpu-mem-gb 0` to disable the caching tier entirely, forcing every token to hit the NVMe drive. This ensures the benchmark measures the storage hardware, not the OS file cache.

```bash
# MLPerf v3.0 Recommended Invocation: Storage Saturation Test (8B Model)
- python3 kv-cache.py \
+ python3 kv-cache-waterfall-lru.py \
--model llama3.1-8b \
--num-users 150 \
--duration 600 \
--gpu-mem-gb 0 \
- --cpu-mem-gb 2 \
+ --cpu-mem-gb 0 \
--generation-mode realistic \
--performance-profile throughput \
--seed 42 \
@@ -666,18 +735,21 @@ This workload tests the storage's ability to handle a much heavier load, as the

```bash
# MLPerf v3.0 Recommended Invocation: Storage Saturation Test (70B Model)
- python3 kv-cache.py \
+ python3 kv-cache-waterfall-lru.py \
--model llama3.1-70b-instruct \
--num-users 40 \
--duration 600 \
--gpu-mem-gb 0 \
- --cpu-mem-gb 4 \
+ --cpu-mem-gb 0 \
--generation-mode realistic \
--performance-profile throughput \
--seed 42 \
--output mlperf_v3_storage_submission_70b.json
```

**Why `cpu-mem-gb 0`?**
In previous versions, a small CPU budget (e.g., 2GB) was allowed. However, analysis showed that operating system file caching (Page Cache) could absorb write bursts within this budget, artificially lowering latency metrics. Setting both GPU and CPU memory to 0 forces the "Waterfall" logic to bypass all caching layers and write directly to the NVMe backend, providing the most rigorous and honest assessment of storage I/O performance.
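
To make the effect of zeroed budgets concrete, here is a hypothetical sketch of the tier-selection fall-through; the helper name and capacity map are illustrative, not the script's actual code.

```python
# Hypothetical sketch: with gpu-mem-gb 0 and cpu-mem-gb 0, every allocation
# falls straight through the waterfall to the NVMe tier.
TIER_ORDER = ["gpu", "cpu", "nvme"]

def first_tier_with_capacity(capacity_gb: dict, required_gb: float) -> str:
    for tier in TIER_ORDER:
        if capacity_gb.get(tier, 0) >= required_gb:
            return tier
    return "nvme"  # NVMe is the backstop tier

print(first_tier_with_capacity({"gpu": 0, "cpu": 0, "nvme": float("inf")}, 0.25))  # -> nvme
print(first_tier_with_capacity({"gpu": 0, "cpu": 2, "nvme": float("inf")}, 0.25))  # -> cpu
```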

**Key Parameters Explained:**
* `--num-users 150`: A high, fixed user count is used to ensure the storage device is placed under significant and continuous load.
* `--duration 600`: A 10-minute duration ensures the benchmark reaches a stable, steady-state performance level, which is a standard requirement for MLPerf results.
@@ -872,7 +944,7 @@ python3 kv-cache.py \
--num-users 50 \
--duration 180 \
--gpu-mem-gb 0 \
- --cpu-mem-gb 0.5 \
+ --cpu-mem-gb 0 \
--generation-mode realistic \
--cache-dir /mnt/nvme \
--seed 42 \
@@ -925,7 +997,7 @@ python3 kv-cache.py \
--num-users 10 \
--duration 180 \
--gpu-mem-gb 0 \
- --cpu-mem-gb 32 \
+ --cpu-mem-gb 0 \
--enable-autoscaling \
--autoscaler-mode capacity \
--generation-mode none \
@@ -1004,4 +1076,129 @@ python3 kv-cache.py \
--cache-dir /mnt/nvme \
--seed 42 \
--output results_max_stress.json
```

### Test 9: ShareGPT Workload Replay

**Purpose:** Validates system performance against a trace of real-world human-AI conversations. This is the closest approximation to running a production service. It uses the dedicated replay script [`kv-cache_sharegpt_replay.py`](kv-cache_sharegpt_replay.py).

```bash
python3 kv-cache_sharegpt_replay.py \
--model llama3.1-70b-instruct \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--max-conversations 1000 \
--gpu-mem-gb 0 \
--cpu-mem-gb 2 \
--cache-dir /mnt/nvme \
--num-users 50 \
--duration 300 \
--generation-mode none \
--output results_sharegpt_replay.json
```

---

# CHANGES-12-05-2025: The "Waterfall" Architecture & Optimization

**Date:** December 5, 2025
**Subject:** Major architectural upgrade to `kv-cache-waterfall-lru.py`.

This update introduces a fundamental shift in how the benchmark manages memory, moving from a simple "Spillover" model to a sophisticated "Waterfall" eviction strategy. It also addresses a critical CPU bottleneck that was masking true storage performance.

## 1. Architectural Shift: From Spillover to Waterfall

The original benchmark used a **Spillover** strategy: when the GPU was full, new data was forced directly into the CPU tier (and then NVMe).

* **The Problem:** New data is often the "hottest" (most likely to be read again soon). By forcing it to the slowest tier, we were penalizing active conversations. Meanwhile, old, cold data sat comfortably in the GPU, wasting valuable VRAM.
* **The Solution (Waterfall):** The new implementation enforces a strict hierarchy. New data **always** targets the fastest tier (GPU).
  * If the GPU is full, the system identifies the **Least Recently Used (LRU)** item in the GPU and moves it to the CPU to make room.
  * If the CPU is full, it moves the CPU's LRU item to NVMe.
* **Result:** The hottest data stays fast. Only truly cold data "falls" down the waterfall to storage. This mimics the behavior of production-grade caching systems like Redis or vLLM.

### The Waterfall Flow

```ascii
  [ New Data ]
       |
       v
+-------------+   (Full?)    +-------------+   (Full?)    +-------------+
|  GPU Tier   | -----------> |  CPU Tier   | -----------> |  NVMe Tier  |
|  (Fastest)  |  Evict LRU   |  (Medium)   |  Evict LRU   |  (Slowest)  |
+-------------+              +-------------+              +-------------+
       ^                            ^                            ^
       |                            |                            |
 [ Hot Access ]              [ Warm Access ]              [ Cold Access ]
```
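
Before looking at the actual implementation, the cascade can be modeled in a few lines with ordinary `OrderedDict`s. This is a self-contained toy under assumed entry-count capacities, not the benchmark's code.

```python
from collections import OrderedDict

# Toy model of the Waterfall LRU described above (capacities in entries, not bytes).
TIERS = ["gpu", "cpu", "nvme"]
CAPACITY = {"gpu": 2, "cpu": 3, "nvme": float("inf")}
cache = {t: OrderedDict() for t in TIERS}  # each OrderedDict is LRU-ordered (oldest first)

def put(key, value, tier="gpu"):
    """New data always targets the fastest tier; LRU items cascade downward."""
    while len(cache[tier]) >= CAPACITY[tier]:
        lru_key, lru_val = cache[tier].popitem(last=False)   # evict this tier's LRU entry
        next_tier = TIERS[TIERS.index(tier) + 1]
        put(lru_key, lru_val, next_tier)                      # demote it one level down
    cache[tier][key] = value

def get(key):
    """A hit refreshes recency within its tier (promotion is omitted in this toy)."""
    for tier in TIERS:
        if key in cache[tier]:
            cache[tier].move_to_end(key)
            return tier, cache[tier][key]
    return None, None

for i in range(7):
    put(f"conv-{i}", f"kv-{i}")
print({t: list(cache[t]) for t in TIERS})
# {'gpu': ['conv-5', 'conv-6'], 'cpu': ['conv-2', 'conv-3', 'conv-4'], 'nvme': ['conv-0', 'conv-1']}
```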

### Implementation: Recursive Eviction

The core logic resides in `_ensure_space_in_tier`. It recursively clears space in lower tiers to make room for demotions from higher tiers.

```python
def _ensure_space_in_tier(self, tier: str, required_bytes: int, recursion_depth: int = 0) -> bool:
    # ... (recursion limits, capacity checks, and next_tier selection omitted;
    #      next_tier follows the waterfall order gpu -> cpu -> nvme) ...

    # Find the LRU entry in this tier
    lru_entries = self._get_lru_entries_in_tier(tier)
    lru_key, lru_entry = lru_entries[0]
    lru_size = lru_entry['size']

    # Recursively ensure the next tier has space for this entry.
    # This triggers the "Waterfall" effect down the hierarchy.
    if not self._ensure_space_in_tier(next_tier, lru_size, recursion_depth + 1):
        return False

    # Demote the LRU entry to the next tier
    success, _ = self._demote_entry(lru_key, tier, next_tier)
```

## 2. Removing the CPU Bottleneck: Static Noise Buffers

**The Issue:**
Profiling the original script revealed that `np.random.uniform` (the function used to generate the dummy KV cache data) was consuming massive amounts of CPU time.
* **Impact:** The CPU was spending so much time generating random numbers that it couldn't issue storage I/O requests fast enough. The benchmark was measuring the speed of NumPy's random number generator, not the speed of the NVMe drive.

**The Fix:**
We replaced dynamic generation with a **Static Noise Buffer**.
* **Mechanism:** At startup, the benchmark pre-allocates a 256MB block of random noise in memory.
* **Zero-Copy Slicing:** When a request needs 10MB of data, instead of generating 10MB of new numbers, the system simply takes a "slice" (a view) of the pre-existing buffer.
* **Result:** Data generation is now effectively instant (zero CPU cost). This ensures that 100% of the latency measured is due to the storage subsystem, providing a true test of hardware performance.

```python
class KVCacheGenerator:
    def __init__(self, model_config: ModelConfig, global_seed: Optional[int] = None):
        # ... (dtype selection and seeded RNG setup omitted; `rng` is the seeded
        #      NumPy generator and `self.dtype` is the cache element type) ...

        # Pre-allocate a large buffer of random noise
        # (128M elements, e.g. ~256MB at 2 bytes per element for FP16)
        self.buffer_size_elements = 128 * 1024 * 1024
        self.precomputed_buffer = rng.uniform(-1.0, 1.0, size=self.buffer_size_elements).astype(self.dtype)

    def generate(self, sequence_length: int, key: Optional[str] = None) -> np.ndarray:
        # ... (shape calculation omitted: total_elements, kv_shape, and the
        #      start_idx offset are derived from sequence_length and key) ...

        # Zero-Copy Slicing: take a view of the pre-existing buffer
        if total_elements <= self.buffer_size_elements:
            flat_view = self.precomputed_buffer[start_idx : start_idx + total_elements]
            return flat_view.reshape(kv_shape)
```
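
As a quick, standalone check of the zero-copy claim: a NumPy slice (and a contiguous reshape of it) shares memory with the preallocated buffer instead of copying it. The snippet below is independent of the benchmark code.

```python
import numpy as np

# Standalone illustration of the zero-copy slicing idea (not benchmark code).
rng = np.random.default_rng(42)
buffer = rng.uniform(-1.0, 1.0, size=1_000_000).astype(np.float16)  # small stand-in buffer

view = buffer[1000:1000 + 4096].reshape(2, 32, 64)   # slice + contiguous reshape stays a view
print(np.shares_memory(buffer, view))                # True  -> no data was copied
print(view.base is not None)                         # True  -> it's a view, not an owner
```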

## 3. Concurrency Hardening

Implementing the Waterfall strategy introduced complex race conditions, where multiple threads might try to evict the same item or claim the same free space simultaneously.
* **Atomic Reservations:** We implemented a "check-and-reserve" logic inside the memory locks. A thread now claims space *before* it starts writing, preventing over-subscription.
* **Loop Protection:** We added hard caps to the eviction loops. In a pathological case where the system is thrashing, the eviction logic will now abort rather than spinning infinitely, preventing the benchmark from hanging.

```python
# Inside _ensure_space_in_tier (target_usage is the tier's capacity budget,
# computed earlier in the function)
with self.memory_lock:
    current_usage = self._get_tier_usage(tier)
    # Check if we have space
    if current_usage + required_bytes <= target_usage:
        # ATOMIC RESERVATION: Claim the space immediately inside the lock.
        # This prevents other threads from seeing this space as free.
        self._update_tier_usage(tier, required_bytes)
        return True
```
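
The loop protection described above can be sketched as a bounded eviction loop; the cap value and function shape below are illustrative, not the script's exact code.

```python
# Illustrative pattern for the "hard cap" on eviction loops (not the script's exact code).
MAX_EVICTION_ATTEMPTS = 64  # hypothetical cap

def make_room(tier_usage: int, required: int, budget: int, evict_one) -> bool:
    """Evict until the reservation fits, but never spin more than the cap allows."""
    attempts = 0
    while tier_usage + required > budget:
        if attempts >= MAX_EVICTION_ATTEMPTS:
            return False          # abort instead of hanging when the system is thrashing
        freed = evict_one()       # caller-supplied: demote one LRU entry, return bytes freed
        if freed <= 0:
            return False          # nothing left to evict
        tier_usage -= freed
        attempts += 1
    return True
```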

## 4. Enhanced Metrics: NVMe Token Throughput

To align with MLPerf requirements, we added a specific counter for `nvme_tokens_processed`.
* **Why:** Previously, we tracked raw bytes. However, MLPerf metrics are often in "Tokens per Second."
* **How:** The system now tracks the exact number of tokens associated with every read, write, and demotion operation that touches the NVMe drive. This allows us to report a precise "Storage Throughput (tok/s)" metric that accounts for the massive read amplification inherent in LLM inference.
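
A minimal sketch of how such a counter can feed a tokens-per-second figure is shown below; the class and method names are illustrative rather than the script's actual identifiers.

```python
import threading
import time

# Illustrative sketch of an NVMe token counter (names are hypothetical).
class NVMeTokenCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self.nvme_tokens_processed = 0
        self._start = time.monotonic()

    def record(self, tokens: int) -> None:
        """Call on every read, write, or demotion that touches the NVMe tier."""
        with self._lock:
            self.nvme_tokens_processed += tokens

    def storage_throughput_tok_s(self) -> float:
        elapsed = max(time.monotonic() - self._start, 1e-9)
        return self.nvme_tokens_processed / elapsed

counter = NVMeTokenCounter()
counter.record(4096)   # e.g., a demotion of a 4,096-token entry to NVMe
print(f"{counter.storage_throughput_tok_s():.1f} tok/s")
```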
