feat: Replace legacy spillover logic with Waterfall LRU architecture
This is a major architectural upgrade to the core benchmark logic, replacing
the original "Spillover" memory management strategy with the new "Waterfall
LRU" implementation to accurately simulate enterprise storage hierarchies.
Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe).
New data now correctly lands in the fastest available tier, pushing cold
data down, rather than the old behavior where new data skipped directly
to NVMe if RAM was full.
- Static Buffer Optimization: Replaced the CPU-bound np.random generation
with a pre-allocated static noise buffer. This removes the CPU bottleneck
that was masking true storage latency, allowing us to fully saturate
high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits
(max_concurrent_allocs) and atomic memory reservations to prevent OOM
crashes under heavy load.
- Storage Metrics: Added explicit tracking for nvme_tokens_processed to
calculate true storage throughput separate from system throughput.
- Stress Test Validation: Verified that this new architecture correctly
exposes storage latency limits (e.g., pushing P95 write latency >1000ms)
where the old script artificially throttled the load.
**File changed:** `kv_cache_benchmark/MLperf v3 KV cache proposal.md` (205 additions, 8 deletions)
@@ -529,6 +529,75 @@ The benchmark copies that pattern with three simple pieces:
In the summary you will see both numbers. A high reuse count with few hits simply says the prompt was detected but the stored copy had already been evicted, just like what operators watch for in production.
### J. ShareGPT Replay: Realistic Workload Simulation
While synthetic workloads (using random token counts within a range) are excellent for controlled stress testing, they may not fully capture the nuances of human-AI interaction. The **ShareGPT Replay** feature addresses this by loading real conversation trees from the ShareGPT dataset.
**How it works:**
1. **Ingestion:** The `ShareGPTDatasetLoader` parses a JSON dataset of real conversations (a minimal loader sketch follows this list). It uses a tokenizer to calculate the exact `context_tokens` (user prompt) and `generate_tokens` (model response) for every turn.
2. **Replay:** Instead of generating random requests, the benchmark feeds these real token counts into the `InferenceRequest` queue.
3. **Structure Preservation:** Crucially, it preserves the multi-turn structure of the data. Request 2 is guaranteed to be a follow-up to Request 1, testing the `MultiTierCache`'s ability to handle real conversational locality.
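The ingestion step can be pictured with a short sketch. This is an illustration only, not the actual `ShareGPTDatasetLoader` code: the function name `load_sharegpt_turns`, the `tokenize` callable, and the `conversations`/`value` field names are assumptions based on the common ShareGPT JSON layout.

```python
# Minimal sketch of ShareGPT ingestion (illustrative only, not the real loader).
# Assumes the common ShareGPT layout: a list of conversations, each holding a
# "conversations" list of alternating human/assistant turns under a "value" key.
import json

def load_sharegpt_turns(path, tokenize):
    """Yield (context_tokens, generate_tokens) per turn, preserving conversation order."""
    with open(path) as f:
        dataset = json.load(f)
    for conversation in dataset:
        turns = conversation.get("conversations", [])
        # Pair each user prompt with the model response that follows it.
        for user_turn, model_turn in zip(turns[0::2], turns[1::2]):
            context_tokens = len(tokenize(user_turn["value"]))
            generate_tokens = len(tokenize(model_turn["value"]))
            yield context_tokens, generate_tokens
```

Because turns are yielded in conversation order, request N+1 in the replay queue remains a follow-up to request N, which is exactly what exercises conversational locality in the cache.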
**Case Study: Analyzing ShareGPT Results**
Running a replay with the `llama3.1-70b-instruct` model on a memory-constrained system (2GB CPU RAM) reveals bottlenecks often hidden by uniform random distributions.
* **High Cache Hit Rate (97.2%):** Real conversations exhibit high locality. Users ask follow-up questions, allowing the system to reuse the KV cache effectively.
* **NVMe Read Latency Spikes (291 ms P95):** Unlike synthetic tests, which cluster around a mean, real user inputs vary wildly. A single request with a 16k-token context can saturate the read bandwidth, pushing the P95 latency above the 200 ms target and resulting in a "FAIL" assessment for storage even when throughput is high.
**Sample Output Summary:**
```text
### STORAGE PERFORMANCE ASSESSMENT: FAIL ✗ ###
Criteria Passed: 3/4
✓ NVMe Write P95 < 500ms: 54.50ms
✗ NVMe Read P95 < 200ms: 291.11ms (Target: 200ms)
✓ Cache Hit Rate > 30%: 97.2%

### CACHE TIER DISTRIBUTION ###
GPU Entries: 0 (0.00 GB)
CPU Entries: 156 (1.60 GB)
NVMe Entries: 1772 (92% of cache on slow storage)
```
### K. The Importance of Realism: A Comparative Case Study
To illustrate why workload realism matters, we compared two runs of the benchmark on identical hardware (50 users, 70B model, NVMe-only cache).
**Run A: Real Workload (ShareGPT)**
This run uses the actual conversation data, reflecting human usage patterns.

**Run B: Synthetic Workload (no dataset)**

This run omits the dataset, causing the benchmark to fall back to generating random, full-length contexts. This represents a "worst-case" scenario (e.g., massive document processing) rather than a chat workload.

| Metric | Run A (ShareGPT) | Run B (Synthetic) | Assessment |
|---|---|---|---|
| **End-to-End P50** | 93 ms | 121,158 ms (~2 min) | **System Collapse** |
**Key Findings:**
1. **Context Size Explosion:** Real human queries are concise (avg 133 tokens). The synthetic generator, aiming for coverage, produced contexts averaging 2,676 tokens. This forced the storage system to read/write **20x more data per request** in the synthetic run.
2. **System Collapse:** In the synthetic run, the P50 end-to-end latency ballooned to **2 minutes**, while the storage latency was only ~4 seconds. This indicates the system was in a state of **thrashing**, where requests spent roughly 95% of their time waiting in the queue because the storage was saturated handling massive files (see the sanity check after this list).
3. **Cache Efficiency:** Real conversations have high locality (85.9% multi-turn hit rate) because users ask follow-up questions. The synthetic run had a much lower hit rate (60.1%), further stressing the storage.
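The thrashing claim in finding 2 can be checked with simple arithmetic, using the latencies quoted above (the ~4 s storage figure is approximate):

```python
# Rough sanity check: what fraction of end-to-end time was spent queueing?
e2e_p50_ms = 121_158   # synthetic run, end-to-end P50 from the comparison above
storage_ms = 4_000     # approximate storage-side latency (~4 s)

queue_fraction = (e2e_p50_ms - storage_ms) / e2e_p50_ms
print(f"Time spent waiting in queue: {queue_fraction:.0%}")  # ~97%, i.e. almost all of it
```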
**Conclusion:** Run A represents a realistic chatbot application, where the NVMe drive is nearly sufficient. Run B represents a worst-case scenario, showing that for such heavy workloads the current hardware configuration is inadequate.
---
## 6. Current Work: Validating Simulation Accuracy with vLLM
@@ -644,16 +713,16 @@ Two primary scenarios should be submitted to give a comprehensive view of storag
#### Standard Submission: `llama3.1-8b`
-This workload provides a baseline for storage performance under typical conditions. A fixed seed is required to ensure the workload is identical for all submissions, enabling fair and reproducible comparisons.
+This workload provides a baseline for storage performance under typical conditions. **Note:** We set `cpu-mem-gb 0` to disable the caching tier entirely, forcing every token to hit the NVMe drive. This ensures the benchmark measures the storage hardware, not the OS file cache.
```bash
# MLPerf v3.0 Recommended Invocation: Storage Saturation Test (8B Model)
-python3 kv-cache.py \
+python3 kv-cache-waterfall-lru.py \
  --model llama3.1-8b \
  --num-users 150 \
  --duration 600 \
  --gpu-mem-gb 0 \
-  --cpu-mem-gb 2 \
+  --cpu-mem-gb 0 \
  --generation-mode realistic \
  --performance-profile throughput \
  --seed 42 \
```
@@ -666,18 +735,21 @@ This workload tests the storage's ability to handle a much heavier load, as the
```bash
# MLPerf v3.0 Recommended Invocation: Storage Saturation Test (70B Model)
-python3 kv-cache.py \
+python3 kv-cache-waterfall-lru.py \
  --model llama3.1-70b-instruct \
  --num-users 40 \
  --duration 600 \
  --gpu-mem-gb 0 \
-  --cpu-mem-gb 4 \
+  --cpu-mem-gb 0 \
  --generation-mode realistic \
  --performance-profile throughput \
  --seed 42 \
  --output mlperf_v3_storage_submission_70b.json
```
**Why `cpu-mem-gb 0`?**

In previous versions, a small CPU budget (e.g., 2GB) was allowed. However, analysis showed that operating system file caching (Page Cache) could absorb write bursts within this budget, artificially lowering latency metrics. Setting both GPU and CPU memory to 0 forces the "Waterfall" logic to bypass all caching layers and write directly to the NVMe backend, providing the most rigorous and honest assessment of storage I/O performance.
**Key Parameters Explained:**
* `--num-users 150`: A high, fixed user count is used to ensure the storage device is placed under significant and continuous load.
* `--duration 600`: A 10-minute duration ensures the benchmark reaches a stable, steady-state performance level, which is a standard requirement for MLPerf results.
@@ -872,7 +944,7 @@ python3 kv-cache.py \
  --num-users 50 \
  --duration 180 \
  --gpu-mem-gb 0 \
-  --cpu-mem-gb 0.5 \
+  --cpu-mem-gb 0 \
  --generation-mode realistic \
  --cache-dir /mnt/nvme \
  --seed 42 \
@@ -925,7 +997,7 @@ python3 kv-cache.py \
  --num-users 10 \
  --duration 180 \
  --gpu-mem-gb 0 \
-  --cpu-mem-gb 32 \
+  --cpu-mem-gb 0 \
  --enable-autoscaling \
  --autoscaler-mode capacity \
  --generation-mode none \
@@ -1004,4 +1076,129 @@ python3 kv-cache.py \
  --cache-dir /mnt/nvme \
  --seed 42 \
  --output results_max_stress.json
```
### Test 9: ShareGPT Workload Replay
**Purpose:** Validates system performance against a trace of real-world human-AI conversations. This is the closest approximation to running a production service. It uses the dedicated replay script [`kv-cache_sharegpt_replay.py`](kv-cache_sharegpt_replay.py).
# CHANGES-12-05-2025: The "Waterfall" Architecture & Optimization
**Date:** December 5, 2025
**Subject:** Major architectural upgrade to `kv-cache-waterfall-lru.py`.
This update introduces a fundamental shift in how the benchmark manages memory, moving from a simple "Spillover" model to a sophisticated "Waterfall" eviction strategy. It also addresses a critical CPU bottleneck that was masking true storage performance.
## 1. Architectural Shift: From Spillover to Waterfall
The original benchmark used a **Spillover** strategy. When the GPU was full, new data was forced directly into the CPU (and then NVMe).
* **The Problem:** New data is often the "hottest" (most likely to be read again soon). By forcing it to the slowest tier, we were penalizing active conversations. Meanwhile, old, cold data sat comfortably in the GPU, wasting valuable VRAM.
* **The Solution (Waterfall):** The new implementation enforces a strict hierarchy. New data **always** targets the fastest tier (GPU).
  * If the GPU is full, the system identifies the **Least Recently Used (LRU)** item in the GPU and moves it to the CPU to make room.
  * If the CPU is full, it moves the CPU's LRU item to NVMe.
* **Result:** The hottest data stays fast. Only truly cold data "falls" down the waterfall to storage. This mimics the behavior of production-grade caching systems like Redis or vLLM. A minimal sketch of the cascade is shown below.
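The cascade can be sketched in a few lines. The tier names, toy capacities, and helper names (`TIERS`, `tier_lru`, `insert`, `touch`) are illustrative assumptions; the real logic in `kv-cache-waterfall-lru.py` also tracks byte sizes, locks, and eviction-loop caps.

```python
# Minimal sketch of waterfall (cascading LRU) eviction across GPU -> CPU -> NVMe.
from collections import OrderedDict

TIERS = ["gpu", "cpu", "nvme"]                    # fastest to slowest
capacity = {"gpu": 4, "cpu": 8, "nvme": 10**9}    # entries per tier (toy numbers)
tier_lru = {t: OrderedDict() for t in TIERS}      # OrderedDict keeps LRU order per tier

def insert(key, value, tier_idx=0):
    """New data always targets the fastest tier; cold data cascades downward."""
    tier = TIERS[tier_idx]
    while len(tier_lru[tier]) >= capacity[tier]:
        # Demote this tier's least recently used entry one level down.
        victim_key, victim_val = tier_lru[tier].popitem(last=False)
        insert(victim_key, victim_val, tier_idx + 1)
    tier_lru[tier][key] = value                   # most recently used goes to the tail

def touch(key):
    """On a hit, mark the entry as most recently used within its tier."""
    for tier in TIERS:
        if key in tier_lru[tier]:
            tier_lru[tier].move_to_end(key)
            return tier
    return None
```

An `OrderedDict` per tier gives LRU order for free: hits call `move_to_end`, and `popitem(last=False)` returns the coldest entry to demote.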
## 2. Removing the CPU Bottleneck: Static Noise Buffers
**The Issue:**
Profiling the original script revealed that `np.random.uniform` (the function used to generate the dummy KV cache data) was consuming massive amounts of CPU time.

* **Impact:** The CPU was spending so much time generating random numbers that it couldn't issue storage I/O requests fast enough. The benchmark was measuring the speed of Python's random number generator, not the speed of the NVMe drive.
**The Fix:**
We replaced dynamic generation with a **Static Noise Buffer**.

* **Mechanism:** At startup, the benchmark pre-allocates a 256MB block of random noise in memory.
* **Zero-Copy Slicing:** When a request needs 10MB of data, instead of generating 10MB of new numbers, the system simply takes a "slice" (a view) of the pre-existing buffer.
* **Result:** Data generation is now effectively instant (zero CPU cost). This ensures that 100% of the latency measured is due to the storage subsystem, providing a true test of hardware performance. A sketch of the buffer-and-slice approach is shown below.
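A sketch of the buffer-and-slice idea, assuming NumPy and a hypothetical `get_payload` helper; the 256 MB size matches the description above, but the exact implementation details are illustrative.

```python
# Sketch of a static noise buffer with zero-copy slicing (illustrative names).
import numpy as np

# Built once at startup: ~256 MB of float64 noise (32M elements * 8 bytes each).
NOISE_BUFFER = np.random.default_rng(42).random(32 * 1024 * 1024)

def get_payload(num_bytes: int) -> np.ndarray:
    """Return a view into the pre-generated buffer instead of fresh random data."""
    num_elems = num_bytes // NOISE_BUFFER.itemsize
    start = np.random.randint(0, max(1, len(NOISE_BUFFER) - num_elems))
    return NOISE_BUFFER[start : start + num_elems]   # slicing yields a view, no copy
```

Because NumPy slicing returns a view rather than a copy, handing out a payload costs essentially nothing, regardless of its size.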
## 3. Concurrency Hardening

Implementing the Waterfall strategy introduced complex race conditions, where multiple threads might try to evict the same item or claim the same free space simultaneously.
* **Atomic Reservations:** We implemented a "check-and-reserve" logic inside the memory locks. A thread now claims space *before* it starts writing, preventing over-subscription.
* **Loop Protection:** We added hard caps to the eviction loops. In a pathological case where the system is thrashing, the eviction logic will now abort rather than spinning infinitely, preventing the benchmark from hanging (a sketch of such a cap follows the reservation snippet below).
```python
# Inside _ensure_space_in_tier
with self.memory_lock:
    current_usage = self._get_tier_usage(tier)
    # Check if we have space
    if current_usage + required_bytes <= target_usage:
        # ATOMIC RESERVATION: Claim the space immediately inside the lock.
        # This prevents other threads from seeing this space as free.
        self._update_tier_usage(tier, required_bytes)
        return True
```
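The loop-protection cap mentioned above might look roughly like the following; the constant name, limit, and helper callables are assumptions, not the benchmark's exact code.

```python
# Illustrative eviction loop with a hard attempt cap (names and limit are assumed).
MAX_EVICTION_ATTEMPTS = 1000   # abort instead of spinning forever when thrashing

def make_room(tier, required_bytes, has_space, evict_one):
    """Evict LRU entries until the tier can hold required_bytes, bounded by a hard cap."""
    for _ in range(MAX_EVICTION_ATTEMPTS):
        if has_space(tier, required_bytes):
            return True
        if not evict_one(tier):    # nothing left to evict in this tier
            return False
    return False                   # cap reached: give up rather than hang the benchmark
```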
## 4. Enhanced Metrics: NVMe Token Throughput
To align with MLPerf requirements, we added a specific counter for `nvme_tokens_processed`.
* **Why:** Previously, we tracked raw bytes. However, MLPerf metrics are often in "Tokens per Second."
* **How:** The system now tracks the exact number of tokens associated with every read, write, and demotion operation that touches the NVMe drive. This allows us to report a precise "Storage Throughput (tok/s)" metric that accounts for the massive read amplification inherent in LLM inference. A minimal sketch of such a counter is shown below.
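A minimal sketch of how such a counter can feed a storage-only tokens/sec metric. The class and method names are illustrative; only the `nvme_tokens_processed` counter name comes from the change itself.

```python
# Sketch of a storage-side token throughput metric (illustrative structure).
import threading
import time

class StorageMetrics:
    def __init__(self):
        self.nvme_tokens_processed = 0
        self._lock = threading.Lock()
        self._start = time.monotonic()

    def record_nvme_op(self, num_tokens: int) -> None:
        """Count tokens for every read, write, or demotion that touches NVMe."""
        with self._lock:
            self.nvme_tokens_processed += num_tokens

    def storage_throughput_tok_s(self) -> float:
        elapsed = time.monotonic() - self._start
        return self.nvme_tokens_processed / elapsed if elapsed > 0 else 0.0
```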