An automated benchmarking framework that iteratively identifies and runs new experiments to narrow down the root cause of performance differences between the C++ and Python transceiver runtimes in disaggregated KV cache serving.
Given a black-box simulator (simulator.py), the framework analyzes the results of each benchmark round, determines which configuration knobs to vary next, and converges on the key factors driving the performance gap.
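
The analyze, select, and converge loop described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: the interface of simulator.py is not specified here, so `toy_simulator` is a hypothetical stand-in, and the knob names (`num_blocks`, `batch_size`), the `runtime` key, and the elimination threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, object]
Simulator = Callable[[Config], float]


def toy_simulator(config: Config) -> float:
    """Hypothetical stand-in for simulator.py: returns a latency in ms."""
    latency = 10.0
    # Assumption for the toy model: only the Python runtime pays a
    # per-block overhead, so num_blocks drives the gap.
    if config["runtime"] == "python":
        latency += 0.5 * config["num_blocks"]
    # batch_size costs both runtimes the same, so it cannot explain the gap.
    latency += 0.01 * config["batch_size"]
    return latency


def gap(simulate: Simulator, config: Config) -> float:
    """Python latency minus C++ latency for an otherwise identical config."""
    cpp = simulate({**config, "runtime": "cpp"})
    py = simulate({**config, "runtime": "python"})
    return py - cpp


def rank_knobs(
    simulate: Simulator,
    baseline: Config,
    knob_values: Dict[str, Sequence[object]],
) -> List[Tuple[str, float]]:
    """Vary one knob at a time and rank knobs by how much they move the gap."""
    base_gap = gap(simulate, baseline)
    effects = {
        knob: max(abs(gap(simulate, {**baseline, knob: v}) - base_gap) for v in values)
        for knob, values in knob_values.items()
    }
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=True)


def find_key_factors(
    simulate: Simulator,
    baseline: Config,
    knob_values: Dict[str, Sequence[object]],
    max_rounds: int = 3,
    threshold: float = 1e-6,
) -> List[str]:
    """Run benchmark rounds, dropping knobs that do not affect the gap."""
    active = dict(knob_values)
    for _ in range(max_rounds):
        ranked = rank_knobs(simulate, baseline, active)
        keep = {knob for knob, effect in ranked if effect > threshold}
        if keep == set(active):
            break  # nothing was eliminated this round: treat as converged
        active = {k: v for k, v in active.items() if k in keep}
        if not active:
            break
    return sorted(active)


baseline = {"num_blocks": 8, "batch_size": 4}
knob_values = {"num_blocks": [1, 64], "batch_size": [1, 128]}
key_factors = find_key_factors(toy_simulator, baseline, knob_values)
```

In this toy setup `key_factors` ends up containing only `num_blocks`, because `batch_size` shifts both runtimes equally and is eliminated in the first round. A real run would replace `toy_simulator` with a call into the actual simulator and would likely need repeated measurements and a noise-aware threshold rather than a fixed cutoff.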