Hello!
Creating this issue to discuss what could be done to reproduce what I did in the benchmark blog post.
Currently, all benchmarks in this repository work by running a single K6 load test with a given scenario. Unfortunately, that's not enough to generate the latency vs. throughput graphs I produced: multiple benchmark runs are needed, each with a fixed arrival rate. The tricky part is that each gateway obviously has a different maximum throughput, so automating this requires a loop that does something like the following:
```python
# Pseudo-code
# Warm up for a few seconds
k6_run_fixed_arrival_rate(100)

# Actual benchmark
arrival_rate = 100
step = 100
while True:
    results = k6_run_fixed_arrival_rate(arrival_rate)
    measured_rate = results['metrics']['iterations']['values']['rate']
    if abs(measured_rate - arrival_rate) < 3:
        # The gateway kept up with the requested rate: probe higher.
        arrival_rate += step
    else:
        # The gateway fell behind: halve the step and back off.
        step /= 2
        if step <= 5:
            break
        arrival_rate -= step
```
That's roughly what I did. Manually 😢. The more difficult part is that I adjusted `preAllocatedVUs` over the benchmark runs, but it only matters when close to being CPU-bound. So for an automated benchmark, I would just set it high enough for all gateways and leave it at that. High enough means that the throughput actually measured by K6 should never fall short of the fixed arrival rate as long as the gateway isn't CPU-bound (max CPU < CPU_LIMIT).
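For illustration, here is a rough sketch of what `k6_run_fixed_arrival_rate` could look like, assuming a K6 script that uses the `constant-arrival-rate` executor and reads its rate and `preAllocatedVUs` from environment variables. The script name, environment variable names, and summary path are made up here, and depending on how you export the summary, the metrics layout may be flat (`metrics.iterations.rate`) rather than nested as in the pseudo-code above:

```python
import json
import subprocess

def k6_run_fixed_arrival_rate(arrival_rate: int, pre_allocated_vus: int = 2000) -> dict:
    """Run one fixed-arrival-rate K6 test and return its JSON summary."""
    subprocess.run(
        [
            "k6", "run",
            "--env", f"ARRIVAL_RATE={arrival_rate}",
            # Set high enough that VU starvation never caps the measured rate.
            "--env", f"PRE_ALLOCATED_VUS={pre_allocated_vus}",
            "--summary-export", "summary.json",
            "benchmark.js",  # hypothetical scenario script
        ],
        check=True,
    )
    with open("summary.json") as f:
        return json.load(f)
```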
The last part could be done with some bash. What's particularly unclear to me is how you should generate the results table/graph. In my benchmarks, I wrote the results as JSON from `docker stats` and `k6` and processed them in a Python notebook, available here. I didn't clean it up, but if you're familiar with Python it shouldn't be too hard, I hope. :)
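To sketch what an automated version of that processing could look like: assuming one exported summary per (gateway, arrival rate) run, with metric paths shaped like the pseudo-code above (`http_req_duration` and its `p(95)` percentile are standard K6 metrics, but the exact layout depends on how the summary was exported), plotting latency vs. throughput could be as simple as:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

def plot_latency_vs_throughput(result_files: list[Path]) -> None:
    points = []
    for path in result_files:
        summary = json.loads(path.read_text())
        # Assumed summary layout; adjust to your actual export format.
        throughput = summary["metrics"]["iterations"]["values"]["rate"]
        latency_p95 = summary["metrics"]["http_req_duration"]["values"]["p(95)"]
        points.append((throughput, latency_p95))
    points.sort()
    xs, ys = zip(*points)
    plt.plot(xs, ys, marker="o")
    plt.xlabel("Throughput (requests/s)")
    plt.ylabel("p95 latency (ms)")
    plt.savefig("latency_vs_throughput.png")
```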
If you're considering adding this benchmark, I would also suggest adding a 10ms subgraph delay like I did in the end; it's more representative of a real-world case. In the current benchmarks it isn't noticeable (unless you add huge delays) because most of the latency comes from CPU contention in the first place: a few dozen ms of network delay can't meaningfully impact latencies on the order of seconds.
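As a sketch of one way to do that, assuming the benchmark's subgraphs can be swapped for a small stub you control (the real repository's subgraph implementation may well differ), the 10ms delay could be simulated in the subgraph itself:

```python
# Hypothetical mock subgraph that adds a fixed 10ms delay before answering.
import asyncio
from aiohttp import web

async def graphql_handler(request: web.Request) -> web.Response:
    await asyncio.sleep(0.010)  # simulated 10ms network/backend delay
    # A static canned response; a real subgraph would execute the query.
    return web.json_response({"data": {}})

app = web.Application()
app.add_routes([web.post("/graphql", graphql_handler)])

if __name__ == "__main__":
    web.run_app(app, port=4001)
```

Alternatively, `tc netem` can inject the delay at the network level without touching the subgraph at all.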
If I were you, I would also consider removing Grafana, Prometheus, and cAdvisor and instead only using `docker stats` or the Docker API to retrieve the resource consumption information. I would expect it to be the most precise measurement you can easily get, and it would avoid spending resources on the monitoring stack itself. This is not a small change, so well. 🤷
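For instance, a minimal sketch of polling `docker stats` from the benchmark driver, with no monitoring stack at all (the container name is a placeholder; field names follow the Go-template output of `docker stats`):

```python
import json
import subprocess

def sample_docker_stats(container: str) -> dict:
    """Take one resource-usage sample for a running container."""
    out = subprocess.run(
        ["docker", "stats", container, "--no-stream", "--format", "{{json .}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # e.g. {"CPUPerc": "12.34%", "MemUsage": "150MiB / 2GiB", ...}
    return json.loads(out)
```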
Regarding using the host network for Docker: I did see an impact for Grafbase even without any network delay. I don't remember whether it affected the others. So I would tend to recommend it, but I'm not sure whether this works on macOS/Windows.