Some of the benchmarks test functions that behave differently on the first run, because they cache data that subsequent calls then reuse.
We should split each benchmark that measures such a function into a cold benchmark (first call, cache empty) and a warm benchmark (cache already populated). Otherwise the results show a large spread between the max value (the cold run) and the min value, and the mean becomes less meaningful.
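
As a rough illustration of the split, here is a minimal Python sketch. It assumes the cache can be reset between runs; `load_table`, `bench_cold` and `bench_warm` are hypothetical names, and `functools.lru_cache` only stands in for whatever caching the real functions do.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def load_table(n):
    # Hypothetical stand-in for a function that caches its result on first call.
    return sum(i * i for i in range(n))


def time_call(func, *args):
    # Wall-clock duration of a single call, in seconds.
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def bench_cold(n, rounds=5):
    # Cold: reset the cache before every timed call, so each call pays the full cost.
    timings = []
    for _ in range(rounds):
        load_table.cache_clear()
        timings.append(time_call(load_table, n))
    return timings


def bench_warm(n, rounds=5):
    # Warm: populate the cache once (untimed), then time only cached calls.
    load_table.cache_clear()
    load_table(n)
    return [time_call(load_table, n) for _ in range(rounds)]
```

Reporting the two lists separately keeps the first-call cost out of the warm statistics, so the min, max and mean of each benchmark describe a single kind of run.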