I found a discrepancy between the message size that the OSU benchmarks report and the size that coll/tuned uses to make tuning decisions: the OSU benchmarks report the size of the message each rank sends, while coll/tuned bases its allgatherv decision on the total amount of data to be received. Rules generated directly from the OSU sizes therefore do not match the sizes coll/tuned compares against, which leads to nonsensical rules and likely suboptimal decisions. This should be fixed in the Python scripts (when generating the decision file and ideally also when writing the best.out file).
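To illustrate the kind of conversion I have in mind, here is a minimal sketch. It assumes every rank contributes the same number of bytes (the uniform case the OSU benchmark measures), so the total received is the per-rank size times the communicator size. The function names and the `{size: algorithm}` dictionary layout are hypothetical and are not taken from the existing scripts.

```python
def total_recv_size(per_rank_size: int, comm_size: int) -> int:
    """Total bytes received in an allgatherv when every rank contributes
    per_rank_size bytes (assumption: uniform contribution per rank)."""
    return per_rank_size * comm_size


def rescale_results(results: dict, comm_size: int) -> dict:
    """Re-key {per-rank size: best algorithm} as {total size: best algorithm}.

    `results` is a hypothetical stand-in for whatever the scripts parse
    from the OSU output; the algorithm IDs are passed through unchanged.
    """
    return {total_recv_size(size, comm_size): alg
            for size, alg in results.items()}


if __name__ == "__main__":
    # Made-up example: best algorithm IDs measured on 16 ranks.
    osu_best = {1024: 3, 65536: 5}
    print(rescale_results(osu_best, comm_size=16))
```

The same scaling would have to be applied wherever a message size from the OSU output ends up as a threshold in the decision file.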