Run tests multiple times, and collect relevant stats

Currently, we define `error_margin` as a fixed percentage. We don't know how stable the tests are, e.g. what standard deviation to expect between runs.

It would be more stable (and informative) to run the tests multiple times, and collect relevant stats like min/max/median/stdev.
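
A minimal sketch of what this could look like, assuming the tests can be wrapped in a zero-argument callable and that wall-clock time is the quantity being measured. The names `collect_stats`, `summarize`, and `runs` are hypothetical and not taken from the existing code:

```python
import statistics
import time
from typing import Callable, Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return min/max/median/stdev for a list of measurements."""
    if len(samples) < 2:
        # statistics.stdev needs at least two data points.
        raise ValueError("need at least 2 samples to compute a standard deviation")
    return {
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }


def collect_stats(test: Callable[[], None], runs: int = 10) -> Dict[str, float]:
    """Run `test` `runs` times and summarize the wall-clock timings in seconds.

    Hypothetical helper: `test` stands in for whatever the suite actually runs.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        test()
        timings.append(time.perf_counter() - start)
    return summarize(timings)
```

With numbers like these in hand, the observed stdev could be compared against the current `error_margin` to see whether the fixed percentage is too tight or too loose.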