Description
Due to the jittery nature of perf testing, we usually get some failures that are only outliers, yet they still cause compareperf to return a non-zero exit status. It'd be nice to define conditions under which we consider a comparison good. This would be very useful for build-reporting purposes as well as for bisection.
One way to judge would be to specify a percentage of acceptable failures (in total, per group, ...). Another would be to focus on reference builds and be more lenient towards builds that exhibit a lot of jitter, while still failing on a stable failure rate. Of course we can add multiple metrics to allow the operators to define better rules; a minimal sketch of the threshold-based approach follows.
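A minimal sketch of the percentage-based policy. The `(group, test, passed)` result format, the function name and the default thresholds are all made up for illustration; they are not compareperf's actual interface:

```python
import sys
from collections import defaultdict

def comparison_is_good(results, max_total_fail_pct=5.0, max_group_fail_pct=10.0):
    """Return True when the overall and per-group failure percentages
    stay within the configured budgets (hypothetical policy)."""
    total = failed = 0
    per_group = defaultdict(lambda: [0, 0])  # group -> [failed, total]
    for group, _test, passed in results:
        total += 1
        per_group[group][1] += 1
        if not passed:
            failed += 1
            per_group[group][0] += 1
    if total and 100.0 * failed / total > max_total_fail_pct:
        return False
    return all(100.0 * g_failed / g_total <= max_group_fail_pct
               for g_failed, g_total in per_group.values())

# Example: one outlier out of three results exceeds the 5% total budget,
# so the comparison is reported as bad (non-zero exit status).
results = [("memory", "stream", True),
           ("memory", "lat_mem_rd", False),
           ("network", "iperf", True)]
sys.exit(0 if comparison_is_good(results) else 1)
```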
Alternatively, we can just report some aggregated info and let the users process it afterwards, but IMO better handling on our side would still be useful.
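For the aggregation variant, something like the following summary (again, the format is just an illustration, not an existing output of compareperf) could be emitted for users to post-process with their own rules:

```python
import json

def aggregate(results):
    """Summarize per-group failure counts; the pass/fail decision is
    left to whoever consumes the summary."""
    summary = {"total": 0, "failed": 0, "groups": {}}
    for group, _test, passed in results:
        summary["total"] += 1
        grp = summary["groups"].setdefault(group, {"total": 0, "failed": 0})
        grp["total"] += 1
        if not passed:
            summary["failed"] += 1
            grp["failed"] += 1
    return summary

results = [("memory", "stream", True),
           ("memory", "lat_mem_rd", False),
           ("network", "iperf", True)]
print(json.dumps(aggregate(results), indent=2))
```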
Bonus: it'd be nice to allow ML-based identification and model support for deciding on the build status, but that is currently out of scope.