Improve the back-analysis of the revert logic, with the goal of
improving precision and recall and validating it as a viable strategy.
Checked against the workflows: pull, trunk, inductor,
linux-binary-manywheel
Old code:
```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected: 91
```
Newer code:
```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
pull: 45 reverts out of 411 patterns (10.9%)
trunk: 1 reverts out of 8 patterns (12.5%)
inductor: 2 reverts out of 20 patterns (10.0%)
linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```
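As a sanity check on the figures above, here is a minimal sketch that recomputes precision and recall from the reported counts. It assumes precision is "reverts inside detected patterns / patterns detected" and recall is "reverts inside detected patterns / total reverts"; the function name is hypothetical and not part of the tool. Under that definition, the 58.3% printed next to "recall" is the share of reverts that were missed, and the matched share is its complement.

```python
def precision_recall(patterns_detected, reverts_in_patterns, total_reverts):
    """Return (precision, recall) as fractions, under the assumed definitions."""
    precision = reverts_in_patterns / patterns_detected
    recall = reverts_in_patterns / total_reverts
    return precision, recall


# Counts copied from the two reports above.
old_precision, old_recall = precision_recall(188, 24, 115)
new_precision, new_recall = precision_recall(442, 48, 115)

# Share of reverts that match no detected pattern in the newer run.
new_miss_rate = 67 / 115

print(f"old: precision={old_precision:.1%} recall={old_recall:.1%}")
print(f"new: precision={new_precision:.1%} recall={new_recall:.1%}")
print(f"new: unmatched reverts={new_miss_rate:.1%}")
```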
Critical implemented changes:
* Look forward and back for the first commit that actually ran the
failed job, instead of assuming it is always the commit right before or
right after.
* Job names contain parts we don't care about, such as shard indices.
Since a failure can happen in any shard, we look for any shard with the
same failure.
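The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the job-name format (`"... (config, shard, num_shards, runner)"`), the commit data layout, and the helper names `normalize_job_name` and `first_commit_with_job` are all hypothetical.

```python
import re

# Assumed shard format: ", <shard>, <num_shards>" inside the parentheses.
_SHARD_RE = re.compile(r",\s*\d+,\s*\d+\s*(?=[,)])")


def normalize_job_name(name):
    """Drop the shard index and shard count so all shards share one key."""
    return _SHARD_RE.sub("", name)


def first_commit_with_job(commits, start, step, job_key):
    """Walk from `start` in direction `step` (+1 or -1) until a commit that
    ran any shard of `job_key` is found.

    Returns (index, set of failure signatures), or None if no commit ran it.
    """
    i = start + step
    while 0 <= i < len(commits):
        results = [
            failure
            for name, failure in commits[i]["jobs"].items()
            if normalize_job_name(name) == job_key
        ]
        if results:  # the job ran here, on at least one shard
            return i, {f for f in results if f is not None}
        i += step
    return None


# Toy history, oldest first. A value of None means the job passed.
commits = [
    {"sha": "a1", "jobs": {"build / test (default, 1, 2)": None,
                           "build / test (default, 2, 2)": None}},
    {"sha": "b2", "jobs": {}},  # job did not run on this commit
    {"sha": "c3", "jobs": {"build / test (default, 2, 2)": "AssertionError"}},
    {"sha": "d4", "jobs": {}},  # job did not run on this commit
    {"sha": "e5", "jobs": {"build / test (default, 1, 2)": "AssertionError"}},
]

key = normalize_job_name("build / test (default, 2, 2)")
before = first_commit_with_job(commits, 2, -1, key)
after = first_commit_with_job(commits, 2, +1, key)
```

In this toy history, the search skips `b2` and `d4` because the job never ran there, and the failure on `e5` is matched even though it occurred in a different shard than on `c3`.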
Things I tried that didn't lead to good results:
* ignoring error classification - precision too low, no significant
increase in recall
* not requiring error repetition - precision too low, no significant
increase in recall
My take:
With a precision of about 10%, the cost of re-running jobs to confirm
redness status is justified. Although it cannot be tested offline, I
suspect that requiring the same output 2 times for all 3 signals should
raise the precision considerably.
Unfortunately, the only way to test this is to run it in shadow mode.
With a recall of 55%, the approach appears able to capture **most** of
the redness errors introduced on trunk. Many reverts may not be caused
by CI redness, especially not in the workflows we are analyzing (they
could be due to performance degradation, GHF/internal reasons, and many
others). This number seems sufficient to provide a substantial gain in
CI quality.