Hi,
I noticed, while reading the source code and using it to evaluate models, that some ground truths are missed when there are no detections, which results in a higher Average Precision than warranted.
I don't have an estimate of how much this impacts scoring on real predictions (although I suspect it would boost hard-to-get, rare classes), but a targeted submission could exploit it by making a few high-confidence predictions on a class and then not predicting it on other images, artificially boosting AP, precision, recall, and thus mAP.
It happens in two cases:
- Whenever a detector doesn't predict anything on a tile
- Whenever a detector fails to predict a class present in the image
The issue is that when the Matching object is provided with an empty detection list for a class (which happens in both cases), the greedy_match() method returns two empty lists (https://github.com/DIUx-xView/baseline/blob/master/scoring/matching.py#L93-L96), and the number of ground truths for each class is not incremented, because that count relies on the output of greedy_match() (https://github.com/DIUx-xView/baseline/blob/master/scoring/score.py#L239-L244). This leads to ignoring false negatives, and thus inflates recall (and boosts AP in the process) for every class present in the ground truth of the image where it happens.
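To illustrate, here is a minimal, self-contained sketch of the pattern as I understand it. The real greedy_match() does IoU-based matching and the scoring loop carries more state; the function below and the toy data are my own simplification, not the repo's code:

```python
def greedy_match(det_rects, gt_rects):
    """Simplified stand-in for Matching.greedy_match(): one match
    flag per detection and one per ground truth."""
    if not det_rects or not gt_rects:
        # Problematic path: gt_rects may be non-empty here, yet the
        # ground-truth flags are dropped along with the detections.
        return [], []
    # The real code matches greedily by IoU; pairing boxes in order
    # is enough to demonstrate the counting bug.
    matched = min(len(det_rects), len(gt_rects))
    rects_matched = [i < matched for i in range(len(det_rects))]
    gt_matched = [i < matched for i in range(len(gt_rects))]
    return rects_matched, gt_matched

# Toy scoring pass for one class over two images, mirroring the
# counting logic around score.py#L239-L244.
images = [
    {"dets": [(0, 0, 10, 10)], "gts": [(0, 0, 10, 10)]},      # one TP
    {"dets": [], "gts": [(5, 5, 20, 20), (30, 30, 40, 40)]},  # two FNs
]

num_gt = 0
for img in images:
    _, gt_matched = greedy_match(img["dets"], img["gts"])
    num_gt += len(gt_matched)  # mirrors num_gt_per_cls[i] += len(gt_matched)

print(num_gt)  # 1, not 3: the two false negatives disappear from
               # the recall denominator
```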
Changing greedy_match() to always return gt_rects_matched, or changing https://github.com/DIUx-xView/baseline/blob/master/scoring/score.py#L244 from `num_gt_per_cls[i] += len(gt_matched)` to `num_gt_per_cls[i] += len(gt_rects)`, should fix the issue, from my understanding.
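For the first option, a hedged sketch (assuming, as len(gt_matched) on score.py#L244 suggests, that gt_rects_matched is a per-ground-truth flag list) could look like:

```python
def greedy_match_fixed(det_rects, gt_rects):
    """Variant of the sketch above that always returns one flag per
    ground truth, so num_gt_per_cls[i] += len(gt_matched) also
    counts the misses."""
    if not det_rects or not gt_rects:
        # No matches are possible, but keep one (unmatched) flag per
        # ground truth instead of dropping them.
        return [], [False] * len(gt_rects)
    matched = min(len(det_rects), len(gt_rects))
    return ([i < matched for i in range(len(det_rects))],
            [i < matched for i in range(len(gt_rects))])
```

Re-running the toy loop above with this variant counts all 3 ground truths. The one-line change on score.py#L244 gives the same count and may be less invasive.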
I don't know whether this behaviour is intentional, but I wanted to share my concerns about a potential exploit allowing artificially high-scoring submissions.