Potential wrong computation of the precision #44

@rpautrat

🐛 Bug

It seems that the evaluation pipeline computes an incorrect precision. When the evaluated detector never predicts a category C (e.g. a closed-set detector with fewer classes than LVIS), the current implementation computes a precision of 0 for that category, while it should ignore it (no prediction = no precision).

To Reproduce

lvis-api/lvis/eval.py, lines 370 to 405 at 7d7f07d:

                for iou_thr_idx, (tp, fp) in enumerate(zip(tp_sum, fp_sum)):
                    tp = np.array(tp)
                    fp = np.array(fp)
                    num_tp = len(tp)
                    rc = tp / num_gt
                    if num_tp:
                        recall[iou_thr_idx, cat_idx, area_idx] = rc[
                            -1
                        ]
                    else:
                        recall[iou_thr_idx, cat_idx, area_idx] = 0

                    # np.spacing(1) ~= eps
                    pr = tp / (fp + tp + np.spacing(1))
                    pr = pr.tolist()

                    # Replace each precision value with the maximum precision
                    # value to the right of that recall level. This ensures
                    # that the calculated AP value will be less suspectable
                    # to small variations in the ranking.
                    for i in range(num_tp - 1, 0, -1):
                        if pr[i] > pr[i - 1]:
                            pr[i - 1] = pr[i]

                    rec_thrs_insert_idx = np.searchsorted(
                        rc, self.params.rec_thrs, side="left"
                    )

                    pr_at_recall = [0.0] * num_recalls

                    try:
                        for _idx, pr_idx in enumerate(rec_thrs_insert_idx):
                            pr_at_recall[_idx] = pr[pr_idx]
                    except:
                        pass
                    precision[iou_thr_idx, :, cat_idx, area_idx] = np.array(pr_at_recall)

This is the code that computes the precision at all recall thresholds. When the current category (with index cat_idx) has never been predicted in the dataset, tp_sum and fp_sum will have a shape of [num_IoU_thresholds, 0], because their second dimension is the number of predictions for the current category over the whole dataset.

In this scenario, the lookup inside the try/except block (pr_at_recall[_idx] = pr[pr_idx]) will fail, because pr is empty. The precision for that category is therefore left at the default of 0 defined by pr_at_recall = [0.0] * num_recalls.
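
To make the failure mode concrete, here is a small standalone sketch (plain numpy only, with made-up num_gt and recall thresholds rather than the actual lvis-api objects) of what happens when a category has zero predictions:

    import numpy as np

    # Toy reproduction of the zero-prediction case (all values are made up).
    tp = np.array([])                     # no detections -> empty TP cumsum
    fp = np.array([])                     # ... and empty FP cumsum
    num_gt = 5                            # some ground-truth boxes exist
    num_recalls = 101                     # assumed number of recall thresholds
    rec_thrs = np.linspace(0.0, 1.0, num_recalls)

    rc = tp / num_gt                                   # empty array
    pr = (tp / (fp + tp + np.spacing(1))).tolist()     # empty list

    rec_thrs_insert_idx = np.searchsorted(rc, rec_thrs, side="left")  # all zeros

    pr_at_recall = [0.0] * num_recalls
    try:
        for _idx, pr_idx in enumerate(rec_thrs_insert_idx):
            pr_at_recall[_idx] = pr[pr_idx]            # IndexError on the first iteration
    except IndexError:
        pass

    print(pr_at_recall[:5])   # [0.0, 0.0, 0.0, 0.0, 0.0] -> stored as a precision of 0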

Expected behavior

I would expect the precision to remain at -1 (i.e. ignored in the final computation of the precision) in this scenario: since the detector has not predicted the class at all, it is unfair for it to receive a precision of 0.
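
For reference, a toy example of why the stored value matters downstream. I am assuming the summarization step averages only entries strictly greater than -1 (the usual s[s > -1] masking pattern in the COCO/LVIS summarize code); under that assumption, a 0 drags the mean down while a -1 is simply skipped:

    import numpy as np

    # Illustrative per-category precisions; the third category was never predicted.
    precision_per_cat = np.array([0.80, 0.60, -1.0])

    # Current behaviour: the missing category is written as 0.0 and averaged in.
    current = precision_per_cat.copy()
    current[2] = 0.0
    print(current.mean())                            # ~0.467, unfairly penalized

    # Expected behaviour: keep -1 and mask it out before averaging.
    valid = precision_per_cat[precision_per_cat > -1]
    print(valid.mean())                              # 0.70, only evaluated categories count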

Proposed fix

A simple fix would be to do the following at lines 375-380:

                    if num_tp:
                        recall[iou_thr_idx, cat_idx, area_idx] = rc[
                            -1
                        ]
                    else:
                        recall[iou_thr_idx, cat_idx, area_idx] = 0
                        # If there are no detections for that category, the precision is undefined.
                        continue

If num_tp = len(tp) = 0, this means that there were no detections for that category, which is exactly the scenario I am describing here. In this case, the recall is 0, and we stop there without computing the precision, which stays at its default of -1.
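
As a quick sanity check of the idea (toy shapes, not the real 4-D precision array), the continue simply leaves the pre-initialized -1 column untouched:

    import numpy as np

    # Toy check: precision is initialised to -1 and only overwritten when
    # a category actually has detections (shapes are illustrative only).
    num_recalls, num_cats = 101, 3
    precision = -np.ones((num_recalls, num_cats))

    detections_per_cat = [4, 7, 0]            # third category: no detections at all
    for cat_idx, num_dets in enumerate(detections_per_cat):
        if num_dets == 0:
            continue                          # proposed fix: skip, keep the -1 column
        precision[:, cat_idx] = 0.5           # stand-in for the real PR computation

    print(np.unique(precision[:, 2]))         # [-1.] -> ignored by the summarization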

Let me know what you think of this finding, or if I made a mistake in my reasoning.
