On validation metrics and thresholds #4

@SmilingWolf

First of all, nice job!

I noticed in the validation arena you're using my suggested thresholds for my models, and a "default" one for yours.
That's doing your work a disservice.
I think a fairer way to compare the models would be to fix some performance point and see how the other metrics fare.

For my models, for example, I used to pick (by bisection) the threshold where micro-averaged precision and recall matched: if both were higher than for the last model, I had a better model.
You could do the same, or bisect towards a threshold that gives a desired precision and evaluate recall there, for example.
This also has the side effect of being fairer to augmentations like mixup, which skew prediction confidence towards lower values.
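A minimal sketch of that bisection, assuming `y_true` and `probs` are hypothetical `(n_samples, n_tags)` arrays of binary labels and predicted probabilities (the function name and defaults are illustrative, not from either codebase):

```python
import numpy as np

def find_balanced_threshold(y_true, probs, lo=0.0, hi=1.0, iters=30):
    """Bisect for the threshold where micro-averaged precision == recall.

    Micro-averaged precision = TP / (TP + FP) and recall = TP / (TP + FN),
    with counts pooled over all tags. Precision generally rises and recall
    falls as the threshold increases, so their difference behaves
    monotonically enough for bisection.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        preds = probs >= mid
        tp = np.logical_and(preds, y_true).sum()
        precision = tp / max(preds.sum(), 1)
        recall = tp / max(y_true.sum(), 1)
        if precision < recall:
            lo = mid  # threshold too low: too many predictions
        else:
            hi = mid  # threshold too high: too many misses
    return (lo + hi) / 2.0
```

At the balanced point precision == recall (which also equals the micro-averaged F1 there), so each model is summarized by a single number at a comparable operating point.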

If I may go on a slight tangent about the discrepancy between my stated scores and the ones in the Arena: I used to use micro averaging, while you're calculating macro averages. Definitely keep using macro averaging for the metrics; I started using it too in my newer codebase over at https://github.com/SmilingWolf/JAX-CV (posting the repo in case you consider using it if you decide to apply to TRC).
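A toy illustration of the difference, with hypothetical multilabel data (two tags, the second rare and mispredicted):

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical data: 4 samples, 2 tags; tag 0 is common and easy,
# tag 1 is rare and gets mispredicted.
y_true = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])

# Micro averaging pools TP/FP/FN across all tags before dividing,
# so the frequent tag dominates: 4 TP / 5 predicted positives.
print(precision_score(y_true, y_pred, average="micro"))  # 0.8

# Macro averaging computes precision per tag, then averages,
# so the rare tag counts equally: (1.0 + 0.0) / 2.
print(precision_score(y_true, y_pred, average="macro"))  # 0.5
```

The same model can score noticeably lower under macro averaging because rare tags weigh as much as frequent ones, which accounts for part of the discrepancy between the two sets of scores.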
