At the moment, we run our benchmarking code manually to assess model changes and guard against regressions. This is imperfect because:
- It relies on someone manually checking match rates against our ground-truth datasets.
- It’s time-consuming to run the benchmarks against more than one dataset when deploying changes.
- We don’t have a complete history of model performance over time.
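The manual check in the first bullet boils down to computing a match rate against a ground-truth dataset. A minimal sketch of that calculation (the function name and exact-equality notion of a "match" are illustrative assumptions, not our actual benchmarking code):

```python
def match_rate(predictions, ground_truth):
    """Fraction of predictions that exactly match the ground-truth labels.

    Both arguments are equal-length sequences. A "match" here is exact
    equality, which is the simplest version of the check we currently
    perform by hand against each dataset.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must be the same length")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)
```

Automating this per dataset, and recording the result on every release, is what would give us the missing history.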
Now that we have MLflow running in GitHub, it would make sense to set up a new repo (or workflow) that automatically runs model evaluation whenever we publish a new GitHub release, and logs the resulting accuracy to MLflow so we build up a performance history.
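As a rough idea of shape, the release trigger could be a GitHub Actions workflow along these lines. This is a sketch only: the script name (`evaluate.py`), dataset layout, and MLflow tracking secret are assumptions, not things that exist in the repo yet.

```yaml
# Sketch: .github/workflows/evaluate.yml (filename and contents are assumptions)
name: Model evaluation
on:
  release:
    types: [published]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run evaluation and log match rates to MLflow
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python evaluate.py --datasets data/ground_truth/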