Potential enhancements:
* Add a parameter that restricts which labels are averaged over (e.g., to ignore a majority label with a very high score).
* Accept metric names as strings that can be passed to HF evaluate functions.
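
A minimal sketch of how the two enhancements might look, not the project's actual implementation. The names `per_label_f1`, `macro_f1`, `resolve_metric`, and the `labels` parameter are hypothetical and chosen only for illustration. F1 stands in for whichever per-label score is being averaged. The string path assumes the Hugging Face `evaluate` package is installed and uses its `evaluate.load(name)` and `.compute(predictions=..., references=...)` calls.

```python
from collections import Counter


def per_label_f1(references, predictions):
    """Return {label: F1} for every label seen in references or predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[ref] += 1   # true label was missed
    scores = {}
    for label in set(references) | set(predictions):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(references, predictions, labels=None):
    """Unweighted mean of per-label F1.

    `labels` (hypothetical parameter) restricts the average to a subset,
    e.g. to leave out a dominant majority label. None means all labels.
    """
    scores = per_label_f1(references, predictions)
    selected = list(scores) if labels is None else list(labels)
    if not selected:
        raise ValueError("no labels to average over")
    # A requested label that never occurs contributes a score of 0.0.
    return sum(scores.get(label, 0.0) for label in selected) / len(selected)


def resolve_metric(metric):
    """Accept either a callable or a metric name (hypothetical helper).

    Strings are handed to Hugging Face `evaluate`; the import is deferred
    so that callers passing a callable do not need the package installed.
    """
    if callable(metric):
        return metric
    import evaluate  # assumption: the `evaluate` package is available

    module = evaluate.load(metric)
    return lambda references, predictions, **kwargs: module.compute(
        predictions=predictions, references=references, **kwargs
    )


if __name__ == "__main__":
    refs = ["O", "O", "O", "O", "A", "B"]
    preds = ["O", "O", "O", "O", "A", "A"]
    print(macro_f1(refs, preds))                     # all labels
    print(macro_f1(refs, preds, labels=["A", "B"]))  # majority label "O" excluded
```

In this toy data the majority label "O" is predicted perfectly, so including it lifts the average; restricting `labels` to the minority classes gives a lower figure that reflects only their performance.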