This is intended as a question/request for advice rather than a bug report. Thanks!
I am using HELM as a benchmark for audio models, and I've been trying to reproduce the HELM leaderboard results for the LibriSpeech WER scenario with GPT-4o mini Audio (Preview 2024-12-17).
Running HELM locally with the same dataset, number of samples, etc., I consistently see WER ≈ 0.36, compared with the 0.163 shown in the published leaderboard.
My current theory is that OpenAI's safety filters have changed, so that even with the same checkpoint we now get many more refusals (responses like "I'm unable to assist with that"), and these push up the mean per-sample WER considerably.
So my questions are: am I right that this metric has gone up for the same checkpoint? And does my theory about filter changes make sense?
For reference, this is how I run the eval to reproduce the leaderboard for this scenario and model:
```bash
#!/usr/bin/env bash
set -euo pipefail

export SUITE_NAME=audio_leaderboard
export MODELS_TO_RUN=openai/gpt-4o-mini-audio-preview-2024-12-17
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_audio.conf
export SCHEMA_PATH=src/helm/benchmark/static/schema_audio.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=1

helm-run \
  --conf-paths "$RUN_ENTRIES_CONF_PATH" \
  --num-train-trials "$NUM_TRAIN_TRIALS" \
  --max-eval-instances "$MAX_EVAL_INSTANCES" \
  --priority "$PRIORITY" \
  --suite "$SUITE_NAME" \
  --models-to-run "$MODELS_TO_RUN" \
  --groups-to-run librispeech

helm-summarize --schema "$SCHEMA_PATH" --suite "$SUITE_NAME"
helm-server --suite "$SUITE_NAME"
```
Note that you'd need to disable the cache if running this multiple times; otherwise you'd see the cached results on the next run.
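For completeness, this is roughly how I counted refusals in the per-instance outputs. I'm deliberately not showing the file-reading part, since the output layout varies between HELM versions; the helper below just takes an iterable of predicted transcript strings, and the refusal markers are illustrative examples, not an exhaustive list:

```python
import statistics

# Illustrative substrings; real refusals vary in wording.
REFUSAL_MARKERS = ("i'm unable to assist", "i can't assist", "i'm sorry")

def refusal_stats(predictions):
    """Count refusal-like outputs and summarize their length in words."""
    refusals = [p for p in predictions
                if any(m in p.lower() for m in REFUSAL_MARKERS)]
    lengths = [len(r.split()) for r in refusals]
    return {
        "count": len(refusals),
        "avg_words": statistics.mean(lengths) if lengths else 0,
        "median_words": statistics.median(lengths) if lengths else 0,
        "min_words": min(lengths, default=0),
        "max_words": max(lengths, default=0),
    }

stats = refusal_stats([
    "he began a confused complaint against the wizard",
    "I'm unable to assist with that",
])
print(stats["count"], stats["avg_words"])  # 1 refusal, 6 words long
```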
Addendum: evidence for an increased number of refusals
The reason for my theory is that I see a much higher rate of refusals in my results:
- Leaderboard run: 38 refusals; average refusal length 7.89 words (median 7, min 7, max 18).
- Local run: 62 refusals; average length 15.0 words (median 21, min 7, max 24).
So the number of refusals has increased, which pushes up the mean WER, and the refusals themselves have become longer, which compounds the effect.
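To make the mechanism concrete, here is a minimal sketch of why a refusal is so costly under WER. WER is the word-level edit distance divided by the reference length, so a refusal that shares no words with the reference transcript scores at (or, if it is longer than the reference, above) 1.0, versus a small fraction for a near-correct transcript. The strings below are made-up examples, not actual LibriSpeech data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                           # deletion
                       d[j - 1] + 1,                       # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[-1] / len(ref)

reference = ("he began a confused complaint against the wizard "
             "who had vanished behind the curtain")
transcript = ("he began a confused complaint against the wizard "
              "who had vanished behind the curtains")
refusal = "I'm unable to assist with that"

print(wer(reference, transcript))  # 1 substitution / 14 words ≈ 0.071
print(wer(reference, refusal))     # no overlapping words, so 1.0
```

A handful of extra refusals per thousand samples therefore shifts the mean noticeably, and longer refusals (more insertions against short references) can score above 1.0, compounding the shift.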