This is intended as a question/request for advice rather than a bug report. Thanks!
I am using HELM as a benchmark for audio models, and I've been trying to reproduce the HELM leaderboard results for the LibriSpeech WER scenario with GPT-4o mini Audio (Preview 2024-12-17).
Running HELM locally with the same dataset, number of samples, etc., I consistently see WER ≈ 0.36, compared with the 0.163 shown in the published leaderboard.
My current theory is that OpenAI's safety filters have changed, so that even with the same checkpoint we now get many more refusals (responses like "I'm unable to assist with that"), and these push up the mean per-sample WER considerably.
So my questions are: am I right that this metric has gone up for the same checkpoint? And does my theory about filter changes make sense?
For reference, this is how I run the eval to reproduce the leaderboard for this scenario and model:
```bash
#!/usr/bin/env bash
set -euo pipefail

export SUITE_NAME=audio_leaderboard
export MODELS_TO_RUN=openai/gpt-4o-mini-audio-preview-2024-12-17
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_audio.conf
export SCHEMA_PATH=src/helm/benchmark/static/schema_audio.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=1

helm-run \
  --conf-paths "$RUN_ENTRIES_CONF_PATH" \
  --num-train-trials "$NUM_TRAIN_TRIALS" \
  --max-eval-instances "$MAX_EVAL_INSTANCES" \
  --priority "$PRIORITY" \
  --suite "$SUITE_NAME" \
  --models-to-run "$MODELS_TO_RUN" \
  --groups-to-run librispeech

helm-summarize --schema "$SCHEMA_PATH" --suite "$SUITE_NAME"
helm-server --suite "$SUITE_NAME"
```
Note that you'd need to disable the cache if running this multiple times; otherwise you'd see the cached results on the next run.
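For completeness, this is roughly how I counted refusals in the per-instance outputs. I'm deliberately not showing the file-reading part, since the output layout varies between HELM versions; the helper below just takes an iterable of predicted transcript strings, and the refusal markers are illustrative examples, not an exhaustive list:

```python
import statistics

# Illustrative substrings; real refusals vary in wording.
REFUSAL_MARKERS = ("i'm unable to assist", "i can't assist", "i'm sorry")

def refusal_stats(predictions):
    """Count refusal-like outputs and summarize their length in words."""
    refusals = [p for p in predictions
                if any(m in p.lower() for m in REFUSAL_MARKERS)]
    lengths = [len(r.split()) for r in refusals]
    return {
        "count": len(refusals),
        "avg_words": statistics.mean(lengths) if lengths else 0,
        "median_words": statistics.median(lengths) if lengths else 0,
        "min_words": min(lengths, default=0),
        "max_words": max(lengths, default=0),
    }

stats = refusal_stats([
    "he began a confused complaint against the wizard",
    "I'm unable to assist with that",
])
print(stats["count"], stats["avg_words"])  # 1 refusal, 6 words long
```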
Addendum: evidence for an increased number of refusals
The reason for my theory is that I see a much higher rate of refusals in my results:
- Leaderboard run: 38 refusals; average refusal length 7.89 words (median 7, min 7, max 18).
- Local run: 62 refusals; average length 15.0 words (median 21, min 7, max 24).
So the number of refusals has increased, which pushes up the mean WER, and the refusals themselves have become longer, which compounds the effect.
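To make the mechanism concrete, here is a minimal sketch of why a refusal is so costly under WER. WER is the word-level edit distance divided by the reference length, so a refusal that shares no words with the reference transcript scores at (or, if it is longer than the reference, above) 1.0, versus a small fraction for a near-correct transcript. The strings below are made-up examples, not actual LibriSpeech data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                           # deletion
                       d[j - 1] + 1,                       # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[-1] / len(ref)

reference = ("he began a confused complaint against the wizard "
             "who had vanished behind the curtain")
transcript = ("he began a confused complaint against the wizard "
              "who had vanished behind the curtains")
refusal = "I'm unable to assist with that"

print(wer(reference, transcript))  # 1 substitution / 14 words ≈ 0.071
print(wer(reference, refusal))     # no overlapping words, so 1.0
```

A handful of extra refusals per thousand samples therefore shifts the mean noticeably, and longer refusals (more insertions against short references) can score above 1.0, compounding the shift.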