Much higher WER for gpt4o-mini, compared with published leaderboard #3971

@samrae7

Description

This is intended as a question/request for advice rather than an issue. Thanks 🙏

I am trying to use HELM as a benchmark for audio models, and I've been trying to reproduce the HELM leaderboard results for the librispeech-WER scenario with GPT-4o mini Audio (Preview 2024-12-17).

Running HELM locally with the same dataset, number of samples, etc., I consistently see WER ≈ 0.36, compared with the 0.163 shown in the published leaderboard.

My current theory is that OpenAI's safety filters have changed, so that even with the same checkpoint we now get many more refusals (things like "I'm unable to assist with that"), which pushes up the mean per-sample WER considerably.

So my questions are: am I right that this metric has gone up for the same checkpoint? And does my theory about filter changes make sense?

For reference, this is how I run the eval to reproduce the leaderboard for this scenario and model:

#!/usr/bin/env bash
set -euo pipefail

export SUITE_NAME=audio_leaderboard
export MODELS_TO_RUN=openai/gpt-4o-mini-audio-preview-2024-12-17
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_audio.conf
export SCHEMA_PATH=src/helm/benchmark/static/schema_audio.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=1

helm-run \
  --conf-paths "$RUN_ENTRIES_CONF_PATH" \
  --num-train-trials "$NUM_TRAIN_TRIALS" \
  --max-eval-instances "$MAX_EVAL_INSTANCES" \
  --priority "$PRIORITY" \
  --suite "$SUITE_NAME" \
  --models-to-run "$MODELS_TO_RUN" \
  --groups-to-run librispeech

helm-summarize --schema "$SCHEMA_PATH" --suite "$SUITE_NAME"

helm-server --suite "$SUITE_NAME"

Note that we'd need to run this with disable-cache if running it multiple times; otherwise we'd see the cached results on subsequent runs.
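For the repeated runs, a sketch of the cache-disabled invocation (I'm assuming the flag is spelled `--disable-cache`; verify against `helm-run --help` for your HELM version):

```shell
# Re-run with the request cache disabled so fresh API responses are fetched
# instead of replaying previously cached ones.
# Assumption: the flag is --disable-cache; check `helm-run --help` to confirm.
helm-run \
  --conf-paths "$RUN_ENTRIES_CONF_PATH" \
  --num-train-trials "$NUM_TRAIN_TRIALS" \
  --max-eval-instances "$MAX_EVAL_INSTANCES" \
  --priority "$PRIORITY" \
  --suite "$SUITE_NAME" \
  --models-to-run "$MODELS_TO_RUN" \
  --groups-to-run librispeech \
  --disable-cache
```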

Addendum: evidence for increased number of refusals

The reason for my theory is that I see a much higher rate of refusals in the results:

  • Leaderboard run: 38 refusals; average refusal length 7.89 words; median 7; min 7; max 18.
  • Local run: 62 refusals; average length 15.0 words; median 21; min 7; max 24.
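These stats can be computed with a sketch like the following. The keyword-based refusal detector and the sample outputs are my own assumptions for illustration, not the exact criterion used to produce the numbers above:

```python
# Sketch: summarize refusal lengths (count, mean/median/min/max in words)
# from a list of model predictions, using a naive keyword-based refusal check.
import statistics

# Assumption: a small set of phrases that signal a refusal.
REFUSAL_MARKERS = ("i'm unable", "i cannot", "i can't", "i'm sorry")

def is_refusal(text: str) -> bool:
    """Return True if the prediction looks like a refusal (keyword heuristic)."""
    t = text.lower()
    return any(marker in t for marker in REFUSAL_MARKERS)

def refusal_stats(predictions: list[str]) -> dict:
    """Count refusals and summarize their lengths in words."""
    lengths = [len(p.split()) for p in predictions if is_refusal(p)]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Toy sample: one genuine transcription, two refusals.
sample = [
    "he hoped there would be stew for dinner",
    "I'm unable to assist with that",
    "I'm sorry but I can't help with this request",
]
print(refusal_stats(sample))
```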

So the number of refusals has increased, which would push up the mean WER, and the refusals themselves have grown longer, which would compound the effect: every extra word in a refusal counts as an additional insertion error against the reference transcript.
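The mechanism can be demonstrated with a minimal WER implementation (word-level Levenshtein distance divided by reference length). A refusal sharing no words with the reference scores WER ≈ 1.0, and a refusal longer than the reference pushes WER above 1.0 via insertions. The reference sentence below is illustrative, not an actual LibriSpeech transcript:

```python
# Sketch: why refusals inflate per-sample WER, and why longer refusals
# compound the effect. WER = word-level edit distance / reference length.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative 11-word reference; neither refusal shares a word with it.
reference = "he hoped there would be stew for dinner turnips and carrots"
short_refusal = "I'm unable to assist with that"                       # 6 words
long_refusal = ("I'm sorry but I can't help with transcribing this audio "
                "please provide the request in a different form thanks")  # 19 words

print(f"short refusal WER: {wer(reference, short_refusal):.2f}")  # → 1.00
print(f"long refusal WER:  {wer(reference, long_refusal):.2f}")   # → 1.73
```

With no overlapping words, the edit distance equals the longer length, so a 19-word refusal against an 11-word reference yields 19/11 ≈ 1.73 for that single sample, which is how longer refusals drag the mean up further.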
