@pcsid pcsid commented Dec 19, 2025

…l in need of full model scoring. Fixed small inconsistency bug in config by changing judge_properties to judge_settings.

📌 Description

🔗 Related Issue(s)

🛠️ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality including new tasks)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor / Code cleanup
  • Maintenance / Chore / Task
  • Other (please describe):

✅ How Has This Been Tested?

  • Unit tests
  • Integration tests
  • Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

  • Code follows project style guidelines
  • Tests have been added/updated (if applicable)
  • Documentation has been updated (if applicable)
  • Linked relevant issue(s)
  • Self-reviewed my code

🙌 Additional Notes

  1. Added three new tasks. Please test them on the full dataset; I don't have access to the cluster, so I can't run that many samples myself.
  • The phoneme task is not working as expected: the audio clips are too short (0.5-1 seconds), so the model often fails to recognize the audio.
  • The stuttering task worked fine, but was only tested on a few samples.
  • The noise detection task worked fine, but was only tested on a few samples.
  2. Small bug fix: the code now reads judge_settings instead of judge_properties from the config. This matches the rest of the documentation and the codebase pattern (judge_settings at the top level, judge_properties when creating the judge object itself).
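
The config fix in item 2 can be pictured roughly as follows. This is a minimal sketch, not the project's actual loader: the function name `read_judge_settings` and the dict-shaped config are assumptions made for illustration. Only the key names `judge_settings` and `judge_properties` come from this PR.

```python
from typing import Any, Mapping


def read_judge_settings(config: Mapping[str, Any]) -> dict:
    """Return the top-level judge settings from a parsed config.

    Hypothetical helper: the real loader and config schema are not shown
    in this PR. Only the two key names are taken from it.
    """
    if "judge_settings" in config:
        return dict(config["judge_settings"])
    if "judge_properties" in config:
        # The old top-level name: fail loudly instead of silently ignoring it.
        raise KeyError(
            "found top-level 'judge_properties'; rename it to 'judge_settings'"
        )
    # No judge configured at all: return an empty mapping.
    return {}
```

Raising on the old key is a design choice for the sketch; a real loader might instead log a deprecation warning and fall back to the old name.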

@pcsid pcsid self-assigned this Dec 19, 2025
@pcsid pcsid added bug Something isn't working enhancement New feature or request labels Dec 19, 2025
@nhhoang96 nhhoang96 left a comment

Make sure to include the minimum results needed to compare against the results reported in the current literature, and add them to the PR's Screenshots / Demos section.

Performance results on the phoneme and noise_detection tasks are low. Please inspect the outputs and submit samples to verify that the added tasks behave as expected.

@nhhoang96 nhhoang96 self-assigned this Dec 30, 2025
pcsid commented Jan 5, 2026

I checked the three new tasks against the reported scores. Here is the leaderboard I used - https://huggingface.co/spaces/DynamicSuperb/leaderboard

  1. The speech disorder (stuttering) task finished at 57% for gpt-4o-audio. The scores reported on DynamicSuperb were all around 50% (random guessing), which makes sense given that they come from older, open-source LLMs. The paper at https://arxiv.org/pdf/2102.12394, which uses trained, specialized audio models (not LLMs), reports much better scores, ranging from 61% to 67%.
Screenshot 2026-01-03 at 1 15 35 PM Screenshot 2026-01-04 at 5 29 36 PM
  2. On the speech enhancement task (Gaussian noise detection), gpt-4o-audio scored 53.5%, which is better than the scores reported on the Hugging Face leaderboard, mostly from weaker open-source models. Those scores were probably so low for a 50/50 task because of refusals or improper prompting.
Screenshot 2026-01-04 at 5 38 04 PM Screenshot 2026-01-04 at 5 22 35 PM
  3. On the phonetics task (phoneme counting), gpt-4o-audio scored 24.1%, which is better than the exact-match accuracy of the open-source models on the DynamicSuperb leaderboard, which ranged from 1% to 21%.
Screenshot 2026-01-04 at 5 48 46 PM Screenshot 2026-01-03 at 2 35 47 PM

For all three of these tasks, and especially the last two, LLMs as a whole don't seem very good at solving these problems: they are very niche classification tasks that are not well suited to the strengths of LLMs.
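
To make the comparison against chance concrete, here is a minimal sketch of scoring a two-way classification task while tracking unparseable responses (for example refusals) separately. The function names, the use of `None` to mark an unparseable response, and the normal approximation are assumptions for illustration; none of this is taken from the PR's code or from DynamicSuperb.

```python
import math
from typing import Optional, Sequence


def score_binary(preds: Sequence[Optional[str]], golds: Sequence[str]) -> dict:
    """Exact-match accuracy where None marks an unparseable response."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal length")
    # An unparseable response (None) never equals a gold label, so it counts as wrong.
    correct = sum(p == g for p, g in zip(preds, golds))
    unparsed = sum(p is None for p in preds)
    n = len(golds)
    return {"accuracy": correct / n, "unparsed_rate": unparsed / n, "n": n}


def z_vs_chance(accuracy: float, n: int, chance: float = 0.5) -> float:
    """Normal-approximation z-score of an accuracy against a chance rate."""
    se = math.sqrt(chance * (1 - chance) / n)
    return (accuracy - chance) / se
```

Because an unparseable response counts as wrong, a high `unparsed_rate` can pull accuracy on a 50/50 task below 50%. The z-score depends on the sample size, so the same accuracy can be within noise on a small sample and clearly above chance on a large one.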

@pcsid pcsid requested a review from nhhoang96 January 5, 2026 01:53
@nhhoang96 nhhoang96 left a comment

LGTM

@nhhoang96 nhhoang96 merged commit a77e996 into main Jan 5, 2026
@nhhoang96 nhhoang96 deleted the feat/dynamic-tasks branch January 5, 2026 21:12

Labels

bug Something isn't working enhancement New feature or request

3 participants