Add Whisper-based SOTA model (record-breaking WER) #92
Conversation
Added the Whisper-based Pingala model by ShunyaLabs.

This PR requests evaluation of only one model and supersedes the two earlier PRs by the same team. Resolving this PR will also enable the team to withdraw the other two: PR 87 requested evaluation of two models that were gated; PR 80 requested the same two models, but they were proprietary and evaluation was requested through an API.
Hey @vivek-shunyalabs - thanks for the PR and apologies for the delay. The reason for the delay is that there were some doubts in the community about your models' evals, and we haven't gotten around to running/investigating them yet; see here: #87 (comment). Do note that the leaderboard is a joint effort across multiple orgs - we ask all leaderboard partners to approve evals before we put them up. Curious question: do you have any public info on what data these models were fine-tuned on? (Totally okay if not.) Tagging @pzelasko @nithinraok from the Nvidia NeMo team as well, since we discussed this a while back - if there are no objections and the numbers match upon our successful run, we'll update the leaderboard. Side note: Steven is currently the main maintainer and he's on vacation till the end of the month.
@Vaibhavs10 Thanks for your comment and explanation - really appreciate it. The comment on PR 87 was already addressed by the team, and besides, that comment was not about EN ASR; in my opinion, it can be ignored. @pzelasko @nithinraok - I am available to answer any questions you might have on this. We have already been waiting for more than two months to have our models listed. We would appreciate a quick turnaround on review and approval for this merge (possibly by Friday, 12th September).
General disclaimer: the statements I make are in a personal capacity and should not be attributed to my employer. I evaluated this model, Whisper large v3, Parakeet V2, and Canary-Qwen on three custom English test sets (A, B, C), and these are the results:

[Results table attached as an image in the original comment; per the follow-up discussion, it showed markedly worse out-of-domain scores for the submitted model, e.g. 15.5% WER on set C versus 5.5% for the base Whisper model.]

I do think that the Open ASR Leaderboard would benefit from its own collection of non-public test sets for validation of submissions at this point, as I mentioned in an earlier discussion with @Vaibhavs10 and @Deep-unlearning.
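For context on what such a held-out check involves, here is a minimal sketch in the spirit of the leaderboard's scoring (illustrative only: the private sets A/B/C are not public, the file names and transcripts below are invented, and the actual harness lives in the leaderboard repo). The key step is applying Whisper-style English text normalization to both references and hypotheses before computing corpus-level WER:

```python
# Illustrative sketch only - NOT the leaderboard's actual harness, and the
# audio files / reference transcripts here are made-up placeholders.
import jiwer  # pip install jiwer
from transformers import pipeline
from whisper.normalizers import EnglishTextNormalizer  # pip install openai-whisper

normalizer = EnglishTextNormalizer()
asr = pipeline("automatic-speech-recognition",
               model="shunyalabs/pingala-v1-universal")  # model id from this PR

# Hypothetical private test set: (audio file, reference transcript) pairs.
private_set = [
    ("clip_000.wav", "The quick brown fox jumps over the lazy dog."),
    ("clip_001.wav", "Word error rate is the standard ASR metric."),
]

references, hypotheses = [], []
for audio_path, reference in private_set:
    references.append(normalizer(reference))
    hypotheses.append(normalizer(asr(audio_path)["text"]))

# Corpus-level WER: total edit operations over total reference words.
print(f"WER = {100 * jiwer.wer(references, hypotheses):.2f} %")
```

Because normalization can move WER by whole percentage points, any comparison between private-set numbers and leaderboard numbers only holds if both sides use the same normalizer, which is exactly why the normalization script comes up later in this thread.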
Ooof! The OOD scores look really bad - is there any chance of leaderboard data contamination in your training data? @vivek-shunyalabs
Hi Piotr/Vaibhav, thank you for taking the time to go through the PR and the accompanying comments. There is a lot to unpack in the comments, but to ensure that we keep moving forward on the right track, I will frame my response around three basic principles of a healthy discussion: fairness, transparency, and accuracy.

Fairness: The Open ASR Leaderboard has existed for 3+ years now. Getting your models listed on it is a matter of prestige (see NVIDIA's blog post when their models were listed: https://developer.nvidia.com/blog/nvidia-speech-ai-models-deliver-industry-leading-accuracy-and-performance/), and until recently (as recently as last month, when the NVIDIA models launched), the same evaluation methodology was used and models were listed immediately. We have submitted our model to the same evaluation methodology and obtained consistent scores on all of the diverse evaluation sets used by the leaderboard. I do not see why a discussion about the leaderboard's current evaluation methodology should stop this PR merge - would you agree, @pzelasko @Vaibhavs10? Discussions on revamping the leaderboard can proceed in parallel, but that exercise will take time in both theory and practice.

Transparency: As mentioned in the original PR text, the Pingala model is a fine-tuned version of Whisper-large. We paid special attention to various training aspects, such as using an appropriate learning rate, training only the decoder layers, and maintaining diversity among our training datasets. We were extremely cautious about overfitting and validated against it thoroughly. The exact methodology cannot be shared here, as our paper is under double-blind review; once the paper clears the review cycle, we will share it publicly. So, to answer your question, @Vaibhavs10: I do not believe there is any chance of contamination in our training data.

Accuracy: While our accuracy is very much proven by the current evaluation methodology, I want to address @pzelasko's observations here. We went back to our whiteboard and evaluated each and every aspect of our model training. We could not come up with a single obvious reason why our fine-tuning would result in an error rate of 15.5% (vs 5.5% for the base model) on dataset C. Without access to the evaluation data and the normalization script, we are unable to validate or accept your figures. In line with the transparency principle, we request that you share your dataset and normalization script - but again, this should NOT hold up the PR merge.

As usual, we welcome all healthy, fair, and transparent comments and are available to answer them. Looking forward to this PR being merged ASAP.
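To make the "training only the decoder layers" point concrete, here is a minimal sketch of what such a setup typically looks like with transformers (an illustration only: the team's actual recipe, base checkpoint, and hyperparameters are unpublished, so every value below is a placeholder):

```python
# Illustrative sketch only - the team's actual recipe, base checkpoint and
# hyperparameters are unpublished, so everything below is a placeholder.
import torch
from transformers import WhisperForConditionalGeneration

# Assuming a Whisper-large checkpoint as the base (the exact one is not stated).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# "Only training decoder layers": freeze the encoder so that gradient updates
# reach the decoder (and the LM head) only.
model.model.encoder.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} params ({trainable / total:.1%})")

# "Appropriate learning rate": a conservative value is typical for fine-tuning,
# but the value actually used for Pingala is not public.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```

Freezing the encoder preserves the pretrained acoustic representations, which constrains (but does not by itself rule out) the ways a fine-tune can overfit or memorize evaluation text.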
@pzelasko Shall I construe your silence as agreement with my statements, and that discussions on revamping the leaderboard will not block this PR? Please let us know if you have any other comments by Wednesday the 17th; I am aiming to get this PR merged next Monday. @nithinraok Any comments from you?
@Vaibhavs10 @Deep-unlearning Hi Vaibhav/Steven, since there have been no further comments from other members and the previous comments have already been addressed, I am presuming that we are ready to merge the PR first thing next week (there has already been a considerable delay), when Steven is back in office. Please let me know if there are still any outstanding questions.
Hello Open-ASR team,
This PR adds my Whisper-based ASR model to the leaderboard. The modification is minimal, yet it represents a model that has achieved record-breaking WER in our evaluation.
This model is the result of extensive fine-tuning and deep research, which have directly contributed to its significant performance gains. The supporting paper is currently under review by the scientific community, and the model has already been downloaded ~2000 times on Hugging Face.
The model continues to gain traction, and it would be a missed opportunity for Hugging Face not to list it on the leaderboard at this point. Given the adherence to the PR submission guidelines and the demonstrated impact of the model, I urge the team to merge this PR swiftly so that the leaderboard reflects the latest advances and remains a credible source of truth for the ASR community.
The model has already been evaluated with the run_whisper.sh script under the transformers folder (the file is modified as part of this PR) on an NVIDIA A100-SXM4-80GB machine. Here is a summary of the results:
Results per dataset for `shunyalabs/pingala-v1-universal`:

| Dataset | WER (%) | RTFx |
| --- | --- | --- |
| hf-audio-esb-datasets-test-only-sorted_ami_test | 4.24 | 126.71 |
| hf-audio-esb-datasets-test-only-sorted_earnings22_test | 6.00 | 252.69 |
| hf-audio-esb-datasets-test-only-sorted_gigaspeech_test | 4.98 | 282.11 |
| hf-audio-esb-datasets-test-only-sorted_librispeech_test.clean | 0.99 | 325.63 |
| hf-audio-esb-datasets-test-only-sorted_librispeech_test.other | 1.73 | 294.24 |
| hf-audio-esb-datasets-test-only-sorted_spgispeech_test | 1.10 | 396.44 |
| hf-audio-esb-datasets-test-only-sorted_tedlium_test | 1.41 | 357.42 |
| hf-audio-esb-datasets-test-only-sorted_voxpopuli_test | 4.31 | 417.02 |

Composite results: WER = 3.10 %, RTFx = 322.12
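As a sanity check on the composite figure: the leaderboard's composite WER is, to my understanding, the unweighted mean of the eight per-dataset WERs, which matches the reported value. (The composite RTFx is not a simple mean of the per-dataset RTFx values and presumably reflects total audio duration over total processing time, so it is not reproduced here.)

```python
# Per-dataset WERs from the table above; their unweighted mean matches
# the reported composite WER of 3.10 % after rounding.
wers = [4.24, 6.00, 4.98, 0.99, 1.73, 1.10, 1.41, 4.31]
print(f"composite WER = {sum(wers) / len(wers):.3f} %")  # -> 3.095 %
```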
We have also attached the output JSONL files to this PR.
MODEL_shunyalabs-pingala-v1-universal_DATASET_hf-audio-esb-datasets-test-only-sorted_ami_test.zip
Thank you for maintaining this important benchmark and for your prompt attention. We remain at your disposal to answer any questions and ensure a swift merge and subsequent listing on the Open ASR Leaderboard.
PS: Please see first comment for other relevant PRs and info.