Zero-shot Audio Classification using Whisper #673

jumon · 2022-12-13T08:47:35Z

Hi there!

I have found Whisper to be good at recognizing environmental sounds without fine-tuning, so I wrote some code to perform zero-shot audio classification using Whisper.

Code: https://github.com/jumon/zac
Demo: https://huggingface.co/spaces/Jumon/whisper-zero-shot-audio-classification

If you are interested, give it a try and let me know what you think.

I have evaluated the code for zero-shot environmental sound classification on the ESC-50 dataset, which contains 2000 audio samples from 50 classes (40 samples per class), and it achieved 31.8% accuracy. Since random guessing over 50 classes would score 2%, I think the result is not bad.

erdeme36 · 2022-12-19T13:46:46Z

Can we add more classes to it?

5 replies

jumon Dec 20, 2022
Author

Yes! You can add any class name you'd like to class_names.txt, since this is a zero-shot classification system that works from the natural-language text of the class names. However, please keep in mind that rare class names may not be recognized as accurately.

erdeme36 Dec 20, 2022

Thanks for the fast response. By the way, it was awesome; thanks for sharing. I have one more question: can I get timestamps for the classified sounds, such as 00:00:02,3 - 00:00:04,7 [applause]? Thanks for helping.

jumon Dec 20, 2022
Author

Unfortunately, my code is unable to provide timestamps, because for a given audio clip it only calculates the probability that Whisper would generate the text of each class name. If you want timestamps with transcriptions that include tags such as "[applause]", you could try using Whisper's default transcribe function. Keep in mind that this function suppresses the generation of bracketed tags like "[...]" (you can view this in the code here), so you will need to modify the code a bit to output text with tags like "[applause]". However, I do not believe the results of this approach would be reliable...
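To make the "probability of each class name text" idea concrete, here is a minimal sketch in plain Python. It is not the code from the zac repository: the function names are hypothetical, and it assumes you have already obtained, for each class name, the decoder's raw logits at every step together with the token ids of that class name. It only shows how per-token log-probabilities can be summed into a score per class and then normalized across classes.

```python
import math


def log_softmax(logits):
    """Convert raw logits over the vocabulary into log-probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_prob(step_logits, token_ids):
    """Sum log p(token) over the tokens of one class name.

    step_logits[t] holds the (assumed, precomputed) decoder logits at
    step t; token_ids[t] is the class-name token expected at that step.
    """
    total = 0.0
    for logits, token_id in zip(step_logits, token_ids):
        total += log_softmax(logits)[token_id]
    return total


def rank_classes(class_scores):
    """Softmax the per-class log-scores and sort from most to least likely."""
    names = list(class_scores)
    m = max(class_scores.values())
    weights = [math.exp(class_scores[n] - m) for n in names]
    z = sum(weights)
    probs = {n: w / z for n, w in zip(names, weights)}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


# Toy example with a 3-token vocabulary (made-up numbers, not model output).
scores = {
    "dog": sequence_log_prob([[2.0, 0.0, 0.0]], [0]),
    "rain": sequence_log_prob([[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1, 2]),
}
ranked = rank_classes(scores)
```

One thing to keep in mind with a plain sum like this: class names that split into more tokens accumulate more negative log-probability, so some form of length normalization may be worth trying. Whether the original code does that is not stated in this thread.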

erdeme36 Dec 21, 2022

I couldn't find Whisper's default transcribe function. Which one are you talking about? My task is to detect sounds such as explosions and applause with timestamps, in order to show them as subtitles. Thanks for helping.

jumon Dec 21, 2022
Author

This one: the transcribe function defined at line 19 of whisper/whisper/transcribe.py (commit 0b5dcfd):

def transcribe(
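For the subtitle use case, the formatting step can be sketched separately from the model. The snippet below assumes a list of segment dictionaries with "start", "end" (in seconds) and "text" keys, which mirrors the "segments" list in the result returned by Whisper's transcribe. The helper names are hypothetical, and nothing here makes Whisper emit tags like "[applause]"; it only turns segments you already have into SRT-style subtitle blocks.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn segment dicts (start, end, text) into numbered SRT blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Hand-written example segment, not real model output.
example = [{"start": 2.3, "end": 4.7, "text": " [applause]"}]
srt_text = segments_to_srt(example)
```

With real output you would pass the "segments" list from the transcription result instead of the hand-written example.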

thelou1s · 2023-10-25T11:41:15Z

Not good, at least for my 10 audio clips.

0 replies
