Which audio file format is best? #41

FurkanGozukara · 2022-09-22T18:23:12Z

I will use Whisper to transcribe (speech to text) and generate subtitles for my free English Software Engineering Courses (SECourses), which I publish on YouTube: https://www.youtube.com/SECourses

I have tried many free apps and all of them failed miserably.

The best I had found was the Google Speech-to-Text premium API, which costs a lot, and Google was asking for mono FLAC input.

I was using the following command to extract audio from my videos for the Google API:

ffmpeg -i "machine learning original lecture 1 week 1.mkv" -af aformat=s16:48000:mono machine_learning_lecture_1.flac

So which format do you suggest I provide to Whisper?

Currently I am transcribing this video and it is extremely slow: https://youtu.be/eWN4Ng08Y4U

I used this command to extract its audio:

ffmpeg -i "How to Debug Your Python Code Properly by Using Visual Studio Community Edition 2022 org.mkv" -q:a 0 -map a python_debug.mp3

And I used this command to start Whisper speech to text (I hope it is correct?):

whisper "D:\86 se courses youtube kanali\python_debug.mp3" --model medium

It looks even better than the Google Speech-to-Text premium API so far.
It is just too slow. I have a 4.59 GHz i7-10700F CPU (8 cores, 16 threads).

The accuracy is amazing for a non-native speaker like me. Thank you so much, guys.
It is just too slow 🗡️

I uploaded the generated subtitles with the timings that the AI produced.

Damn good, congrats. This is the first time I have seen an AI paper that is not cherry-picked.

Select the English subtitles on the video. They were generated by Whisper and I have not edited them at all so far: https://www.youtube.com/watch?v=eWN4Ng08Y4U
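
For readers following along, the same transcription can be driven from Python instead of the command line. The sketch below assumes the `openai-whisper` package is installed and relies on its `load_model` and `transcribe` calls. The helper names `srt_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are made up for illustration, and the SRT formatting is a hand-rolled approximation, not Whisper's own subtitle writer.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn segments (dicts with 'start', 'end', 'text') into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "medium") -> str:
    """Transcribe a file with Whisper and return SRT text (needs openai-whisper)."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("python_debug.mp3")` would be the rough equivalent of the `whisper ... --model medium` command above, with the subtitle text returned as a string instead of written to disk.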

FurkanGozukara · 2022-09-22T20:29:35Z

Author

OK, I have put the raw subtitles generated by medium.en on the video and I am simply amazed (I used the timings the AI generated).

Thank you so much, guys. This is an awesome tool.

https://www.youtube.com/watch?v=eWN4Ng08Y4U

Edit: I have also manually fixed the subtitles, and only 5 words had very minor errors. In 2 cases I said "were" where it was supposed to be "was", and the application wrote them as "was" lol :D

This is far better than even the premium Speech-to-Text API of Google Cloud services.

drdaxxy · 2022-09-22T22:53:27Z

The program converts your input with ffmpeg (effectively ffmpeg -i <recording> -ar 16000 -ac 1 -c:a pcm_s16le <output>.wav) and pre-processes it before doing any speech recognition. You can just give it your video files, except when that command wouldn't work (like if you have multiple audio languages and don't want the default track).
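
The conversion quoted above can be reproduced by hand if you want to inspect the audio Whisper will actually work with. This is a minimal sketch that only builds the ffmpeg argument list from that sentence; the function names `build_ffmpeg_args` and `convert_for_whisper` are hypothetical, and running the conversion assumes `ffmpeg` is on your PATH.

```python
import subprocess


def build_ffmpeg_args(src: str, dst: str) -> list:
    """Arguments for a 16 kHz, mono, 16-bit PCM WAV conversion."""
    return [
        "ffmpeg",
        "-i", src,
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # signed 16-bit little-endian PCM
        dst,
    ]


def convert_for_whisper(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_args(src, dst), check=True)
```

If a file has several audio tracks, adding ffmpeg's `-map 0:a:1` before the output path would select the second audio stream instead of the default one.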

> It is just too slow. I have [...] cpu

There's your problem: this is a colossal language model. Even on an RTX 3090 (a high-end consumer GPU), the medium model is only 3x-4x faster than playback time. On my CPU (AMD Ryzen 7 3700X), the small model did not finish transcribing a 1-minute sample in over 10 minutes.
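
To put those numbers in perspective, here is a rough back-of-the-envelope helper. It assumes the speed-up factor stays constant over the whole recording, which is only an approximation, and `estimate_minutes` is a made-up name.

```python
def estimate_minutes(audio_minutes: float, speedup: float) -> float:
    """Wall-clock minutes needed if transcription runs `speedup` times faster than playback."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_minutes / speedup


# A 60-minute lecture at the 3x-4x quoted for the medium model on an RTX 3090:
slow = estimate_minutes(60, 3)  # 20.0 minutes
fast = estimate_minutes(60, 4)  # 15.0 minutes
```

By the same arithmetic, the CPU anecdote (a 1-minute sample not finished after 10 minutes) corresponds to a factor below 0.1, so a 60-minute lecture would take more than 10 hours at that rate.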

1 reply

FurkanGozukara Sep 25, 2022
Author

Yes, because of that I have purchased an RTX 3060 with 12 GB of VRAM. It is just too slow on CPU, but I am OK with slowness in exchange for better quality. That is why I like Whisper so much. I wish they had also trained it for translation from English to other languages. I tested that by forcing another language in Google Colab, but the results were terrible. I have yet to test its speech recognition on foreign languages and its translation from foreign languages to English. I am waiting for my GPU to arrive.

georpat · 2022-12-09T09:21:25Z

Cost aside, which GPU will give the fastest performance? Say I wanted to use it to convert streaming audio in almost real time.

3 replies

FurkanGozukara Dec 9, 2022
Author

> Cost aside, which GPU will give the fastest performance? Say I wanted to use it to convert streaming audio in almost real time.

Probably whichever one is best for games. But make sure its VRAM capacity is enough for the model size you are targeting.

georpat Dec 9, 2022

What would be the most suitable GPU to purchase if I wanted to use the large model?

stromajer Dec 21, 2023

@georpat I would buy an RTX 3060 12 GB if you are on a tight budget; otherwise, a card with more memory is better (NVIDIA). AMD GPUs may sometimes be problematic.
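
The VRAM advice in this thread can be turned into a small lookup. The figures below are the approximate VRAM requirements listed in the Whisper README around the time of this thread, so treat them as ballpark values that may have changed since. `pick_model` and `APPROX_VRAM_GB` are hypothetical names, not part of Whisper.

```python
# Approximate VRAM needs in GB, smallest model first (ballpark README figures).
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(vram_gb: float):
    """Return the largest model whose approximate VRAM need fits, or None."""
    best = None
    for name, needed in APPROX_VRAM_GB:
        if needed <= vram_gb:
            best = name
    return best
```

With the 12 GB RTX 3060 mentioned above, `pick_model(12)` returns `"large"`, while an 8 GB card would top out at `"medium"`. In practice it is worth leaving some headroom, since other processes also use GPU memory.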
