
How to do Free Speech to Text Transcription Better Than Google Premium API with OpenAI Whisper Model

FurkanGozukara edited this page Oct 28, 2025 · 1 revision



If you want to transcribe your videos and audio into text for free but with high quality, you have come to the right video.

In this tutorial video, I will guide you through using the #OpenAI #Whisper model. I will show you how to install and run OpenAI's Whisper from scratch, and I will demonstrate how to convert audio/speech into text.

Whisper is a general-purpose speech recognition model released for free by OpenAI. I claim that Whisper is the best Speech-to-Text model (Natural Language Processing - #NLP) available to the public, better even than premium paid services such as Amazon Web Services, Microsoft Azure, or Google Cloud API. And Whisper is free to use.

I will show you how to install the necessary Python code and the dependent libraries. I will show you how to download a video from YouTube with YT-DLP, how to cut certain parts of the video with LosslessCut, and how to extract the audio of a video with FFMPEG. I will show you how to do a transcription of a video or a sound. I will show you how to generate subtitles for any video. Finally, I will show you how to generate translated transcription and subtitles of any language video.

With the translation feature of the Whisper model, you can watch content in any supported language (Whisper supports 99 languages) with English subtitles. Let's say you can't find English subtitles for your favorite video in German, Japanese, or Arabic. It is not a problem. Just follow my tutorial and generate English translated subtitles.

To be precise, Whisper can transcribe speech to text in all of the following languages, and can therefore translate speech in any of them into English:

{af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

The links and the commands I have shown in the video are below:

OpenAI Whisper : https://openai.com/blog/whisper/

Whisper Code : https://github.com/openai/whisper

Python : https://www.python.org/downloads/release/python-399/

Whisper install : pip install git+https://github.com/openai/whisper.git

How to install CUDA support to use the GPU when transcribing audio :

First, uninstall the existing PyTorch : pip3 uninstall torch

Then install PyTorch with CUDA support : pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
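Before running Whisper with --device cuda, it can help to confirm that the reinstalled PyTorch actually sees the GPU. The following is a minimal sketch that is not part of the video; `pick_device` is a hypothetical helper name, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    # The printed value can be passed to Whisper as the --device argument.
    print(pick_device())
```

If this prints cpu after installing the CUDA build, the usual suspects are a CPU-only wheel still being installed or a driver that does not match the chosen CUDA version.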

FFMPEG : https://github.com/BtbN/FFmpeg-Builds/releases

LosslessCut : https://github.com/mifi/lossless-cut/releases

How to extract the audio of any video with FFmpeg : ffmpeg -i "test_video.webm" -q:a 0 -map a test_video.mp3
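The preparation steps can also be scripted. The sketch below builds the command lines as Python lists so they can be inspected before being run. All helper names are hypothetical. The extraction flags are the ones from the FFmpeg command above; the yt-dlp format selector (`bestaudio`) and the trim flags (`-t`, `-c copy`) are assumptions about what the video does on screen, since those exact commands are not listed on this page.

```python
import subprocess
from typing import List, Optional


def list_formats_cmd(url: str) -> List[str]:
    # -F prints every available format for the video without downloading it.
    return ["yt-dlp", "-F", url]


def download_cmd(url: str, format_selector: Optional[str] = None) -> List[str]:
    # With no selector yt-dlp uses its default choice; a selector such as
    # "bestaudio" (an assumption here) asks for an audio-only stream.
    cmd = ["yt-dlp"]
    if format_selector is not None:
        cmd += ["-f", format_selector]
    return cmd + [url]


def extract_audio_cmd(video_path: str, audio_path: str) -> List[str]:
    # Same flags as the FFmpeg command above: -q:a 0 requests the highest
    # variable-bitrate quality, -map a keeps only the audio streams.
    return ["ffmpeg", "-i", video_path, "-q:a", "0", "-map", "a", audio_path]


def trim_cmd(src: str, dst: str, seconds: int) -> List[str]:
    # Keep the first `seconds` of the file; -c copy avoids re-encoding.
    return ["ffmpeg", "-i", src, "-t", str(seconds), "-c", "copy", dst]


def run(cmd: List[str]) -> None:
    # Requires yt-dlp and ffmpeg to be reachable from the command prompt.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    print(extract_audio_cmd("test_video.webm", "test_video.mp3"))
    print(trim_cmd("lecture.m4a", "lecture_short.m4a", 180))
```

Passing the arguments as a list, rather than one long string, avoids quoting problems with folder names that contain spaces, such as "C:\speech to text".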

How to transcribe an English video : whisper "C:\speech to text\test_video.mp3" --language en --model base.en --device cpu --task transcribe

How to transcribe an English video with CUDA support : whisper "C:\speech to text\test_video.mp3" --language en --model base.en --device cuda --task transcribe

How to transcribe a Turkish video : whisper "C:\speech to text\test_video.mp3" --language tr --model small --device cpu --task transcribe

How to transcribe a Turkish video with translation : whisper "C:\speech to text\test.mp3" --language tr --model small --device cuda -o "C:\speech to text" --task translate
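The Whisper commands above differ only in language, model, device, and task, so they can be generated from one helper. This is a minimal sketch with hypothetical helper names; it assumes the `whisper` command is available in the command prompt and uses only flags that appear in the commands above. It also encodes one rule from the Whisper README: the `.en` models are English-only and exist for the tiny, base, small, and medium sizes, so other languages need the plain multilingual model name.

```python
from typing import List, Optional

# Sizes for which an English-only ".en" model exists.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def choose_model(size: str, language: str) -> str:
    # Use the English-only variant only for English audio and only
    # when that variant exists; otherwise use the multilingual model.
    if language == "en" and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


def whisper_cmd(
    audio_path: str,
    language: str,
    size: str = "base",
    device: str = "cpu",
    task: str = "transcribe",
    output_dir: Optional[str] = None,
) -> List[str]:
    cmd = [
        "whisper", audio_path,
        "--language", language,
        "--model", choose_model(size, language),
        "--device", device,
        "--task", task,
    ]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd


if __name__ == "__main__":
    audio = r"C:\speech to text\test_video.mp3"
    print(whisper_cmd(audio, "en"))                       # English on CPU
    print(whisper_cmd(audio, "tr", size="small",
                      device="cuda", task="translate"))   # Turkish to English on GPU
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to start the transcription.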

Our Discord for SECourses : https://discord.gg/rfttctFewW

If you are interested in programming but you lack experience and skills I suggest you watch our playlists: https://www.youtube.com/c/SECourses/playlists

[1] Introduction to Programming Full Course with C# playlist

[2] Advanced Programming with C# Full Course Playlist

[3] Object Oriented Programming Full Course with C# playlist

[4] Asp.NET Core V5 - MVC Pattern - Bootstrap V5 - Responsive Web Programming with C# Full Course Playlist

[5] Artificial Intelligence (AI) and Machine Learning (ML) Full Course with C# Examples playlist

[6] Software Engineering Full Course playlist

[7] Security of Information Systems Full Course playlist

Video Transcription

  • 00:00:00 Hello everyone. Welcome to the Software Engineering Courses channel. I am Dr. Furkan Gözükara.

  • 00:00:07 Today I will be presenting to you the ultimate guide for speech-to-text transcribing for

  • 00:00:12 free on any Windows operating system. I will be using Windows 10 for demonstration. We

  • 00:00:18 will use Whisper, which is a general-purpose speech recognition model released for free

  • 00:00:23 by OpenAI. Whisper was released to the public two days ago. OpenAI is an artificial

  • 00:00:30 intelligence company. To download Whisper, just search Google for Whisper OpenAI, as can be

  • 00:00:35 seen here. Let me show you. And there is a blog post and a GitHub repository. I will explain to you

  • 00:00:46 from scratch how to transcribe a video. Moreover, I will show you how to generate subtitles

  • 00:00:52 for any video in any of the major languages spoken in the world. To be precise, Whisper

  • 00:00:59 supports 99 languages. Furthermore, I will demonstrate to you how to also generate speech-to-text

  • 00:01:06 translation to English from any of the supported languages. So let's say you have a Japanese

  • 00:01:12 TV series and you can't find English subtitles for these TV series. With Whisper, you can

  • 00:01:18 easily generate English subtitles to your favorite TV series with correct timing without

  • 00:01:23 any hassle and without paying a cent. Also, let's say you are a video producer and you

  • 00:01:29 are a non-English speaker. Then to your videos, you can easily generate subtitles and also

  • 00:01:36 English translated subtitles as well. I have already tested and compared the Whisper model

  • 00:01:43 with existing free and paid speech transcription and recognition services. The Whisper model

  • 00:01:50 is so good that it is better than the automatic subtitle generation system of YouTube or even

  • 00:01:55 better than the premium paid speech-to-text cloud API service of Google. Yes, I have used

  • 00:02:01 both services in my previous videos. Therefore, I know what they are capable of and what Whisper

  • 00:02:08 is capable of. For example, I used Whisper to generate subtitles for my previous video.

  • 00:02:15 Let me show you that. How to debug your Python code properly by using Visual Studio Community

  • 00:02:31 Edition 2022. You can select the Whisper generated subtitle from here as I am showing you right

  • 00:02:40 now. English, United States, English untouched by OpenAI. I didn't change even a letter in

  • 00:02:48 this subtitle and it is so good. Whisper made only 5 minor wording errors because I also

  • 00:02:57 uploaded the manually fixed version of the subtitle generated by Whisper. I will also

  • 00:03:05 upload the Whisper-generated subtitles for this video that I am recording right now as well, for

  • 00:03:12 you to check out and evaluate yourself. Okay, so let's start with installing all the necessary

  • 00:03:19 files. First, we need to download Python because Whisper runs on Python. They mentioned that

  • 00:03:28 they have used Python 3.9.9. Let me find it. Okay, here. By the way, to access OpenAI GitHub

  • 00:03:42 repository, you can click Code on their blog, and you can also read their paper if you are

  • 00:03:48 interested in how they developed it. They used 680,000 hours of multilingual and

  • 00:03:59 multitask supervised data. Supervised means that they are labeled, they have manually

  • 00:04:07 generated subtitles or they are manually transcribed. This is a huge task and this requires huge

  • 00:04:17 hardware computation power. I thank them for releasing this to the public for free. Okay,

  • 00:04:27 so as can be seen in the setup of the GitHub folder, they have used Python 3.9.9 and PyTorch

  • 00:04:38 and other things. I will download and install everything in this video so that it will

  • 00:04:44 be the ultimate guide for you. Download Python 3.9.9. Okay. All right, I will pick the Windows

  • 00:04:57 installer (64-bit) to download. Okay, the download has been completed. I will install it as an

  • 00:05:06 administrator. Okay, customize installation. You see I am picking everything. Next, I am

  • 00:05:15 picking these options as well and I will change the install location to a new folder on the C

  • 00:05:25 drive. I will name it Python 3.9.9. Okay. And I will pick that folder as you can see

  • 00:05:37 here and install. Okay, so the installation has been successful. I will pause the video while

  • 00:05:45 I am downloading or installing something. Okay, so what else do we need? Next, we need to

  • 00:05:53 install pip. Actually, it is already installed in my system. But if you want to install pip,

  • 00:06:00 just download pip like this. And here it will give us the command. I will open CMD; we will

  • 00:06:13 be working a lot with the command prompt. When I run pip install pip, it says it's already

  • 00:06:19 installed. Okay. Okay, it has been installed with the Python package. Then we need to run

  • 00:06:28 this command, as you can see, which is posted in the GitHub repository. By the way, I will put in the

  • 00:06:35 description of the video every command and every link that I am using here. So don't

  • 00:06:41 worry about that. Copy it. And let's just run in our command prompt. Okay. Okay, so

  • 00:06:52 the installation of Whisper has been completed successfully, as can be seen here. Now we

  • 00:06:58 can start transcribing the speech in our videos to text. Okay,

  • 00:07:10 so I will download a video of mine, and I will extract its audio for you. I will show

  • 00:07:16 you that, and then I will show you how to use OpenAI's Whisper. All right, so let's download my latest video.

  • 00:07:26 To download my video from YouTube, we will use yt-dlp. Just type yt-dlp like

  • 00:07:36 this. It's an open-source project. And let's download its release like this. Okay. Okay,

  • 00:07:49 I think this one. Okay, it has been downloaded. I will put this into my folder on C, which

  • 00:08:04 I have named speech to text, like this. So I am opening another command prompt, I

  • 00:08:11 am moving into C and then I'm moving into speech to text. I typed S P E and then

  • 00:08:21 I hit the Tab key on my keyboard and it auto-completes. Okay, now yt-dlp.exe, for which I also

  • 00:08:31 use Tab to auto-complete. I copy the video link, and I right-click and it

  • 00:08:40 pastes it into the command prompt. Okay, downloading. Meanwhile, let's also download FFmpeg, which

  • 00:08:51 we will be using for extracting audio. To download FFmpeg, click download, pick Windows. And

  • 00:09:04 yes, the second link, Windows builds by BtbN, it is the better one. I will download the biggest

  • 00:09:16 one, which is not the Linux one or this one. Okay, it is downloading. And

  • 00:09:27 let's see, okay, our video has been downloaded into our folder as you can see here. Okay.

  • 00:09:38 And FFmpeg has also been downloaded. Let's also copy and paste it into our folder and

  • 00:09:47 extract it. Okay, I need these three exes. Okay. And the video is, let's see 16 minutes.

  • 00:10:02 Yeah, it's a decent length. I think we can work on that, but we can also cut it and work on

  • 00:10:09 a small part of it. To cut a video I use another open-source project, which is a great one.

  • 00:10:16 So you will also learn this in this video: LosslessCut. Okay, LosslessCut is also another

  • 00:10:23 open-source project. This is the link to it. And let's just download its release here.

  • 00:10:35 And let's pick the correct file, which is this one I think. Okay, it is getting downloaded.

  • 00:10:47 These are small files. Yeah, oh, I need to download this one, not this one actually. Okay,

  • 00:10:59 it's almost ready. Okay, it's ready. It is downloaded. Let's also cut and paste it into

  • 00:11:07 our folder and extract it. Okay, I'm opening LosslessCut. Then I will open the downloaded video,

  • 00:11:19 or just drag and drop it like here. Then I will cut its first two minutes. Okay, so it

  • 00:11:28 will be fast. Okay, currently it says this part will be saved and this part will be ignored.

  • 00:11:37 I set it like this. Export. And here I will name it test video, like this.

  • 00:11:48 It is saved as test_video.webm because we downloaded it as WebM. Now it is time to extract

  • 00:11:55 its audio to feed it into Whisper. Okay. So the command is here. I have already written

  • 00:12:07 it. It is ffmpeg -i, then the video file name, and the output file name will be like

  • 00:12:18 this. Okay, let's run it. Before running it, let's open another command prompt and move

  • 00:12:23 into our folder. This time type CD, drag and drop the folder like this, hit enter, and

  • 00:12:31 we are there. When you type DIR you can see its content. I am copying and pasting

  • 00:12:39 the copied text. Okay, oh, we have a naming error. Okay, let's fix it. Okay, it is done.

  • 00:12:55 And now our test video MP3 file is ready. Time to transcribe it. Okay. So, by the

  • 00:13:06 way, when you install Whisper by default, it only supports running on CPU. But if you have

  • 00:13:16 a GPU that supports CUDA, then you can also use your GPU for speech-to-text transcribing.

  • 00:13:28 So first I will show you with CPU then I will show you with GPU as well. Okay, so our command

  • 00:13:38 will be like this. I will provide the language as well. The language of this video is English,

  • 00:13:49 therefore it will be like this. Okay, so there is model small. What that means is they

  • 00:13:56 released several models here: tiny, base, small, medium, and large. For English they

  • 00:14:04 say that the English-only models work better. And if you have time, I suggest you use

  • 00:14:12 the biggest one, which is the medium English-only model. And for multilingual, if you are going to transcribe

  • 00:14:21 a video that is in a language other than English, I suggest you use the large model. And these

  • 00:14:26 are the video RAM requirements if you use your GPU. By the way, GPU is much faster

  • 00:14:34 than CPU. Okay, it is many, many times faster than CPU. Currently, I have been running

  • 00:14:43 CPU transcription on another computer for over 24 hours, and it has only transcribed about four hours of speech

  • 00:14:53 on CPU. Therefore, I just purchased another graphics card today, which has 12 GB of VRAM,

  • 00:15:00 and I will use Whisper to generate and improve the subtitles of my other existing

  • 00:15:09 lecture videos as well. Okay, so let's start with the base model, which is a decent one and

  • 00:15:16 it should work fast. So therefore I am going to provide model base.en. I will select

  • 00:15:26 the device as CPU, currently only CPU is available, and the task will be transcribe. By the way,

  • 00:15:37 if you don't provide any task it will be transcribe by default. If you don't provide any

  • 00:15:41 device it will be CPU by default. If you don't provide any model it will be by default, let's

  • 00:15:48 check it out with whisper --help, the default model will be small. Okay, and if

  • 00:16:03 you don't provide any language it will try to detect the language of the provided audio.

  • 00:16:09 So these are the defaults. Moreover, you also need to move into the folder that you

  • 00:16:15 are going to transcribe in, otherwise it won't save the output, at least the last time I tried. So let's just

  • 00:16:25 run our command, which is this one. Okay, it's going to start. Okay, I will pause the video. Okay, it has

  • 00:16:41 started converting speech into text, as you can see: hello everyone, welcome to my channel

  • 00:16:53 again, this is Dr. Furkan Gözükara. It failed to understand my name; it is Turkish and our

  • 00:17:02 model is not the best one. When you transcribe a video with a better model, a bigger model,

  • 00:17:12 it does a better job for sure, I have tested it. If you have a GPU it is much faster, many

  • 00:17:20 times faster. My computer is also strong. You see, I have a Core i7-10700F CPU, which

  • 00:17:38 has 16 cores, 8 real cores and 8 logical cores. It runs at 4.59 gigahertz. You see, it is using

  • 00:17:48 100% CPU right now. Okay, the transcribing has finished because we have used a small model

  • 00:17:56 and a short video. This is the transcription. It is pretty good for this model. And let's

  • 00:18:07 open our folder and, as you can see, the transcript is showing in EmEditor. Okay, the transcript

  • 00:18:18 is here and the generated subtitle file is also here. This works directly in YouTube,

  • 00:18:31 I have tested it, because it has the correct timestamps for the sentences. It is awesome,

  • 00:18:38 believe me, and it is working. Okay, so we have successfully transcribed an English video

  • 00:18:51 and we have generated subtitles for that video. Let's test our subtitle. I am opening it with

  • 00:19:05 Media Player Classic. It automatically loaded the subtitle because it is in the same folder

  • 00:19:12 with the same name. You see, it is awesome. If we use the best

  • 00:19:34 model I am sure those minor mistakes will also get fixed. Okay, now time to show you how

  • 00:19:44 to use GPU on a Windows installation. To be able to use GPU, first we need to uninstall the installed

  • 00:19:54 torch. Okay. Okay, let's run the command. Okay, it says these depend on torch. I'm

  • 00:20:07 just saying yes and it is uninstalled. Then we install the latest torch. To get this command

  • 00:20:17 you can just go to torch, just type torch download, PyTorch actually. PyTorch, yes, just type

  • 00:20:28 PyTorch. Then you can pick the versions here: stable, LTS, preview; your operating system; your

  • 00:20:37 CUDA version; Python; C++. And it is going to give you a download link. Okay,

  • 00:20:52 and I'm going to select pip and this is it. Okay, this is the same link that I have just

  • 00:20:59 copied, and just click install. Okay, it is going to download all the necessary files and install

  • 00:21:07 them. I think this was over two gigabytes if I remember correctly. It says collecting torch.

  • 00:21:15 I'll just pause. Okay, so it has been installed successfully. Now we can also set the device

  • 00:21:27 with our transcribe command. So I am going to change this into GPU. Just copy and paste

  • 00:21:34 it and change this, and let's see its speed. Okay, so I will delete the older files. I'm

  • 00:21:44 not sure if it will overwrite them or not, therefore I am deleting them. And let's just

  • 00:21:50 press enter. Okay, and it says that, oh, this was not happening. I think since I uninstalled

  • 00:22:06 and installed again there is a problem. Okay, I have found the error. No matter how

  • 00:22:20 senior we get, we still make such minor mistakes, and they take time to figure out.

  • 00:22:27 You see, I have typed GPU as the device but it should be CUDA. So when I write it as

  • 00:22:36 CUDA, now it will work. Let's delete the existing file and let's run the command with

  • 00:22:47 CUDA, and now you will see how fast it works. It has an initialization period like this

  • 00:22:56 and then it is super fast. Okay, as you can see it's working. Okay, and one final thing

  • 00:23:17 is that I will show you how to do translation. Okay, for translation I will use one of my

  • 00:23:27 Turkish videos. Okay, I also have some Turkish lectures, for example here, and let's download

  • 00:23:40 one of the short ones, like the latest one here. Okay, and let's get the command we will just

  • 00:23:49 run. Okay, one second, let's move into our folder. yt-dlp. By the way, for translation to

  • 00:24:02 work you should probably be in the same folder. I'm not sure, but always being in the same folder

  • 00:24:10 is better. It is getting downloaded. Okay, so the download was taking too long, so I decided

  • 00:24:20 to download only the audio file, and you can do that with yt-dlp. So let's just copy the

  • 00:24:32 link again and we are going to add -F, with an uppercase F, and it will give us all the options,

  • 00:24:43 as you can see. So I will just download the audio file, which is, let's see, audio only. Let's

  • 00:24:56 download the best one. So the best audio is this one I think, yes. Or the best one is this

  • 00:25:13 one. So let's just give the command like this. Yes, now it will download only the audio

  • 00:25:28 of the video, but I am not sure if Whisper supports this format. It could be

  • 00:25:44 a problem. It is also taking time. Okay, so the download has been completed. Let's test whether

  • 00:26:12 Whisper is able to use M4A audio. So I will name this lecture 14 TR. Okay, and

  • 00:26:36 let's try it. Okay, I won't give an output folder. Actually, I will also give an output folder. Yes, I will

  • 00:26:44 use device CUDA, and first let's try transcribing. Oh, by the way, we should probably cut it. I

  • 00:27:01 wonder if LosslessCut can cut it. Yeah, probably. Let's cut the first three minutes. Oh, it can't

  • 00:27:15 cut it. Okay, we need to cut this. Okay, so this is the command to cut an audio file with FFmpeg.

  • 00:27:37 It is cut immediately, I think, because we didn't re-encode. Yes, here. Now we can use this short

  • 00:27:45 file, which is 180 seconds. Okay, so first we will transcribe it, then we will translate

  • 00:27:56 it. This is in Turkish, by the way. It is not downloading the model because I have already

  • 00:28:07 downloaded it and it uses the cache, but it automatically downloads it if you don't have

  • 00:28:18 the model in the cache, so it is not a problem. Okay, it

  • 00:28:24 is still initializing, I think. Yeah, we can see the GPU VRAM usage. Okay, you

  • 00:28:34 see, currently it is printing the lecture. It is in Turkish, so it is printing in Turkish.

  • 00:28:42 It can be any language, any one of the supported languages. Let me show you the full list of

  • 00:28:47 the supported languages. Okay, let's open a command prompt, type whisper, then type --help.

  • 00:28:56 Yeah, and these are all the languages that it supports. It supports both

  • 00:29:11 the language code and the full language name, like Afrikaans, Albanian, Amharic, Arabic, Armenian,

  • 00:29:20 Azerbaijani, Basque, Dutch, English, Danish, Estonian, Finnish. I think there are 99 languages. Okay,

  • 00:29:33 so it is still processing. I think non-English processing is a little bit slower than English

  • 00:29:42 itself. By the way, it supports M4A audio as well. Okay, it should get done in a minute.

  • 00:29:56 Yeah, okay, we have cut it to three minutes. Okay, so it has been completed. Now time to

  • 00:30:11 translate. We can see the generated files here, like this. Okay, I will just delete them and

  • 00:30:22 now let's run the translation command, which will be like this. Okay. Okay, now it will translate

  • 00:30:37 this text into English. Unfortunately, it only supports translation to English from other

  • 00:30:45 languages. It doesn't support translation from English

  • 00:30:53 to non-English languages. That would be super awesome if they supported it, however they

  • 00:31:00 do not support that. Okay, so the first sentence, this one, is translated as this one. I think

  • 00:31:10 it is decent but not the best, because we are using only the small model. With the large

  • 00:31:17 model, I am pretty sure we would have much better translation,

  • 00:31:25 transcription, and speech-to-text generation. Okay, and one final thing is that they are

  • 00:31:41 updating the source code from time to time. You see, the latest commit was 12 hours ago, so you

  • 00:31:50 should update your code from time to time. You can do that with Code, Download ZIP. It is downloaded,

  • 00:32:01 and then go to the Whisper folder. This Whisper folder is located in Python, then Lib, and then

  • 00:32:19 I think it is inside, let me find it. Okay, inside the site-packages folder, and then there is a whisper

  • 00:32:37 folder. Just drag and drop, replace, and okay, it's updated. They are fixing errors. Actually,

  • 00:32:49 they have recently fixed the error in writing the translation to a file, so you really should

  • 00:32:58 pay attention to the latest commit after your first initial download. Okay, this is all.

  • 00:33:08 I would appreciate it if you join and subscribe to my channel. Okay, sorry about that, sorry about

  • 00:33:16 that, and hopefully see you later. End of the video. I am waiting for your comments, opinions,

  • 00:33:24 and questions. I also answer your questions. You can ask through our Discord, and if you

  • 00:33:30 wonder where to find our Discord, you can join our Discord with the link here or from here.

  • 00:33:37 Many of my videos have a Discord link. Actually, I didn't put it on this one, but in many

  • 00:33:47 of them there is a Discord link. Okay, see you.
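The transcript mentions that the generated subtitle file works directly in YouTube and in a media player because it carries correct timestamps for each sentence. As a rough illustration of what such a file contains, the sketch below converts a list of segments into SubRip (.srt) text. The helper names are hypothetical, and the segment shape (dictionaries with `start` and `end` in seconds plus `text`) is an assumption modeled on what Whisper's Python API returns. The Whisper command-line tool already writes subtitle files itself, so this is only for illustration.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    # SubRip uses HH:MM:SS,mmm with a comma before the milliseconds.
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    # Each cue is: running index, time range, text, then a blank line.
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up segments for demonstration only.
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello everyone."},
        {"start": 2.5, "end": 7.0, "text": " Welcome to the channel."},
    ]
    print(segments_to_srt(demo))
```

Saving the result next to the video with the same base name (for example test_video.srt beside test_video.webm) is what lets a player such as Media Player Classic load it automatically, as shown in the video.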
