-
FYI, see these previous discussions.
-
Not entirely sure I fully understand where you are going, but here is a suggestion or two, and perhaps someone more expert can step in if needed.
So the "internal format of whisper" you are referring to is the stream of tokens that are output from the model when processing an audio file; but note that these individual output tokens are not phonemes, they are elements of a special "vocabulary". E.g.:

```python
from whisper.tokenizer import get_tokenizer

t = get_tokenizer(False)   # False selects the English-only tokenizer
print(t.decode([3999]))    # prints the text piece that token id 3999 maps to
print(t.decode([9827]))
```
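To see that these vocabulary elements are sub-word text pieces rather than sounds, you can also go the other way and encode a word into token ids; a small sketch along the same lines (the exact ids you get depend on the tokenizer version):

```python
from whisper.tokenizer import get_tokenizer

t = get_tokenizer(True)                  # True selects the multilingual tokenizer

ids = t.encode(" bună ziua")             # a short Romanian phrase
print(ids)                               # several token ids, not one per sound
print([t.decode([i]) for i in ids])      # each id maps to a piece of text, not a phoneme
```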
-
First of all, I really appreciate your patience, as I am not an expert and you certainly know more than I do. My question is about non-English languages, and it concerns only the transcription, not the language detection nor the translation.
I've read both chapters twice.
I tried it, and it felt like my first "Hello world". So the input to the decoder consists of tokens, which are words, not phonemes nor syllables; they have roughly the length of a word.
For a particular short audio file I would like to see the result of the encoding, before any decoding. Where in the code would I need to change something? That is, no decoding at all, just the result after the final step of the encoding. Thank you very much.
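A minimal sketch of one way to look at just the encoder output, using the public openai/whisper Python API rather than editing the source; the model size and file name below are placeholders:

```python
import whisper

model = whisper.load_model("base")

# load the clip and compute the log-Mel spectrogram the encoder expects
audio = whisper.load_audio("short_clip.wav")   # placeholder file name
audio = whisper.pad_or_trim(audio)             # pad/trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# run only the audio encoder: this is "the result after the final step of the encoding"
audio_features = model.embed_audio(mel.unsqueeze(0))
print(audio_features.shape)   # (1, 1500, 512) for the base model
```

The decoder is never called here, so what you get is the sequence of audio feature vectors that the decoder would later attend to.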
-
Thank you very much. And then I ran the following code to see what 76, 4089, 1889 represent. I guess that these are the tokens. Sometimes they represent one letter, other times a sequence of letters.
Very interesting! This is why I'm surprised to see, from the results of the above code, that the token with id 1889 corresponds to "line" for Romanian. I was expecting that result only after I had loaded at least one model... Or maybe the model is needed for the encoding part but not for the decoding?
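For what it's worth, the tokenizer is independent of the model weights: in recent versions of the package it is a BPE vocabulary loaded via tiktoken and bundled with whisper itself, so mapping ids to text works without downloading any checkpoint. A small sketch, assuming the multilingual tokenizer:

```python
from whisper.tokenizer import get_tokenizer

# no whisper.load_model() anywhere: the vocabulary ships with the package
t = get_tokenizer(True, language="ro", task="transcribe")
print(t.decode([1889]))   # prints whatever text piece id 1889 maps to in this vocabulary
```

The model is what produces the token ids from audio; turning ids back into text is a plain table lookup that the tokenizer can do on its own.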
-
I can't wait to run this. I'll come back after some days, as I'm now learning how to clone the whisper git repository into my Google Colab, where I need to change a line of code at line 377, as you mentioned above, to see the intermediate output. Ideally I would also like to inspect some variables and set a breakpoint at that line 377. I'm exploring how to do this on Google Colab, and then I'll come back with my revelations :), most likely things you already know very well. I'm very excited about it!
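If editing the cloned source in Colab turns out to be awkward, one alternative (just a sketch, with placeholder names) is to attach a PyTorch forward hook to the decoder, which lets you capture its outputs from a notebook cell without modifying whisper's files:

```python
import whisper

model = whisper.load_model("base")
captured = []

def grab_decoder_output(module, inputs, output):
    # called on every forward pass of the text decoder; store the logits
    captured.append(output.detach().cpu())

hook = model.decoder.register_forward_hook(grab_decoder_output)
result = model.transcribe("short_clip.wav", language="ro")   # placeholder file name
hook.remove()

print(result["text"])
print(len(captured), "decoder forward passes captured")
```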
-
If you are thinking about manually correcting the output of a large model like Whisper, I'm pretty sure that's the wrong way: you would be trying to tinker with individual values of a neural net with millions or billions of parameters. The correct and practical way is to fine-tune Whisper on a decent audio dataset of the Romanian language.
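For reference, a minimal sketch of the core training signal for such fine-tuning, using the Hugging Face transformers port of Whisper; it does one gradient step on one dummy example (the audio array and reference text are placeholders, and a real run needs a proper Romanian dataset, a data collator, and a training loop):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="romanian", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# placeholder example: one second of silence and a made-up reference transcription
audio_array = np.zeros(16000, dtype=np.float32)
reference_text = "exemplu de text în limba română"

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(reference_text, return_tensors="pt").input_ids

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()   # minimizing this loss over many real examples is the fine-tuning
```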
-
Hi,
I've tried to read a bit about how whisper works, but I still feel like a novice compared to some of the experts.
Is there a way to generate with whisper a sort of phonetic transcription, meaning not the actual words but a textual representation of the sounds? In the International Phonetic Alphabet (IPA), each sound has its associated character.
IPA is just an example; I would be happy to generate any characters that uniquely identify the sounds uttered.
Based on what I could read from the whisper documentation, those would be called tokens, but I'm not sure whether there are several types of tokens. I'm sure that at the end of the process the "tokens" get mapped to the actual words from the dictionary, and what I would want to achieve is to be able to print these "tokens" before they are transformed into actual words.
I guess this would be useful for testing the models and seeing where things go wrong.
Could any of you help me understand where in the code something needs to be changed? If you have the time to show me how it is done, it would be greatly appreciated; otherwise, at least pointing to the relevant part that needs to be changed would be great.
Thank you very much
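A minimal sketch of printing those token ids before they are rendered as words, assuming the openai/whisper Python API (model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("example.wav")   # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

options = whisper.DecodingOptions(language="ro", without_timestamps=True)
result = whisper.decode(model, mel, options)

print(result.tokens)   # the raw token ids produced by the decoder
print(result.text)     # the same ids rendered as text by the tokenizer
```

Note, though, that these ids index sub-word text pieces from the training vocabulary, not phonemes, so they are not an IPA-style transcription.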