-
FYI, see these previous discussions.
-
Not entirely sure I fully understand where you are going, but here is a suggestion or two, and perhaps someone more expert can step in if needed.
So the "internal format of whisper" you are referring to is the stream of tokens that are output from the model when processing an audio file; but note that these individual output tokens are not phonemes, they are elements of a special "vocabulary". E.g.:

```python
from whisper.tokenizer import get_tokenizer

t = get_tokenizer(False)   # False selects the English-only tokenizer
print(t.decode([3999]))    # prints the text piece that token id 3999 maps to
print(t.decode([9827]))
```
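To see that these vocabulary elements are sub-word text pieces rather than sounds, you can also go the other way and encode a word into token ids; a small sketch along the same lines (the exact ids you get depend on the tokenizer version):

```python
from whisper.tokenizer import get_tokenizer

t = get_tokenizer(True)                  # True selects the multilingual tokenizer

ids = t.encode(" bună ziua")             # a short Romanian phrase
print(ids)                               # several token ids, not one per sound
print([t.decode([i]) for i in ids])      # each id maps to a piece of text, not a phoneme
```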
-
First of all, I really appreciate your patience, as I am not an expert and you certainly know more than I do. My question is about non-English languages, and it concerns only the transcription, not the language detection nor the translation.
I've read both chapters twice.
I tried it, and it felt like my first "Hello world". So the input to the decoder consists of tokens, which are words, not phonemes nor syllables; they have roughly the length of a word.
For a particular short audio file I would like to see the result of the encoding, before any decoding. Where in the code would I need to change something? That is, no decoding at all, just the result after the final step of the encoding. Thank you very much.
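A minimal sketch of one way to look at just the encoder output, using the public openai/whisper Python API rather than editing the source; the model size and file name below are placeholders:

```python
import whisper

model = whisper.load_model("base")

# load the clip and compute the log-Mel spectrogram the encoder expects
audio = whisper.load_audio("short_clip.wav")   # placeholder file name
audio = whisper.pad_or_trim(audio)             # pad/trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# run only the audio encoder: this is "the result after the final step of the encoding"
audio_features = model.embed_audio(mel.unsqueeze(0))
print(audio_features.shape)   # (1, 1500, 512) for the base model
```

The decoder is never called here, so what you get is the sequence of audio feature vectors that the decoder would later attend to.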
-
Thank you very much. And then I ran the following code to see what 76, 4089, 1889 represent. I guess that these are the tokens. Sometimes they represent one letter, other times a sequence of letters.
Very interesting! This is why I'm surprised to see, from the results of the above code, that the token with id 1889 corresponds to "line" for Romanian. I was expecting that result only after I had loaded at least one model... Or maybe the model is needed for the encoding part but not for the decoding?
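For what it's worth, the tokenizer is independent of the model weights: in recent versions of the package it is a BPE vocabulary loaded via tiktoken and bundled with whisper itself, so mapping ids to text works without downloading any checkpoint. A small sketch, assuming the multilingual tokenizer:

```python
from whisper.tokenizer import get_tokenizer

# no whisper.load_model() anywhere: the vocabulary ships with the package
t = get_tokenizer(True, language="ro", task="transcribe")
print(t.decode([1889]))   # prints whatever text piece id 1889 maps to in this vocabulary
```

The model is what produces the token ids from audio; turning ids back into text is a plain table lookup that the tokenizer can do on its own.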
-
I can't wait to run this. I'll come back after some days, as I'm now learning how to clone the whisper git repository into my Google Colab, where I need to change a line of code at line 377, as you mentioned above, to see the intermediate output. Ideally I would also like to inspect some variables and set a breakpoint at that line 377. I'm exploring how to do this on Google Colab, and then I'll come back with my revelations :), most likely things you already know very well. I'm very excited about it!
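If editing the cloned source in Colab turns out to be awkward, one alternative (just a sketch, with placeholder names) is to attach a PyTorch forward hook to the decoder, which lets you capture its outputs from a notebook cell without modifying whisper's files:

```python
import whisper

model = whisper.load_model("base")
captured = []

def grab_decoder_output(module, inputs, output):
    # called on every forward pass of the text decoder; store the logits
    captured.append(output.detach().cpu())

hook = model.decoder.register_forward_hook(grab_decoder_output)
result = model.transcribe("short_clip.wav", language="ro")   # placeholder file name
hook.remove()

print(result["text"])
print(len(captured), "decoder forward passes captured")
```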
-
If you are thinking about manually correcting the output of a large model like Whisper, I'm pretty sure that's the wrong way: you would be trying to tinker with individual values of a neural net with millions or billions of parameters. The correct and practical way is to fine-tune Whisper on a decent audio dataset of the Romanian language.
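For reference, a minimal sketch of the core training signal for such fine-tuning, using the Hugging Face transformers port of Whisper; it does one gradient step on one dummy example (the audio array and reference text are placeholders, and a real run needs a proper Romanian dataset, a data collator, and a training loop):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="romanian", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# placeholder example: one second of silence and a made-up reference transcription
audio_array = np.zeros(16000, dtype=np.float32)
reference_text = "exemplu de text în limba română"

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(reference_text, return_tensors="pt").input_ids

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()   # minimizing this loss over many real examples is the fine-tuning
```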
-
Hi,
I've tried to read a bit about how whisper works, but I still feel like a novice compared to some of the experts.
Is there a way to generate with whisper a sort of phonetic transcription, meaning not the actual words but a textual representation of the sounds? In the International Phonetic Alphabet (IPA), each sound has its associated character.
IPA is just an example; I would be happy to generate any characters that uniquely identify the sounds uttered.
Based on what I could read from the whisper documentation, those would be called tokens, but I'm not sure whether there are several types of tokens. I'm sure that at the end of the process the "tokens" get mapped to the actual words from the dictionary, and what I would want to achieve is to be able to print these "tokens" before they are transformed into actual words.
I guess this would be useful for testing the models and seeing where things go wrong.
Could any of you help me understand where in the code something needs to be changed? If you have the time to show me how it is done, it would be greatly appreciated; otherwise, at least pointing to the relevant part that needs to be changed would be great.
Thank you very much
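A minimal sketch of printing those token ids before they are rendered as words, assuming the openai/whisper Python API (model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("example.wav")   # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

options = whisper.DecodingOptions(language="ro", without_timestamps=True)
result = whisper.decode(model, mel, options)

print(result.tokens)   # the raw token ids produced by the decoder
print(result.text)     # the same ids rendered as text by the tokenizer
```

Note, though, that these ids index sub-word text pieces from the training vocabulary, not phonemes, so they are not an IPA-style transcription.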