Cannot detect the regional variations/ dialects of language using Whisper Speech Recognition model #1721

PriyankaKB · 2023-10-18T12:07:34Z


I am trying to detect the regional variations in language using Whisper Speech Recognition Model. Below is the code that I have tried...

import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("output.wav")
audio = whisper.pad_or_trim(audio)

# Make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).float().to(model.device)  # Convert to float32

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the audio
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)

# Print the recognized text
print(result.text)

Output:
Detected language: en
$28,000. $28,000. $28,000. Good luck. John, you ready? I'm ready. Hope you like it. Let's go ahead, George. John. This is sugar.

I want to detect regional variations/dialects such as "en-US", "en-GB", "en-AU", etc., per country/region.
Is it possible to detect such dialects with the Whisper Speech Recognition Model?

Please help. Also, please suggest whether there is any way to integrate such functionality with the Whisper model...

Thanks in advance!

phineas-pta · 2023-10-18T17:18:06Z


u cannot, there's no model to do that

6 replies

jasgithub101 Dec 26, 2024

u cannot, there's no model to do that

Hey @phineas-pta
I'm working on my uni final year project, can you answer this for me. I'd really really appreciate it

Would it be possible at all to fine-tune the Whisper model on various dialects of a language? I have audio, transcriptions and dialect names in a dataset. Is there any way to add new tokens to the model and, using the logits from the model output, retrieve the probability that a test audio clip belongs to each of those dialect-specific tokens?
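As a small illustration of the "probabilities from logits" step asked about here, the sketch below applies a softmax to a vector of per-dialect logits. It is not tied to Whisper: the dialect labels and the logit values are made up for the example, and a real classifier head would have to produce the logits.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label set and made-up logits, for illustration only.
DIALECTS = ["en-US", "en-GB", "en-AU"]
logits = [2.0, 0.5, -1.0]

probs = softmax(logits)
ranked = sorted(zip(DIALECTS, probs), key=lambda pair: pair[1], reverse=True)
predicted = ranked[0][0]

for dialect, p in ranked:
    print(f"{dialect}: {p:.3f}")
print("Predicted dialect:", predicted)
```

The same operation is available as `torch.softmax(logits, dim=-1)` when the logits are a PyTorch tensor.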

phineas-pta Dec 26, 2024

token list for languages is basically frozen, u can see it here: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py

of course u can create a new tokenizer, but that means re-training the model from scratch

u should ask your professor whether there's a model for dialect recognition, it'd be the most suitable for the task

jasgithub101 Dec 27, 2024

@phineas-pta is there any other way to do this?
Here I'm trying to fine-tune the model on multiple dialects of a language that Whisper has already been trained on, so I suspect I can reuse that language's tokenizer.
How can I fine-tune the model on the dialects and get the predicted dialect from the probabilities computed from the logits or something similar, if that's possible?
I asked GPT for help and it gave me these 2 versions of the model code. I suspect GPT is just hallucinating, so can you please confirm whether this is the right way:

import torch.nn as nn
from transformers import WhisperConfig, WhisperForConditionalGeneration

class WhisperForDialectClassification(WhisperForConditionalGeneration):
    def __init__(self, config: WhisperConfig):
        super().__init__(config)
        # Add a classification head for dialect classification (3 classes)
        self.classification_head = nn.Linear(config.d_model, 3)  # Adjust 3 for the number of dialects

    # forward must be defined at class level (not nested inside __init__) to override the parent method
    def forward(self, input_features=None, attention_mask=None, decoder_input_ids=None, labels=None):
        # Pass through the Whisper model (encoder-decoder)
        outputs = self.model(
            input_features=input_features,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids  # Ensure this is passed
        )

        # Extract the last hidden state from the encoder
        hidden_states = outputs.encoder_last_hidden_state

        # You can skip this part if you are using a classification head
        if self.classification_head:
            logits = self.classification_head(hidden_states[:, 0, :])  # Uses only the first encoder frame for classification
            return logits

        return outputs

class WhisperForDialectClassification(WhisperForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        # Add a classification head on top of the model
        self.classification_head = nn.Linear(config.d_model, 3)  # Assuming 3 dialects (or classes)

    def forward(self, input_features=None, attention_mask=None, labels=None):
        # Pass through the base Whisper model's encoder
        encoder_outputs = self.model.encoder(input_features=input_features, attention_mask=attention_mask)

        # Get the encoder output (the hidden states)
        hidden_states = encoder_outputs.last_hidden_state

        # Apply the classification head
        logits = self.classification_head(hidden_states[:, -1, :])  # Use the last hidden state

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            # Use the head's own output size; config.num_labels may not match the 3 classes above
            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

        return (loss, logits) if loss is not None else logits

jasgithub101 Dec 28, 2024

@phineas-pta can you please take a look into this.

phineas-pta Dec 28, 2024

seems like it's possible to do audio classification without modifying the Whisper architecture: https://discuss.huggingface.co/t/whisper-for-audio-classification/62376
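One way to try the audio-classification route is the `WhisperForAudioClassification` class in Hugging Face `transformers`, which puts a pooled classification head on the Whisper encoder. The sketch below is only a shape check under stated assumptions: it uses a tiny, randomly initialised config so nothing is downloaded, random tensors in place of real log-Mel features, and a hypothetical three-dialect label set. It is not a trained dialect recogniser and may not match what the linked thread describes.

```python
import torch
from transformers import WhisperConfig, WhisperForAudioClassification

# Hypothetical label set, for illustration only.
DIALECTS = ["en-US", "en-GB", "en-AU"]

# Tiny random config so the example runs offline; real use would start from
# pretrained weights instead (see the note after the block).
config = WhisperConfig(
    d_model=32,
    encoder_layers=1,
    encoder_attention_heads=2,
    encoder_ffn_dim=64,
    decoder_layers=1,
    decoder_attention_heads=2,
    decoder_ffn_dim=64,
    num_mel_bins=80,
    max_source_positions=50,
    num_labels=len(DIALECTS),
)
model = WhisperForAudioClassification(config)
model.eval()

# Fake batch of 2 clips: (batch, mel bins, frames). The encoder expects
# 2 * max_source_positions frames because its second conv has stride 2.
features = torch.randn(2, config.num_mel_bins, 2 * config.max_source_positions)
labels = torch.tensor([0, 2])  # made-up dialect indices

with torch.no_grad():
    out = model(input_features=features, labels=labels)

probs = out.logits.softmax(dim=-1)  # per-clip dialect probabilities
for row in probs:
    print({d: round(p.item(), 3) for d, p in zip(DIALECTS, row)})
print("cross-entropy loss:", out.loss.item())
```

For actual fine-tuning, the model would presumably be loaded with something like `WhisperForAudioClassification.from_pretrained("openai/whisper-base", num_labels=3)` and trained on labelled dialect audio; the checkpoint name and label count here are assumptions.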
