Hi all 👋,
I'm currently transcribing 25–30 minute MP3 files using Whisper, and I'm facing some challenges with the output. I'd appreciate your insights on optimizing the settings for the best results.
Key Information:
Free Google Colab: I currently use the free Colab tier, mainly for cost reasons but also because I actually want to learn something. I don't really want to spend a buck, so "get a paid service", "use the OpenAI API", etc. are not really helpful solutions for me.
Input Format: MP3 files, since raw video files pose storage challenges (I run everything on free Google Colab). I use the following ffmpeg command for conversion:
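Something along these lines (the file names and bitrate here are illustrative, not my exact values):

```shell
# Extract the audio only (-vn), downmix to mono, and resample to 16 kHz --
# Whisper resamples everything to 16 kHz mono internally anyway, so nothing
# useful is lost -- then encode at a modest bitrate to keep Colab storage low.
ffmpeg -i episode.mp4 -vn -ac 1 -ar 16000 -b:a 64k episode.mp3
```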
File Length: The files are sitcom audio in Polish, and each file runs 25 to 30 minutes.
Splitting Considerations: I'm exploring ways to optimize the transcription process without manual splitting. From what I understand, Whisper automatically splits the audio into 30-second chunks internally. Could this behavior be improved by splitting the files manually?
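To make the splitting idea concrete: since Whisper already slides a 30-second window over the whole file, manual splitting mainly buys you parallelism or cleaner cut points. The kind of layout I have in mind is overlapping chunks, so no word falls exactly on a boundary (this helper and its overlap value are my own sketch, not anything from the Whisper API):

```python
def chunk_spans(total_s: float, chunk_s: float = 30.0, overlap_s: float = 2.0):
    """Return (start, end) times covering total_s seconds with overlapping chunks.

    The overlap gives each chunk a little context from its neighbour, so a
    word cut at one boundary still appears whole in the adjacent chunk.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk must be longer than the overlap")
    spans, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        spans.append((start, min(start + chunk_s, total_s)))
        start += step
    return spans
```

When stitching the per-chunk transcripts back together, the overlapping seconds are transcribed twice, so the duplicated words in the overlap region have to be deduplicated afterwards.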
Current Observations: I've noticed some issues with the current settings. Specifically, certain words come out with letters swapped for other, similar-sounding ones, producing non-existent words in the output.
There is also the matter of grammatical cases in Polish: there are seven of them, and nouns, adjectives, etc. decline through all of them. I've spotted that, not that rarely, words are transcribed as if they had been spoken in a different case. I suspect this is because some of the endings that change between cases sound softer than others - say, "ch" is pronounced more softly than "r" - and quite often a word-final "ch" is simply dropped.
Questions:
Are there specific Whisper settings that work well for my case - sitcom audio in Polish, delivered as somewhat compressed MP3 files? Does the audio format even matter?
Given the length of the files, what strategies do you recommend for handling them without manual splitting? Are there parameters worth adjusting dynamically during processing? Could manual tinkering even help?
Has anyone encountered and resolved such character-swapping issues in the transcription output, particularly for Polish? Any advice on mitigating these discrepancies?
Any other thoughts on handling longer audio content, such as audio coming from show episodes?
Here's how I currently use Whisper in my code. It's not a lot, but I want to be complete about my usage and config:
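It boils down to roughly the following. This is a reconstruction for the post, not my exact script: the `polish_options` helper is something I made up here, and the model size and decoding options are illustrative; the `whisper` calls themselves are the standard API from `openai-whisper`.

```python
def polish_options(prompt=None):
    """Decoding options I'd pass for Polish sitcom audio (illustrative)."""
    opts = {
        "language": "pl",        # skip language auto-detection
        "task": "transcribe",
        "temperature": 0.0,      # deterministic; the library default is a fallback schedule
        "beam_size": 5,          # beam search can help with inflected word endings
        "condition_on_previous_text": True,  # carry context across the 30 s windows
    }
    if prompt:
        # A short Polish prompt (character names, expected spellings) can
        # nudge the decoder toward the right word forms.
        opts["initial_prompt"] = prompt
    return opts

def transcribe(path):
    import whisper  # heavy import kept local; pip install -U openai-whisper
    model = whisper.load_model("medium")  # "large-v3" is more accurate if VRAM allows
    return model.transcribe(path, **polish_options())["text"]
```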
Your insights will be immensely helpful.
Thank you in advance!