adjusting srt length in chinese #1332

raccoonchiu · 2023-05-10T02:42:25Z

Hello,

I'm currently utilizing Whisper for transcribing a video file. I'm working with the large-v2 model and transcribing in Chinese (Traditional), using the command:

whisper 1111.mp4 --model large-v2 --language "Chinese" --initial_prompt "one talk one line"

The transcription I'm receiving is accurate, but each caption is quite long, which is not ideal for my application. I've tried various methods to adjust the caption length, but I haven't found an effective solution.

Here's an example of the output I'm currently getting:
[00:01.000 --> 00:11.000] 如果肝脏的库珀细胞功能出问题了,它就会突破我们的最后一道防线,开始启动我们的免疫力,清楚了吧?
[00:12.000 --> 00:28.000] 来自食物中的蛋白质,即便偶尔没有完全被消化,它的血首先是到我们的肝脏,过了肝脏才进入我们的大循环,肝脏会提供给机体这个信息,它那个细胞叫库珀细胞,说这是它平时吃的东西,今天你看它吃了八碗饭,不应当。
...
I'd like to know if there's a way to adjust the length of the captions. Ideally, each caption would be shorter and easier to read.

Any help or guidance would be greatly appreciated!

Thanks in advance.

ryanheise · 2023-05-10T04:19:29Z

There are some options in the latest git version to wrap at a fixed line width and line count:

whisper --word_timestamps True --max_line_width 20 --max_line_count 2 audio.mp3

If you want a more custom solution, you could write your own Python code to do things such as splitting after each comma. Note that this quickly gets complicated: there may still be long sentences without commas, and you'd need something like spaCy to find more logical split points that mark some kind of semantic boundary.
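
As a rough illustration of the "split after each comma" idea, here is a minimal sketch in plain Python. It is not part of Whisper: the function name `split_caption`, the punctuation set, and the hard-wrap fallback for comma-free runs are all assumptions made for this example. It only splits the text; assigning timestamps to each piece would additionally need word-level timing.

```python
# Punctuation treated as a break point (full-width and ASCII); an illustrative choice.
BREAK_AFTER = ",。?!;、,.?!;"


def split_caption(text: str, max_chars: int = 20) -> list[str]:
    """Split text after punctuation, then hard-wrap pieces longer than max_chars."""
    pieces: list[str] = []
    current = ""
    for ch in text:
        if ch in BREAK_AFTER:
            if not current and pieces:
                # Consecutive punctuation stays attached to the previous piece.
                pieces[-1] += ch
            else:
                pieces.append(current + ch)
            current = ""
        else:
            current += ch
    if current:
        pieces.append(current)

    # Fallback for long runs without punctuation: cut at a fixed width.
    lines: list[str] = []
    for piece in pieces:
        for start in range(0, len(piece), max_chars):
            lines.append(piece[start:start + max_chars])
    return lines


if __name__ == "__main__":
    for line in split_caption("清楚了吧?来自食物中的蛋白质,即便偶尔没有完全被消化,"):
        print(line)
```

The fixed-width fallback is the crude part: it can cut in the middle of a word, which is where a tokenizer such as spaCy would come in.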

10 replies

ryanheise May 11, 2023

Assuming you are using the latest version from git and that didn't work, I'd be happy to take a look at it for you if you share a URL to an audio file you would like me to test it on.

raccoonchiu May 12, 2023
Author

https://mega.nz/file/HolRxZAZ#wnABGQijDt70UN0v5wldM3i2FSOY91_K_OcRrVbAXpo
I put it on mega.nz.
Thanks for helping.

raccoonchiu May 12, 2023
Author

I've tried using
result = model.transcribe(audio_file_path, language="Chinese", initial_prompt="one talk one line", compression_ratio_threshold=0.2)
and the result is:

1
00:00:00,700 --> 00:00:04,880
如果肝脏的库珀细胞功能出问题了,

2
00:00:04,880 --> 00:00:10,500
它就会突破我们的最后一道防线,开始启动我们的免疫力免疫挡。

3
00:00:10,500 --> 00:00:15,500
清楚了吧?来自食物中的蛋白质,即便偶尔没有完全被消化,

4
00:00:15,500 --> 00:00:20,000
它的血首先是到我们的肝脏,过了肝脏,才进入我们的大循环。

The captions come out shorter, but I don't know why.

ryanheise May 12, 2023

I tried your wav file and the options were working fine:

$ ffmpeg -t 30 -i 123.wav 123-30.wav        # trim to 30 seconds
$ whisper --model tiny --word_timestamps True --max_line_width 20 --max_line_count 2 123-30.wav
$ cat 123-30.srt

1
00:00:01,380 --> 00:00:10,000
如果干证的哭吧,习惯功能充分替了他就会突
破我们的最后一道发现开始启动不得没应党

2
00:00:10,640 --> 00:00:17,140
清楚了吧?那就死不住了一代买子,即便偶
尔,没有完全被消化他都学首先是当我们的干

3
00:00:17,140 --> 00:00:24,520
仗,过了干仗才知道我们大循环干仗会提供给
积极这个新鲜他的个细胞叫哭吧,是吧这是他

4
00:00:24,520 --> 00:00:28,760
平日撤到东西,今天你看他撤到八百分不应党

Although the quality of the "tiny" model is lower, I am just testing the line widths here, and every line is indeed wrapped at a maximum of 20 characters and a maximum of 2 lines.
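
A check like the one described above can also be scripted instead of done by eye. The following is a minimal sketch, not a Whisper utility: `check_srt` is a hypothetical helper that assumes a simple SRT layout (index line, timing line, text lines, blank line between cues) and counts characters with `len()`, so one Chinese character counts as one.

```python
import re


def check_srt(srt_text: str, max_line_width: int = 20, max_line_count: int = 2) -> list[int]:
    """Return the cue numbers whose text breaks the width or line-count limit."""
    offending: list[int] = []
    # Normalise line endings, then split cues on blank lines.
    normalised = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (index, timing, text)
        index, _timing, *text_lines = lines
        too_many = len(text_lines) > max_line_count
        too_wide = any(len(line) > max_line_width for line in text_lines)
        if too_many or too_wide:
            offending.append(int(index))
    return offending


if __name__ == "__main__":
    with open("123-30.srt", encoding="utf-8") as f:
        print(check_srt(f.read(), max_line_width=20, max_line_count=2))
```

An empty list means every cue respects both limits.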

Answer selected by raccoonchiu

raccoonchiu May 12, 2023
Author

I get different results when I follow your instructions. I am using Ubuntu, and the output I see is below. I'm not sure how to determine the cause of the discrepancy.
whisper --model tiny --word_timestamps True --max_line_width 20 --max_line_count 2 123-30.wav
Detecting language using up to the first 30 seconds. Use --language to specify the language
Detected language: Chinese
[00:01.380 --> 00:04.000] 如果干证的哭吧,习惯功能充分替了
[00:04.940 --> 00:07.220] 他就会突破我们的最后一道发现
[00:07.220 --> 00:10.000] 开始启动不得没应党
[00:10.640 --> 00:11.940] 清楚了吧?
[00:12.100 --> 00:15.500] 那就死不住了一代买子,即便偶尔,没有完全被消化
[00:15.500 --> 00:19.260] 他都学首先是当我们的干仗,过了干仗才知道我们大循环
[00:19.260 --> 00:22.040] 干仗会提供给积极这个新鲜
[00:22.040 --> 00:23.760] 他的个细胞叫哭吧,是吧
[00:23.760 --> 00:27.300] 这是他平日撤到东西,今天你看他撤到八百分
[00:27.300 --> 00:28.760] 不应党

ryanheise May 12, 2023

There's no discrepancy so far. The console prints the raw segments; the line-width options only apply to the generated subtitle file. You're missing the final step of viewing that file: cat 123-30.srt

raccoonchiu May 12, 2023
Author

It works!
I reinstalled Whisper and now it works.
Thank you for helping!

raccoonchiu May 12, 2023
Author

How can I use --max_line_width 15 --max_line_count 2 in Python?
I get a TypeError when I call result = model.transcribe(full_path, language="Chinese", word_timestamps=True, max_line_count=2)

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_3751/1442053624.py in <module>
67 full_path = os.path.join(input_dir, f)
68 name = os.path.splitext(os.path.basename(f))[0]
---> 69 result = model.transcribe(full_path, language="Chinese", word_timestamps= True, max_line_count= 2)
70 srt_file_name = '{name}.srt'.format(name=name)
71 w(result, srt_file_name)

/usr/local/lib/python3.10/dist-packages/whisper/transcribe.py in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, prepend_punctuations, append_punctuations, **decode_options)
231
232 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 233 result: DecodingResult = decode_with_fallback(mel_segment)
234 tokens = torch.tensor(result.tokens)
235

/usr/local/lib/python3.10/dist-packages/whisper/transcribe.py in decode_with_fallback(segment)
161 kwargs.pop("best_of", None)
162
--> 163 options = DecodingOptions(**kwargs, temperature=t)
164 decode_result = model.decode(segment, options)
165

TypeError: DecodingOptions.__init__() got an unexpected keyword argument 'max_line_count'
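
The traceback shows why this fails: `transcribe()` forwards any keyword it does not recognise to `DecodingOptions`, which has no `max_line_count` field. The line limits appear to belong to the subtitle writer rather than to transcription. The sketch below assumes a Whisper version that provides `whisper.utils.get_writer` and whose writer accepts an options dictionary as its third argument; the writer's signature has changed between releases, so check it against your installed version. `build_writer_options` and `transcribe_to_srt` are hypothetical helper names for this example.

```python
def build_writer_options(max_line_width: int = 15,
                         max_line_count: int = 2,
                         highlight_words: bool = False) -> dict:
    """Collect the wrapping settings the CLI would otherwise pass to the writer."""
    return {
        "max_line_width": max_line_width,
        "max_line_count": max_line_count,
        "highlight_words": highlight_words,
    }


def transcribe_to_srt(audio_path: str, output_dir: str = ".",
                      model_name: str = "large-v2", **wrap_settings) -> dict:
    """Transcribe with word timestamps, then let the SRT writer do the wrapping."""
    # Imported here so the helper above can be used without Whisper installed.
    import whisper
    from whisper.utils import get_writer

    model = whisper.load_model(model_name)
    # Word timestamps are needed for line wrapping; no max_line_* arguments here.
    result = model.transcribe(audio_path, language="Chinese", word_timestamps=True)

    # Assumption: the writer takes (result, audio_path, options_dict).
    writer = get_writer("srt", output_dir)
    writer(result, audio_path, build_writer_options(**wrap_settings))
    return result
```

Calling `transcribe_to_srt("123-30.wav", max_line_width=15, max_line_count=2)` would then write `123-30.srt` into the output directory, assuming the API matches the description above.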
