Work in progress for batching with multiple audio files [WIP] #1359
Conversation
This work continues where SYSTRAN#1302 left off. The goal is to transcribe multiple audio files truly in parallel and increase GPU throughput. For more information, please refer to the pull request.
Hi, and thanks for the effort. I guess adding multiple audio batching to …
Would you mind linking to the implementation you had in mind? I get a little lost navigating the transformers library. Also, if I make changes to the regular transcription, then I would only be supporting batching for cases 3 and 4 below, correct?
In case 2, for example, I believe that the normal transcription calls Whisper in 30-second chunks sequentially. So I guess in case 4 we would have a sliding window of 30 seconds across a batch of size N, where N corresponds to the number of audio files? EDIT: Also, I think the main benefit this PR addresses in the …
I'll explain the current approaches for batching in both short-form (<30 s) and long-form (>30 s) audio: …
Hey @j-silv, the WhisperS2T library does what you're describing, i.e. it batches multiple files together. It's my understanding that faster-whisper batches per file, and the main difference is that WhisperS2T batches across multiple files. Each has its pluses and minuses: faster-whisper has some features that WhisperS2T doesn't, and WhisperX is better for timestamps/captions. But as far as sheer speed when processing numerous files, you can't beat WhisperS2T, especially if there are a lot of shorter files, since with faster-whisper the more files there are, the more overhead you encounter. Here's my repo where I use WhisperS2T specifically for multiple files.
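For reference, batching several files through WhisperS2T looks roughly like the sketch below. This is adapted from memory of the WhisperS2T README, so treat the exact function and argument names as assumptions to verify against that project's docs:

```python
import whisper_s2t

# Load a CTranslate2-backed Whisper model (names follow the WhisperS2T README
# as best I recall; verify against the library before relying on this).
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["audio_1.wav", "audio_2.wav", "audio_3.wav"]
lang_codes = ["en"] * len(files)
tasks = ["transcribe"] * len(files)
initial_prompts = [None] * len(files)

# All files are fed through VAD and the model together, batched across files
# rather than transcribed one at a time.
out = model.transcribe_with_vad(
    files,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=16,
)
```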
Concerns issues #915, #1177, #1264, #1192, #840, #1330, #59, #66
This work continues where PR #1302 left off. The goal is to transcribe multiple audio files truly in parallel and increase GPU throughput.
This PR is not done yet, but I wanted to give a preview and discuss some of the choices I have made so far.
Here's an overview of what I've done so far:
- Batch a list of files (`batch_audio_files()`) and pad each decoded audio to the max length within that batch. Users can also just pass in an already-batched numpy array, where it's assumed that the first dimension corresponds to different audio file content. (See the sketch after this list.)
- I modified the `transcribe` function in `BatchedInferencePipeline` to support batching across multiple files, while still enabling batching across an individual large file. This was a little tricky, but essentially we keep track of how many chunks correspond to each audio file within a batch (added data member `num_chunks` to `TranscriptionInfo`). Then we flatten everything by stacking the features on top of each other. Finally, we pass it into `_batched_segments_generator`, which doesn't have to change since it already supports batching.
- In addition to performing parallel transcription, I also set up batched VAD. The post-inference algorithm is run for each individual audio file within the batch and the results are appended together.
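For illustration, the padding step in `batch_audio_files()` could look something like the minimal sketch below. It assumes faster-whisper's existing `decode_audio` helper and 16 kHz mono output; the actual implementation in this PR may differ in details:

```python
import numpy as np

from faster_whisper.audio import decode_audio  # existing faster-whisper helper


def batch_audio_files(paths, sampling_rate=16000):
    """Decode each file and zero-pad it to the longest clip in the batch.

    Returns an array of shape (num_files, max_samples); the first dimension
    indexes the audio files, matching the layout the batched transcribe expects.
    """
    clips = [decode_audio(path, sampling_rate=sampling_rate) for path in paths]
    max_len = max(len(clip) for clip in clips)

    batch = np.zeros((len(clips), max_len), dtype=np.float32)
    for i, clip in enumerate(clips):
        batch[i, : len(clip)] = clip
    return batch
```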
Some of the decisions I made:
- There are some workarounds in the issues I linked to support multiple audio files, for example chunking the decoded audio, padding it, and then manually passing in some clip or VAD parameters. I think the main downside to this is that you'd have to do a lot more work to be able to run VAD on the individual audio clips in a batch, and there's much more work for the user.
- Right now, the user is responsible for regrouping the flattened segments array so that the transcriptions for each audio file are independent. I return a list of generators (one for each batch) in order to do this, and I then use `num_chunks` to know when to stop processing segments for a particular audio file (see the sketch after this list). See `test_transcribe::test_batched_transcribe_many` for more info. I wasn't sure of the best way to approach this, since the original code yields a single segment at a time. Happy to discuss if there are better methods; ideally, I'd think the user should not have to worry about regrouping a flat segment array...
- It is assumed that all audio files are in the same language. In theory you could support batching different languages by instantiating a new `Tokenizer` within the batch loop, with the language indexed from an array (provided by the user or detected automatically). I'm just not sure how common that use case is.
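To make the regrouping concrete, here is a rough, purely illustrative sketch of what a caller might do. It assumes the flat stream yields one item per chunk in file order and that the per-file chunk counts are exposed, e.g. as a list on the returned info object; the real yield granularity and API in this PR may differ:

```python
from itertools import islice


def regroup_segments(flat_segments, num_chunks_per_file):
    """Split a flat stream of per-chunk results back into per-file lists.

    Assumes `flat_segments` yields one item per chunk, in file order, and that
    `num_chunks_per_file[i]` is how many chunks file i contributed to the batch.
    """
    it = iter(flat_segments)
    return [list(islice(it, n)) for n in num_chunks_per_file]


# Hypothetical usage (names illustrative, not the final API):
# segments, info = pipeline.transcribe(batched_audio, language="en")
# per_file = regroup_segments(segments, info.num_chunks)
```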
Some pending todos:
- I only handled batch transcription when the language is specified, VAD is enabled, and `clip_timestamps` is not provided. I have to make some modifications so that the other conditional branches in `transcribe` work (multiple tests are failing because of this).
- I have not exhaustively tested the batching to make sure the results are as expected. I think an additional test to run is to perform a batch transcribe with 2 batched audio files, and then compare the results with calling normal transcribe twice, once for each of those audio files separately (see the sketch after this list).
- Run performance tests on GPU to make sure batching results in a speed-up (also make sure utilization goes up as expected).
- Documentation needs to be written up on how to use this new batch mode.
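For what it's worth, that comparison test might be sketched roughly as below. The single-file calls use the existing `BatchedInferencePipeline.transcribe` API; the multi-file call and its return shape are assumptions about this PR's eventual interface and would need to be adjusted to the final signature:

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline


def test_batched_transcribe_matches_single():
    files = ["audio_a.wav", "audio_b.wav"]  # illustrative paths

    model = WhisperModel("tiny")
    pipeline = BatchedInferencePipeline(model)

    def single_texts(audio):
        # Existing single-file API: returns (segment generator, info).
        segments, _ = pipeline.transcribe(audio, language="en")
        return [segment.text for segment in segments]

    expected = [single_texts(path) for path in files]

    # Assumed multi-file interface from this PR: per-file segment generators
    # plus per-file info; the final return shape and the num_chunks-based
    # regrouping may well differ.
    segments_per_file, _infos = pipeline.transcribe(files, language="en")
    actual = [[segment.text for segment in gen] for gen in segments_per_file]

    assert actual == expected
```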
Please share any concerns or suggestions you have!