When I first tried to handle parallel requests with a single Whisper process, the API returned HTTP 500 errors. So I started the API with 4 Gunicorn workers, but that means 4 copies of the model and it consumes a lot of memory.
I then found issue #360, where the model is loaded on the CPU while the encoder and decoder are placed on different GPUs. That approach only adds a small delay (about 100 ms) and balances the memory, but I would like to balance it across 4 GPUs.
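To make the "4 GPUs" part concrete, here is roughly what I have in mind: pin each Gunicorn worker to its own GPU, so each of the 4 workers loads one model copy on one device. This is only an untested sketch; it assumes a `gunicorn.conf.py` using the `post_fork` hook, that the app loads the model after the fork (no `--preload`), and that 4 GPUs are visible.

```python
# gunicorn.conf.py -- untested sketch: one worker per GPU, one model copy each.
import os

workers = 4

def post_fork(server, worker):
    # worker.age starts at 1 and increments for every worker the master spawns,
    # so this assigns GPUs round-robin. CUDA_VISIBLE_DEVICES must be set before
    # the worker imports the app and initializes CUDA, hence no --preload.
    gpu_id = (worker.age - 1) % 4
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    server.log.info("worker %s -> GPU %s", worker.pid, gpu_id)
```

That would spread the memory over 4 GPUs, but it still needs 4 full copies of the model, which is why I am also asking about the second option.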
Can I split the model across 4 GPUs, or can I load the model once and share it between 4-8 worker processes, similar to how Ollama serves an LLM?
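By "load the model once", I mean something in the spirit of the following single-process sketch: one model copy on one GPU, served by several request threads instead of several Gunicorn processes. Again just a sketch under assumptions: Flask plus the openai-whisper package, a "large-v3" checkpoint, and a lock around the model because I am not sure a single instance is safe to call from concurrent threads.

```python
# app.py -- untested sketch: one process, one model copy, several request threads.
import tempfile
import threading

import whisper
from flask import Flask, jsonify, request

app = Flask(__name__)
model = whisper.load_model("large-v3", device="cuda:0")  # loaded once per process
model_lock = threading.Lock()

@app.route("/transcribe", methods=["POST"])
def transcribe():
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        request.files["audio"].save(tmp.name)
        with model_lock:  # serialize access to the single model instance
            result = model.transcribe(tmp.name)
    return jsonify(text=result["text"])

# run with:  gunicorn -w 1 --threads 8 app:app
```

As far as I know, separate Gunicorn worker processes cannot simply share one set of CUDA weights after forking, so a single process with threads (or an internal request queue) seems to be the usual workaround; please correct me if there is a better way.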
Thanks a lot!