When I first tried to handle parallel requests with a single Whisper process, the API returned HTTP 500 errors. So I started the API with 4 Gunicorn workers, but that means 4 copies of the model and it consumes a lot of memory.
I then found issue #360, where the model is loaded on the CPU while the encoder and decoder are placed on different GPUs. That approach only adds a small delay (about 100 ms) and balances the memory, but I would like to balance it across 4 GPUs.
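To make the "4 GPUs" part concrete, here is roughly what I have in mind: pin each Gunicorn worker to its own GPU, so each of the 4 workers loads one model copy on one device. This is only an untested sketch; it assumes a `gunicorn.conf.py` using the `post_fork` hook, that the app loads the model after the fork (no `--preload`), and that 4 GPUs are visible.

```python
# gunicorn.conf.py -- untested sketch: one worker per GPU, one model copy each.
import os

workers = 4

def post_fork(server, worker):
    # worker.age starts at 1 and increments for every worker the master spawns,
    # so this assigns GPUs round-robin. CUDA_VISIBLE_DEVICES must be set before
    # the worker imports the app and initializes CUDA, hence no --preload.
    gpu_id = (worker.age - 1) % 4
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    server.log.info("worker %s -> GPU %s", worker.pid, gpu_id)
```

That would spread the memory over 4 GPUs, but it still needs 4 full copies of the model, which is why I am also asking about the second option.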
Can I split the model across 4 GPUs, or can I load the model once and share it between 4-8 worker processes, similar to how Ollama serves an LLM?
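By "load the model once", I mean something in the spirit of the following single-process sketch: one model copy on one GPU, served by several request threads instead of several Gunicorn processes. Again just a sketch under assumptions: Flask plus the openai-whisper package, a "large-v3" checkpoint, and a lock around the model because I am not sure a single instance is safe to call from concurrent threads.

```python
# app.py -- untested sketch: one process, one model copy, several request threads.
import tempfile
import threading

import whisper
from flask import Flask, jsonify, request

app = Flask(__name__)
model = whisper.load_model("large-v3", device="cuda:0")  # loaded once per process
model_lock = threading.Lock()

@app.route("/transcribe", methods=["POST"])
def transcribe():
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        request.files["audio"].save(tmp.name)
        with model_lock:  # serialize access to the single model instance
            result = model.transcribe(tmp.name)
    return jsonify(text=result["text"])

# run with:  gunicorn -w 1 --threads 8 app:app
```

As far as I know, separate Gunicorn worker processes cannot simply share one set of CUDA weights after forking, so a single process with threads (or an internal request queue) seems to be the usual workaround; please correct me if there is a better way.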
Thanks a lot!