Whisper Large v3 running in real-time on a M2 Macbook Pro #2656
RRUK01 started this conversation in Show and tell
[Video attachment: ScribeAI.MacOS.Technical.Demo.Large.v3.Short.mp4]
I've been working on optimising the Whisper models over the past 2-3 years, getting them to run in real time, and wanted to share my progress.
This is Whisper Large v3 running in real time on an M2 MacBook Pro, with roughly 350-600ms latency for live/hypothesis (cyan) requests and 900-1200ms for completed (white) requests. It also runs on an iPhone: on my iPhone 14 Pro, latency is 650-850ms for live requests and about 1900ms for completed requests. I've published an app if anyone wants to try it on their own iPhone; it requires a device with at least 6GB of RAM to run well.
The main breakthrough is a way to run the encoder about 3-4x faster without significant quantisation. The model is quantised to Q8 to save disk space, but runs just as well at FP16. At the moment the encoder runs on the Apple Neural Engine at about 150ms per pass, compared with roughly 500ms for a naive 'ANE-optimised' implementation. There are many more optimisations involved, including making the live output more stable between requests. The optimisations work for the whole Whisper family of models.
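The post doesn't include code, but one common way to stabilise streaming output between requests is a local-agreement scheme: only promote tokens to the committed (white) transcript once two consecutive hypotheses agree on them, leaving the rest as the live (cyan) tail. A minimal sketch of that idea; the class and function names are my own illustration and not ScribeAI's actual implementation:

```python
def stable_prefix(prev_tokens, curr_tokens):
    """Return the longest common prefix of two consecutive hypotheses.

    Tokens in this prefix have agreed across two decoder runs, so they
    can be committed as completed (white) text; the rest stays live (cyan).
    """
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return curr_tokens[:n]


class LiveTranscript:
    """Tracks committed text versus the still-changing hypothesis tail."""

    def __init__(self):
        self.committed = []   # tokens already shown as completed
        self.prev_tail = []   # previous hypothesis tail, for agreement

    def update(self, hypothesis):
        """Feed the latest full hypothesis; return (committed, live) tokens."""
        # Only consider tokens past what is already committed.
        tail = hypothesis[len(self.committed):]
        agreed = stable_prefix(self.prev_tail, tail)
        self.committed.extend(agreed)
        live = tail[len(agreed):]
        self.prev_tail = live
        return self.committed, live
```

In practice a streaming decoder would also handle retranscription windows and token timestamps, but the core trade-off is visible here: committing only agreed tokens keeps the white text from flickering, at the cost of one extra request of latency before text is finalised.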
It took a lot of work to get it to this point, and I just wanted to share it with the community. I plan on writing up a blog post, which I'll publish here when time permits, and, if there's interest, on open-sourcing an SDK so people can try this out for themselves or use it in their own apps.