Whisper Large v3 running in real-time on a M2 Macbook Pro #2656
RRUK01 started this conversation in Show and tell
[Video attachment: ScribeAI.MacOS.Technical.Demo.Large.v3.Short.mp4]
I've been working on optimising the Whisper models over the past 2-3 years, getting them to run in real time, and wanted to share my progress.
This is Whisper Large v3 running in real time on an M2 MacBook Pro, with roughly 350-600ms latency for live/hypothesis (cyan) requests and 900-1200ms for completed (white) requests. It also runs on an iPhone: on my iPhone 14 Pro, latency is 650-850ms for live requests and about 1900ms for completed requests. I've published an app if anyone wants to try it on their own iPhone; it requires a device with at least 6GB of RAM to run well.
The main breakthrough is a way to run the encoder about 3-4x faster without significant quantisation. The model is quantised to Q8 to save disk space, but runs just as well at FP16. At the moment the encoder runs on the Apple Neural Engine at about 150ms per pass, compared with roughly 500ms for a naive 'ANE-optimised' implementation. There are many more optimisations involved, including making the live output more stable between requests. The optimisations work for the whole Whisper family of models.
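The post doesn't include code, but one common way to stabilise streaming output between requests is a local-agreement scheme: only promote tokens to the committed (white) transcript once two consecutive hypotheses agree on them, leaving the rest as the live (cyan) tail. A minimal sketch of that idea; the class and function names are my own illustration and not ScribeAI's actual implementation:

```python
def stable_prefix(prev_tokens, curr_tokens):
    """Return the longest common prefix of two consecutive hypotheses.

    Tokens in this prefix have agreed across two decoder runs, so they
    can be committed as completed (white) text; the rest stays live (cyan).
    """
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return curr_tokens[:n]


class LiveTranscript:
    """Tracks committed text versus the still-changing hypothesis tail."""

    def __init__(self):
        self.committed = []   # tokens already shown as completed
        self.prev_tail = []   # previous hypothesis tail, for agreement

    def update(self, hypothesis):
        """Feed the latest full hypothesis; return (committed, live) tokens."""
        # Only consider tokens past what is already committed.
        tail = hypothesis[len(self.committed):]
        agreed = stable_prefix(self.prev_tail, tail)
        self.committed.extend(agreed)
        live = tail[len(agreed):]
        self.prev_tail = live
        return self.committed, live
```

In practice a streaming decoder would also handle retranscription windows and token timestamps, but the core trade-off is visible here: committing only agreed tokens keeps the white text from flickering, at the cost of one extra request of latency before text is finalised.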
It took a lot of work to get it to this point, and I just wanted to share it with the community. I plan on writing up a blog post, which I'll publish here when time permits, and, if there's interest, on open-sourcing an SDK so people can try this out for themselves or use it in their own apps.