Skip to content

[whisper] Add Flash Attention and batched decoding for up to 10x speedup #1412

@ilyasmukiev

Description

@ilyasmukiev

Summary

Added Flash Attention and batched segment decoding to mlx-whisper, achieving 9.5x speedup on Apple Silicon.

Changes

1. Flash Attention (whisper.py)

  • Replaced manual QKV attention with mx.fast.scaled_dot_product_attention
  • Conditional path: uses flash attention by default, falls back to standard attention when word_timestamps=True (needs QK weights for DTW alignment)
  • Proper mask handling for autoregressive decoding with KV cache

2. Batched decoding (transcribe.py)

  • New batch_size parameter in transcribe() (default=1, fully backward-compatible)
  • Pre-slices audio into fixed 30s chunks, stacks into batch tensor (N, 3000, n_mels), decodes simultaneously
  • Per-segment temperature fallback for quality control
  • batch_size=1 produces identical output to current code

Zero new dependencies. No breaking changes.

Benchmarks (M2 8GB, whisper-small, 5 min Russian audio)

Mode Time Realtime Factor Speedup
Sequential (batch_size=1) 9.4s 4.8x RT 1x
Batched (batch_size=12) 6.6s 44.8x RT 9.5x

For a 15-hour video: ~20 minutes instead of ~3 hours.

Code

Full implementation with benchmarks: https://github.com/ilyasmukiev/mlx-whisper-pr

  • Branch flash-attention-batch: minimal changes (Flash Attention + batch_size parameter only)
  • Branch full-batching-vad-diarize: adds optional VAD (Silero) and speaker diarization

Standalone package: https://github.com/ilyasmukiev/mlx-whisper-fast

Notes

  • Could not create a PR directly because gh repo fork returns HTTP 502 (repo too large?)
  • Happy to submit a proper PR once the fork works
  • The batched path uses fixed-stride chunking (no dynamic seeking), which is a deliberate trade-off for parallelism — same approach as WhisperX and lightning-whisper-mlx
  • Related: Discussion Unexpected Processing Times for Short vs. Long Audio Files with MLX-Whisper #1275 where batching was acknowledged as possible but not implemented

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions