Comparing WhisperX and Faster-Whisper on RunPod: Speed, Accuracy, and Optimization #1066

@yccheok

Description

Recently, I compared the performance of WhisperX and Faster-Whisper on a RunPod server using the following code snippets.

WhisperX

import time

import whisperx

# Load WhisperX model
model = whisperx.load_model("large-v3", "cuda")

def run_whisperx_job(job):
    start_time = time.time()

    job_input = job['input']
    url = job_input.get('url', "")

    print(f"🚧 Loading audio from {url}...")
    audio = whisperx.load_audio(url)
    print("✅ Audio loaded")

    print("Transcribing...")
    result = model.transcribe(audio, batch_size=16)

    end_time = time.time()
    time_s = (end_time - start_time)
    print(f"🎉 Transcription done: {time_s:.2f} s")
    #print(result)

    # For easy migration, we follow the output format of RunPod's
    # official faster-whisper worker.
    # https://github.com/runpod-workers/worker-faster_whisper/blob/main/src/predict.py#L111
    output = {
        'detected_language' : result['language'],
        'segments' : result['segments']
    }

    return output
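Both handlers repeat the same start_time/end_time bookkeeping. A small context manager (a hypothetical helper, not part of either worker) would factor that out:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the elapsed wall-clock time for the wrapped block."""
    start = time.time()
    try:
        yield
    finally:
        print(f"{label}: {time.time() - start:.2f} s")
```

Usage inside either handler: `with timed("🎉 Transcription done"): result = model.transcribe(audio, batch_size=16)`.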

Faster-Whisper

import os
import time

from faster_whisper import WhisperModel
from runpod.serverless.utils import rp_cleanup
from runpod.serverless.utils.rp_download import download_files_from_urls

# Load Faster-Whisper model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def run_faster_whisper_job(job):
    start_time = time.time()
    
    job_input = job['input']
    url = job_input.get('url', "")

    print(f"🚧 Downloading audio from {url}...")
    audio_path = download_files_from_urls(job['id'], [url])[0]
    print("✅ Audio downloaded")
    
    print("Transcribing...")
    segments, info = model.transcribe(audio_path, beam_size=5)
    
    output_segments = []
    for segment in segments:
        output_segments.append({
            "start": segment.start,
            "end": segment.end,
            "text": segment.text
        })
    
    end_time = time.time()
    time_s = (end_time - start_time)
    print(f"🎉 Transcription done: {time_s:.2f} s")
    
    output = {
        'detected_language': info.language,
        'segments': output_segments
    }
    
    # ✅ Safely delete the file after transcription
    try:
        if os.path.exists(audio_path):
            os.remove(audio_path)
            print(f"🗑️ Deleted {audio_path}")
        else:
            print("⚠️ File not found, skipping deletion")
    except Exception as e:
        print(f"❌ Error deleting file: {e}")

    rp_cleanup.clean(['input_objects'])

    return output

General Findings

  • WhisperX is significantly faster than Faster-Whisper.
  • WhisperX can process long-duration audio (3 hours), whereas Faster-Whisper fails with an unexplained runtime error. My guess is that Faster-Whisper needs more GPU memory to complete the job.
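If the 3-hour failures really are memory-related, one common workaround (an assumption on my part, not something verified in this thread) is to split the audio into overlapping chunks and transcribe each chunk separately, then stitch the segments back together. Computing the chunk boundaries is straightforward:

```python
def chunk_bounds(total_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Yield (start, end) windows covering [0, total_s], each at most
    chunk_s seconds long and overlapping the previous window by overlap_s
    seconds so words straddling a cut are not lost."""
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s
```

Segments from overlapping regions then need de-duplication when merging, which is the fiddly part of this approach.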

Accuracy Observations

  • WhisperX is less accurate than Faster-Whisper.
  • WhisperX drops more words than Faster-Whisper.

Optimization Questions

I was wondering which WhisperX parameters I could experiment with or fine-tune in order to:

  • Improve accuracy
  • Reduce missing words
  • Avoid significantly increasing processing time
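In case it helps, the levers I would try first, as a sketch only: the parameter names below are assumed from whisperx.load_model's options in recent versions, and the idea that the missing words stem from VAD segmentation is my assumption, not something verified here. Raising the decoder's beam_size matches the beam_size=5 used in the Faster-Whisper run, and loosening the VAD thresholds may keep quiet speech at segment edges from being dropped.

```python
# Assumed option names — verify against the whisperx version you run.
asr_options = {
    "beam_size": 5,  # match the beam_size=5 used with Faster-Whisper above
    "best_of": 5,
}
vad_options = {
    "vad_onset": 0.400,   # default ~0.500; lower = quiet speech detected sooner
    "vad_offset": 0.300,  # default ~0.363; lower = segments held open longer
}

# model = whisperx.load_model(
#     "large-v3", "cuda",
#     asr_options=asr_options, vad_options=vad_options,
# )
```

Both changes trade some speed for accuracy, so it is worth re-timing after each one.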

Thank you.
