Comparing WhisperX and Faster-Whisper on RunPod: Speed, Accuracy, and Optimization #1066

@yccheok

Description

Recently, I compared the performance of WhisperX and Faster-Whisper on a RunPod server using the following code snippets.

WhisperX

import time

import whisperx

# Load WhisperX model
model = whisperx.load_model("large-v3", "cuda")

def run_whisperx_job(job):
    start_time = time.time()

    job_input = job['input']
    url = job_input.get('url', "")

    print(f"🚧 Loading audio from {url}...")
    audio = whisperx.load_audio(url)
    print("✅ Audio loaded")

    print("Transcribing...")
    result = model.transcribe(audio, batch_size=16)

    end_time = time.time()
    time_s = (end_time - start_time)
    print(f"🎉 Transcription done: {time_s:.2f} s")
    #print(result)

    # For easy migration, we follow the output format of RunPod's
    # official faster-whisper worker.
    # https://github.com/runpod-workers/worker-faster_whisper/blob/main/src/predict.py#L111
    output = {
        'detected_language' : result['language'],
        'segments' : result['segments']
    }

    return output
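Both handlers repeat the same start_time/end_time bookkeeping. A small context manager (a hypothetical helper, not part of either worker) would factor that out:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the elapsed wall-clock time for the wrapped block."""
    start = time.time()
    try:
        yield
    finally:
        print(f"{label}: {time.time() - start:.2f} s")
```

Usage inside either handler: `with timed("🎉 Transcription done"): result = model.transcribe(audio, batch_size=16)`.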

Faster-Whisper

import os
import time

from faster_whisper import WhisperModel
from runpod.serverless.utils import rp_cleanup
from runpod.serverless.utils.rp_download import download_files_from_urls

# Load Faster-Whisper model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def run_faster_whisper_job(job):
    start_time = time.time()
    
    job_input = job['input']
    url = job_input.get('url', "")

    print(f"🚧 Downloading audio from {url}...")
    audio_path = download_files_from_urls(job['id'], [url])[0]
    print("✅ Audio downloaded")
    
    print("Transcribing...")
    segments, info = model.transcribe(audio_path, beam_size=5)
    
    output_segments = []
    for segment in segments:
        output_segments.append({
            "start": segment.start,
            "end": segment.end,
            "text": segment.text
        })
    
    end_time = time.time()
    time_s = (end_time - start_time)
    print(f"🎉 Transcription done: {time_s:.2f} s")
    
    output = {
        'detected_language': info.language,
        'segments': output_segments
    }
    
    # ✅ Safely delete the file after transcription
    try:
        if os.path.exists(audio_path):
            os.remove(audio_path)
            print(f"🗑️ Deleted {audio_path}")
        else:
            print("⚠️ File not found, skipping deletion")
    except Exception as e:
        print(f"❌ Error deleting file: {e}")

    rp_cleanup.clean(['input_objects'])

    return output

General Findings

  • WhisperX is significantly faster than Faster-Whisper.
  • WhisperX can process long-duration audio (3 hours), whereas Faster-Whisper fails with an unexplained runtime error. My guess is that Faster-Whisper needs more GPU memory to complete the job.
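If the 3-hour failures really are memory-related, one common workaround (an assumption on my part, not something verified in this thread) is to split the audio into overlapping chunks and transcribe each chunk separately, then stitch the segments back together. Computing the chunk boundaries is straightforward:

```python
def chunk_bounds(total_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Yield (start, end) windows covering [0, total_s], each at most
    chunk_s seconds long and overlapping the previous window by overlap_s
    seconds so words straddling a cut are not lost."""
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s
```

Segments from overlapping regions then need de-duplication when merging, which is the fiddly part of this approach.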

Accuracy Observations

  • WhisperX is less accurate than Faster-Whisper.
  • WhisperX drops more words than Faster-Whisper.

Optimization Questions

I was wondering which WhisperX parameters I could experiment with or fine-tune in order to:

  • Improve accuracy
  • Reduce missing words
  • Avoid significantly increasing processing time
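In case it helps, the levers I would try first, as a sketch only: the parameter names below are assumed from whisperx.load_model's options in recent versions, and the idea that the missing words stem from VAD segmentation is my assumption, not something verified here. Raising the decoder's beam_size matches the beam_size=5 used in the Faster-Whisper run, and loosening the VAD thresholds may keep quiet speech at segment edges from being dropped.

```python
# Assumed option names — verify against the whisperx version you run.
asr_options = {
    "beam_size": 5,  # match the beam_size=5 used with Faster-Whisper above
    "best_of": 5,
}
vad_options = {
    "vad_onset": 0.400,   # default ~0.500; lower = quiet speech detected sooner
    "vad_offset": 0.300,  # default ~0.363; lower = segments held open longer
}

# model = whisperx.load_model(
#     "large-v3", "cuda",
#     asr_options=asr_options, vad_options=vad_options,
# )
```

Both changes trade some speed for accuracy, so it is worth re-timing after each one.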

Thank you.
