Date: February 19, 2026
Hardware: NVIDIA GeForce RTX 4090 (24GB VRAM)
CUDA: 13.0
PyTorch: 2.10.0+cu130
Objective: Compare transformers vs faster-whisper STT performance on RTX 4090
Configuration:
- Model: openai/whisper-large-v3-turbo
- Device: CUDA (GPU)
- Audio durations: 2s, 5s, 10s
- Runs per duration: 3
| Audio Duration | Transformers Mean | Faster-Whisper Mean | GPU Speedup (Transformers) | GPU Speedup (Faster-Whisper) |
|---|---|---|---|---|
| 2s | 0.090s | 0.205s | 150x (vs 13.5s CPU) | 73x (vs 15s CPU) |
| 5s | 0.092s | 0.219s | 147x (vs 13.5s CPU) | 69x (vs 15s CPU) |
| 10s | 0.093s | 0.208s | 146x (vs 13.5s CPU) | 72x (vs 15s CPU) |
Memory Usage:
- Transformers: 830 MB initial load, minimal delta after
- Faster-whisper: 411 MB initial load, ~1 MB delta per run
✅ Transformers is 2.3x faster than faster-whisper on RTX 4090
- Transformers: ~90ms mean latency (consistent across all durations)
- Faster-whisper: ~210ms mean latency
✅ GPU acceleration is dramatic:
- CPU results: transformers ~13.5s, faster-whisper ~15s
- GPU results: transformers ~0.09s, faster-whisper ~0.21s
- 150x speedup for transformers on GPU vs CPU
Recommendation: Use transformers pipeline for STT with GPU. Provides best performance at ~90ms latency, well under 2-second target.
Objective: Measure vLLM Time-To-First-Token with and without prefix caching
Configuration:
- Endpoint: http://localhost:8000/v1
- Model: AMead10/Llama-3.2-3B-Instruct-AWQ
- Runs: 10 (5 cold start, 10 warm cache)
- Test queries: 5 different questions per run
Cold Start (No Cache):
| Metric | Value |
|---|---|
| Mean TTFT | 31.3ms |
| Median TTFT | 15.3ms |
| Min TTFT | 14.2ms |
| Max TTFT | 94.8ms |
Warm Cache (Prefix Caching):
| Metric | Value |
|---|---|
| Mean TTFT | 28.6ms |
| Median TTFT | 16.4ms |
| Min TTFT | 14.7ms |
| Max TTFT | 139.6ms |
Cache Speedup: 1.09x (marginal improvement)
✅ TTFT is exceptionally fast: Most queries complete in 14-18ms
✅ Consistent performance: After warmup, TTFT variance is minimal
❌ Prefix caching minimal benefit: Only 1.09x speedup - prefix caching not providing significant improvement for this model/workload
Token Generation Speed:
- Typical responses: 7-100 tokens
- Total generation time: 47-530ms depending on response length
- Token throughput: ~200-250 tokens/second
Streaming Benefits: With TTFT of ~15ms, streaming starts almost immediately after request, enabling:
- Sentence-based chunking to TTS
- Parallel LLM generation and TTS processing
- Minimal user-perceived latency
Recommendation: TTFT of 15-30ms is excellent for real-time voice applications. Prefix caching shows minimal benefit - consider disabling if it adds complexity.
Objective: Measure XTTS generation latency by sentence length
Configuration:
- Endpoint: http://localhost:8002
- Word count range: 5 - 50 (step: 10)
- Runs per length: 5
- Speaker: default speaker configuration
| Word Count | TTFB Mean | TTFB Median | Total Time Mean | Total Time Median | ms/word (TTFB) | ms/word (Total) |
|---|---|---|---|---|---|---|
| 5 | 0.967s | 0.635s | 1.948s | 1.790s | 194ms | 390ms |
| 15 | 0.634s | 0.630s | 2.996s | 3.006s | 42ms | 200ms |
| 25 | 0.644s | 0.637s | 5.070s | 5.283s | 26ms | 203ms |
| 35 | 0.700s | 0.689s | 6.612s | 6.604s | 20ms | 189ms |
| 45 | 0.699s | 0.694s | 8.481s | 8.468s | 16ms | 188ms |
Overall Averages:
- TTFB per word: 59ms (average across all lengths)
- Total time per word: 234ms (average across all lengths)
✅ Time To First Byte is consistent: ~0.63-0.70s regardless of sentence length (after first-run warmup)
✅ Total generation time scales linearly: ~234ms per word on average
For a 10-word response:
- TTFB: ~0.63s (first audio chunk)
- Total: ~2.3s (complete generation)
- Streaming benefit: Can start playback 1.7s before generation completes
Notable: TTFB of ~630ms is acceptable for streaming pipeline. With proper sentence segmentation, first audio chunk arrives quickly enough to meet latency goals.
Objective: Measure how TTFT and generation time scale with conversation history length
Configuration:
- Endpoint: http://localhost:8000/v1
- Model: AMead10/Llama-3.2-3B-Instruct-AWQ
- Message counts tested: 10, 100, 300, 1000
- Runs per length: 5
- Varied queries to avoid KV cache hits
| Messages | Tokens | TTFT Median | TTFT Mean | Total Time Median | Total Time Mean | Tokens Generated |
|---|---|---|---|---|---|---|
| 10 | 242 | 19ms | 91ms | 400ms | 438ms | 57-100 |
| 100 | 2,719 | 31ms | 62ms | 445ms | 443ms | 72-100 |
| 300 | 8,314 | 72ms | 141ms | 567ms | 613ms | 79-100 |
| 1000 | 27,897 | 180ms | 541ms | 919ms | 1,264ms | 89-100 |
✅ TTFT scales predictably with context: 5.9x increase from 10 to 1000 messages (19ms median → 180ms median)
✅ Total time scales slower than TTFT: 2.3x increase (400ms → 919ms)
✅ Responses are unique: Different outputs confirmed per run - not hitting cache
Scaling Analysis:
- 100 messages (typical Discord chat): 31ms TTFT, 445ms total ✅
- 300 messages (long conversation): 72ms TTFT, 567ms total ✅
- 1000 messages (very long): 180ms TTFT, 919ms total
⚠️
Why it's fast:
- AWQ Quantization: 4-bit quantized model reduces memory bandwidth
- Small model size: 3B parameters vs typical 7B/8B models
- vLLM optimizations: PagedAttention, continuous batching, GPU kernels
Context impact:
- Each 100 messages adds ~15-20ms TTFT
- Generation speed (tokens/sec) remains constant regardless of context
- Difference between TTFT and total time grows with context (more prompt processing)
Recommendation: For Discord voice chat:
- Keep conversations under 300 messages for optimal performance (<70ms TTFT)
- Implement conversation pruning for very long chats
- AWQ quantization is excellent for low-latency applications on RTX 4090
Scenario: 10-word response with 100-message history (typical Discord chat)
| Component | Latency | Notes |
|---|---|---|
| STT (Transformers GPU) | 0.09s | ✅ Very fast |
| LLM TTFT (100 msgs) | 0.031s | ✅ Exceptionally fast (31ms median) |
| TTS TTFB (10 words) | ~0.63s | |
| Audio Startup | 1.01s | |
| Sequential Total | 1.76s | ✅ Under 2s target! |
Scenario: Long conversation with 1000 messages
| Component | Latency | Notes |
|---|---|---|
| STT (Transformers GPU) | 0.09s | ✅ Very fast |
| LLM TTFT (1000 msgs) | 0.180s | |
| TTS TTFB (10 words) | ~0.63s | |
| Audio Startup | 1.01s | |
| Sequential Total | 1.91s | ✅ Still under 2s target! |
With overlapping execution (LLM streaming → TTS → Playback):
Typical conversation (100 messages):
- User stops speaking: STT starts
- +0.09s: STT complete, LLM starts generating
- +0.12s: First sentence from LLM (31ms TTFT + ~20ms to complete sentence)
- +0.75s: First audio chunk from TTS (0.63s TTFB)
- +0.75s: Start Discord playback (parallel with TTS generation)
- +1.76s: First audio plays to user (0.75s + 1.01s playback startup)
Optimized Total: ~1.76s for first audio ✅
Long conversation (1000 messages):
- First audio plays at +1.91s (still under 2s target) ✅
- ✅ Use Transformers STT with GPU - 90ms is excellent
- ✅ Use AWQ quantized models - 3B AWQ provides 31-180ms TTFT depending on context
- ✅ Implement conversation pruning - Keep history under 300 messages for <70ms TTFT
- ✅ Implement sentence-based streaming - Start TTS as soon as first complete sentence arrives
- ✅ Parallel execution - Don't wait for full TTS generation before starting playback
⚠️ Monitor conversation length - Use sliding window or summarization for very long chats⚠️ Consider pre-warming audio stream - If possible, initialize FFmpeg early to avoid 1s startup penalty
Goal: <2 second end-to-end latency
| Metric | Target | Measured/Estimated | Status |
|---|---|---|---|
| STT | <200ms | 90ms | ✅ PASS |
| LLM TTFT (100 msgs) | <300ms | 31ms | ✅ PASS |
| LLM TTFT (1000 msgs) | <300ms | 180ms | ✅ PASS |
| TTS TTFB | <1000ms | 630ms | ✅ PASS |
| Audio Startup | <500ms | 1010ms | ❌ FAIL |
| Total (100 msgs) | <2000ms | ~1760ms | ✅ PASS |
| Total (1000 msgs) | <2000ms | ~1910ms | ✅ PASS |
With the streaming architecture, the <2 second latency target is achievable ✅
All benchmarks complete! ✅
- STT: 90ms ✅
- LLM TTFT: 31-180ms (context-dependent) ✅
- TTS TTFB: 630ms ✅
- Audio Startup: 1010ms (unavoidable FFmpeg overhead)
Context length impact validated:
- 100 messages (typical): 1.76s total latency ✅
- 1000 messages (very long): 1.91s total latency ✅
Total end-to-end latency: ~1.76-1.91s - well under the 2-second target!