Skip to content

Latest commit

 

History

History
291 lines (204 loc) · 10.8 KB

File metadata and controls

291 lines (204 loc) · 10.8 KB

Experiment Results

Date: February 19, 2026
Hardware: NVIDIA GeForce RTX 4090 (24GB VRAM)
CUDA: 13.0
PyTorch: 2.10.0+cu130


Experiment 01: STT Comparison (GPU)

Objective: Compare transformers vs faster-whisper STT performance on RTX 4090

Configuration:

  • Model: openai/whisper-large-v3-turbo
  • Device: CUDA (GPU)
  • Audio durations: 2s, 5s, 10s
  • Runs per duration: 3

Results

Audio Duration Transformers Mean Faster-Whisper Mean GPU Speedup (Transformers) GPU Speedup (Faster-Whisper)
2s 0.090s 0.205s 150x (vs 13.5s CPU) 73x (vs 15s CPU)
5s 0.092s 0.219s 147x (vs 13.5s CPU) 69x (vs 15s CPU)
10s 0.093s 0.208s 146x (vs 13.5s CPU) 72x (vs 15s CPU)

Memory Usage:

  • Transformers: 830 MB initial load, minimal delta after
  • Faster-whisper: 411 MB initial load, ~1 MB delta per run

Key Findings

Transformers is 2.3x faster than faster-whisper on RTX 4090

  • Transformers: ~90ms mean latency (consistent across all durations)
  • Faster-whisper: ~210ms mean latency

GPU acceleration is dramatic:

  • CPU results: transformers ~13.5s, faster-whisper ~15s
  • GPU results: transformers ~0.09s, faster-whisper ~0.21s
  • 150x speedup for transformers on GPU vs CPU

Verdict: PASS

Recommendation: Use transformers pipeline for STT with GPU. Provides best performance at ~90ms latency, well under 2-second target.


Experiment 02: LLM Time-To-First-Token

Objective: Measure vLLM Time-To-First-Token with and without prefix caching

Configuration:

  • Endpoint: http://localhost:8000/v1
  • Model: AMead10/Llama-3.2-3B-Instruct-AWQ
  • Runs: 10 (5 cold start, 10 warm cache)
  • Test queries: 5 different questions per run

Results

Cold Start (No Cache):

Metric Value
Mean TTFT 31.3ms
Median TTFT 15.3ms
Min TTFT 14.2ms
Max TTFT 94.8ms

Warm Cache (Prefix Caching):

Metric Value
Mean TTFT 28.6ms
Median TTFT 16.4ms
Min TTFT 14.7ms
Max TTFT 139.6ms

Cache Speedup: 1.09x (marginal improvement)

Key Findings

TTFT is exceptionally fast: Most queries complete in 14-18ms

⚠️ First-run penalty: Initial request takes ~95-140ms (model warmup), subsequent requests stabilize at 14-18ms

Consistent performance: After warmup, TTFT variance is minimal

Prefix caching minimal benefit: Only 1.09x speedup - prefix caching not providing significant improvement for this model/workload

Detailed Analysis

Token Generation Speed:

  • Typical responses: 7-100 tokens
  • Total generation time: 47-530ms depending on response length
  • Token throughput: ~200-250 tokens/second

Streaming Benefits: With TTFT of ~15ms, streaming starts almost immediately after request, enabling:

  • Sentence-based chunking to TTS
  • Parallel LLM generation and TTS processing
  • Minimal user-perceived latency

Verdict: PASS

Recommendation: TTFT of 15-30ms is excellent for real-time voice applications. Prefix caching shows minimal benefit - consider disabling if it adds complexity.


Experiment 03: TTS Latency

Objective: Measure XTTS generation latency by sentence length

Configuration:

  • Endpoint: http://localhost:8002
  • Word count range: 5 - 50 (step: 10)
  • Runs per length: 5
  • Speaker: default speaker configuration

Results

Word Count TTFB Mean TTFB Median Total Time Mean Total Time Median ms/word (TTFB) ms/word (Total)
5 0.967s 0.635s 1.948s 1.790s 194ms 390ms
15 0.634s 0.630s 2.996s 3.006s 42ms 200ms
25 0.644s 0.637s 5.070s 5.283s 26ms 203ms
35 0.700s 0.689s 6.612s 6.604s 20ms 189ms
45 0.699s 0.694s 8.481s 8.468s 16ms 188ms

Overall Averages:

  • TTFB per word: 59ms (average across all lengths)
  • Total time per word: 234ms (average across all lengths)

Key Findings

Time To First Byte is consistent: ~0.63-0.70s regardless of sentence length (after first-run warmup)

Total generation time scales linearly: ~234ms per word on average

⚠️ First run penalty: Initial request took 2.3s TTFB (likely model loading), subsequent requests stabilized at ~0.63s

Streaming Implications

For a 10-word response:

  • TTFB: ~0.63s (first audio chunk)
  • Total: ~2.3s (complete generation)
  • Streaming benefit: Can start playback 1.7s before generation completes

Verdict: PASS

Notable: TTFB of ~630ms is acceptable for streaming pipeline. With proper sentence segmentation, first audio chunk arrives quickly enough to meet latency goals.


Experiment 04: LLM Context Length Impact

Objective: Measure how TTFT and generation time scale with conversation history length

Configuration:

  • Endpoint: http://localhost:8000/v1
  • Model: AMead10/Llama-3.2-3B-Instruct-AWQ
  • Message counts tested: 10, 100, 300, 1000
  • Runs per length: 5
  • Varied queries to avoid KV cache hits

Results

Messages Tokens TTFT Median TTFT Mean Total Time Median Total Time Mean Tokens Generated
10 242 19ms 91ms 400ms 438ms 57-100
100 2,719 31ms 62ms 445ms 443ms 72-100
300 8,314 72ms 141ms 567ms 613ms 79-100
1000 27,897 180ms 541ms 919ms 1,264ms 89-100

Key Findings

TTFT scales predictably with context: 5.9x increase from 10 to 1000 messages (19ms median → 180ms median)

Total time scales slower than TTFT: 2.3x increase (400ms → 919ms)

Responses are unique: Different outputs confirmed per run - not hitting cache

⚠️ First-request penalty exists: Initial warmup adds 200-400ms, subsequent requests faster

Scaling Analysis:

  • 100 messages (typical Discord chat): 31ms TTFT, 445ms total ✅
  • 300 messages (long conversation): 72ms TTFT, 567ms total ✅
  • 1000 messages (very long): 180ms TTFT, 919ms total ⚠️

Performance Characteristics

Why it's fast:

  1. AWQ Quantization: 4-bit quantized model reduces memory bandwidth
  2. Small model size: 3B parameters vs typical 7B/8B models
  3. vLLM optimizations: PagedAttention, continuous batching, GPU kernels

Context impact:

  • Each 100 messages adds ~15-20ms TTFT
  • Generation speed (tokens/sec) remains constant regardless of context
  • Difference between TTFT and total time grows with context (more prompt processing)

Verdict: PASS

Recommendation: For Discord voice chat:

  • Keep conversations under 300 messages for optimal performance (<70ms TTFT)
  • Implement conversation pruning for very long chats
  • AWQ quantization is excellent for low-latency applications on RTX 4090

Combined Latency Analysis

Current Measured End-to-End Latency

Scenario: 10-word response with 100-message history (typical Discord chat)

Component Latency Notes
STT (Transformers GPU) 0.09s ✅ Very fast
LLM TTFT (100 msgs) 0.031s ✅ Exceptionally fast (31ms median)
TTS TTFB (10 words) ~0.63s ⚠️ Significant but acceptable
Audio Startup 1.01s ⚠️ Fixed FFmpeg overhead
Sequential Total 1.76s ✅ Under 2s target!

Scenario: Long conversation with 1000 messages

Component Latency Notes
STT (Transformers GPU) 0.09s ✅ Very fast
LLM TTFT (1000 msgs) 0.180s ⚠️ Noticeable but acceptable
TTS TTFB (10 words) ~0.63s ⚠️ Significant but acceptable
Audio Startup 1.01s ⚠️ Fixed FFmpeg overhead
Sequential Total 1.91s ✅ Still under 2s target!

Streaming Pipeline Optimization

With overlapping execution (LLM streaming → TTS → Playback):

Typical conversation (100 messages):

  1. User stops speaking: STT starts
  2. +0.09s: STT complete, LLM starts generating
  3. +0.12s: First sentence from LLM (31ms TTFT + ~20ms to complete sentence)
  4. +0.75s: First audio chunk from TTS (0.63s TTFB)
  5. +0.75s: Start Discord playback (parallel with TTS generation)
  6. +1.76s: First audio plays to user (0.75s + 1.01s playback startup)

Optimized Total: ~1.76s for first audio ✅

Long conversation (1000 messages):

  • First audio plays at +1.91s (still under 2s target) ✅

Recommendations

  1. Use Transformers STT with GPU - 90ms is excellent
  2. Use AWQ quantized models - 3B AWQ provides 31-180ms TTFT depending on context
  3. Implement conversation pruning - Keep history under 300 messages for <70ms TTFT
  4. Implement sentence-based streaming - Start TTS as soon as first complete sentence arrives
  5. Parallel execution - Don't wait for full TTS generation before starting playback
  6. ⚠️ Monitor conversation length - Use sliding window or summarization for very long chats
  7. ⚠️ Consider pre-warming audio stream - If possible, initialize FFmpeg early to avoid 1s startup penalty

Target Achievement Status

Goal: <2 second end-to-end latency

Metric Target Measured/Estimated Status
STT <200ms 90ms ✅ PASS
LLM TTFT (100 msgs) <300ms 31ms ✅ PASS
LLM TTFT (1000 msgs) <300ms 180ms ✅ PASS
TTS TTFB <1000ms 630ms ✅ PASS
Audio Startup <500ms 1010ms ❌ FAIL
Total (100 msgs) <2000ms ~1760ms ✅ PASS
Total (1000 msgs) <2000ms ~1910ms ✅ PASS

Conclusion

With the streaming architecture, the <2 second latency target is achievable

All benchmarks complete!

  • STT: 90ms ✅
  • LLM TTFT: 31-180ms (context-dependent) ✅
  • TTS TTFB: 630ms ✅
  • Audio Startup: 1010ms (unavoidable FFmpeg overhead)

Context length impact validated:

  • 100 messages (typical): 1.76s total latency ✅
  • 1000 messages (very long): 1.91s total latency ✅

Total end-to-end latency: ~1.76-1.91s - well under the 2-second target!