@Nsuccess
Description

This PR adds a Whisper STT (Speech-to-Text) extension for the TEN Framework using the faster-whisper library.

Closes #1969

Features

  • Complete ASR Implementation: Inherits from AsyncASRBaseExtension following TEN Framework patterns
  • Optimized Performance: Uses faster-whisper (up to 4x faster than openai/whisper)
  • Multiple Model Sizes: Support for tiny, base, small, medium, large-v1/v2/v3
  • CPU & GPU Support: Configurable device and compute types (int8, float16, float32)
  • Multi-Language: 99+ languages with automatic detection
  • Translation: Translate speech to English
  • VAD Filtering: Built-in voice activity detection using Silero VAD
  • Production-Ready: Auto-reconnection, audio dumping, standardized logging
  • Buffer Strategy: Keep mode with 10MB limit for timestamp accuracy
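
As a rough illustration of how the language, translation, and VAD features above map onto faster-whisper, here is a minimal sketch. The helper names `build_transcribe_kwargs` and `transcribe_file` are hypothetical and are not taken from this PR; only `WhisperModel` and its `transcribe()` arguments come from the faster-whisper library.

```python
from typing import Any, Dict, Optional


def build_transcribe_kwargs(
    language: Optional[str] = None,
    translate: bool = False,
    vad_filter: bool = True,
) -> Dict[str, Any]:
    """Map feature flags onto faster-whisper transcribe() arguments (hypothetical helper)."""
    return {
        # None lets the model detect the language automatically.
        "language": language,
        # "translate" produces English text; "transcribe" keeps the source language.
        "task": "translate" if translate else "transcribe",
        # Enables the built-in Silero VAD pass that skips non-speech audio.
        "vad_filter": vad_filter,
    }


def transcribe_file(path: str, model_size: str = "base") -> str:
    """Transcribe one audio file on CPU with int8 quantization (hypothetical helper)."""
    # Imported lazily so the helper above works without the dependency installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, **build_transcribe_kwargs())
    # segments is a generator; inference runs as it is consumed.
    return "".join(segment.text for segment in segments).strip()
```

Calling `transcribe_file("sample.wav")` would download the `base` model on first use, so it is best tried outside the unit tests.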

Implementation Details

Architecture

  • Extension: WhisperSTTExtension - Main ASR extension class
  • Client: WhisperClient - Handles faster-whisper model and inference
  • Config: WhisperSTTConfig - Pass-through params design for flexibility
  • Reconnection: Exponential backoff with configurable max attempts
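
The reconnection behaviour described above could look roughly like the following sketch. It is not the code in `reconnect_manager.py`; the function names, the base delay, and the cap are assumptions chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay before the retry that follows attempt `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


async def connect_with_retry(
    connect: Callable[[], Awaitable[None]],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 8.0,
) -> int:
    """Call `connect` until it succeeds; return the number of attempts used."""
    for attempt in range(max_attempts):
        try:
            await connect()
            return attempt + 1
        except Exception:
            # Give up and re-raise once the configured limit is reached.
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

With these assumed defaults the waits between attempts are 0.5 s, 1 s, 2 s, and 4 s, and the cap keeps later delays from growing without bound when `max_attempts` is raised.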

Files Added (14 files, 1,324 lines; 12 listed below)

  • whisper_stt_python/extension.py - Main extension implementation
  • whisper_stt_python/whisper_client.py - Faster-whisper client wrapper
  • whisper_stt_python/config.py - Configuration management
  • whisper_stt_python/reconnect_manager.py - Auto-reconnection logic
  • whisper_stt_python/addon.py - Extension entry point
  • whisper_stt_python/const.py - Constants
  • whisper_stt_python/manifest.json - Extension metadata
  • whisper_stt_python/property.json - Default configuration
  • whisper_stt_python/requirements.txt - Dependencies
  • whisper_stt_python/README.md - Comprehensive documentation
  • whisper_stt_python/tests/test_config.py - Config tests (10 tests)
  • whisper_stt_python/tests/test_extension.py - Extension tests (15 tests)

Testing

  • 25 Unit Tests: Mock-based tests covering configuration and extension behavior
  • Config Tests: Default values, JSON parsing, masking of sensitive values, language normalization
  • Extension Tests: Initialization, connection, audio sending, finalize, callbacks
  • No Real API Calls: All tests use mocks for fast, reliable execution
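
To show the general shape of a mock-based test like those described above, here is a minimal sketch. `TinyTranscriber` is a hypothetical stand-in and not the PR's `WhisperClient`; the mock only imitates the `(segments, info)` tuple that faster-whisper's `transcribe()` returns.

```python
from unittest.mock import MagicMock


class TinyTranscriber:
    """Hypothetical stand-in for a client that wraps a speech model."""

    def __init__(self, model):
        self.model = model

    def transcribe(self, audio) -> str:
        segments, _info = self.model.transcribe(audio, vad_filter=True)
        return " ".join(s.text.strip() for s in segments)


def test_transcribe_joins_segments():
    # The mock replaces the real model, so no weights are loaded.
    model = MagicMock()
    model.transcribe.return_value = (
        [MagicMock(text=" hello"), MagicMock(text=" world")],
        MagicMock(language="en"),
    )
    result = TinyTranscriber(model).transcribe(b"\x00\x00")
    assert result == "hello world"
    model.transcribe.assert_called_once_with(b"\x00\x00", vad_filter=True)
```

Because the model is injected, the test runs in milliseconds and never touches the network or the GPU.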

Configuration Example

{
  "params": {
    "model": "base",
    "device": "cpu",
    "compute_type": "int8",
    "language": "en",
    "task": "transcribe",
    "sample_rate": 16000
  }
}
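
One way a pass-through `params` block like the one above might be read is sketched below. `SketchConfig` is a hypothetical class, not the PR's `WhisperSTTConfig`; it simply keeps whatever keys appear under `params` without validating them.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchConfig:
    """Hypothetical config holder that passes params through unchanged."""

    params: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_json(cls, text: str) -> "SketchConfig":
        data = json.loads(text)
        # Copy the params mapping so later edits do not touch the parsed input.
        return cls(params=dict(data.get("params", {})))

    def get(self, key: str, default: Any = None) -> Any:
        return self.params.get(key, default)


raw = """
{
  "params": {
    "model": "base",
    "device": "cpu",
    "compute_type": "int8",
    "language": "en",
    "task": "transcribe",
    "sample_rate": 16000
  }
}
"""

config = SketchConfig.from_json(raw)
print(config.get("model"), config.get("sample_rate"))
```

Keeping `params` as an open mapping means new model options can be forwarded without changing the config class, at the cost of catching typos in key names later.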

Nsuccess added 3 commits January 14, 2026 16:46
- Implements text-to-speech using NVIDIA Riva Speech Skills
- Supports streaming synthesis with gRPC
- Includes comprehensive tests and documentation
- Follows TTS2 interface pattern

Closes TEN-framework#1964
- Implements text-to-speech using Speechmatics TTS API
- Supports low-latency streaming synthesis (sub-150ms)
- Includes 4 voice options (UK and US English)
- Comprehensive tests and documentation
- Follows TTS2 HTTP interface pattern

Closes TEN-framework#1965
- Implements ASR extension for OpenAI Whisper model
- Uses faster-whisper library (4x faster than openai/whisper)
- Supports all Whisper model sizes (tiny to large-v3)
- CPU and GPU execution with multiple compute types
- 99+ languages support with auto-detection
- Translation to English capability
- VAD filtering with Silero VAD
- Auto-reconnection with exponential backoff
- Audio dumping for debugging
- Keep mode buffer strategy for timestamp accuracy
- 25 unit tests with mock-based testing
- Comprehensive documentation and examples

Closes TEN-framework#1969
Development

Successfully merging this pull request may close these issues.

[2026NewYearChallenge 🏅] Create a Whisper STT Extension
