Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
# NVIDIA Riva TTS Extension - Implementation Details

## Overview

This document describes the implementation of the NVIDIA Riva TTS extension for TEN Framework. The extension provides high-quality, GPU-accelerated text-to-speech synthesis using NVIDIA Riva Speech Skills.

## Architecture

### Component Structure

```
nvidia_riva_tts_python/
├── extension.py # Main extension class
├── riva_tts.py # Riva client implementation
├── config.py # Configuration model
├── addon.py # Extension registration
├── manifest.json # Extension metadata
├── property.json # Default properties
├── requirements.txt # Python dependencies
├── README.md # User documentation
└── tests/ # Test suite
├── test_config.py
└── test_extension.py
```

### Class Hierarchy

```
AsyncTTSExtension (base class from ten_ai_base)
└── NvidiaRivaTTSExtension
└── uses NvidiaRivaTTSClient
└── uses riva.client.SpeechSynthesisService
```

## Implementation Details

### 1. Extension Class (`extension.py`)

The `NvidiaRivaTTSExtension` class inherits from `AsyncTTSExtension` and implements the required abstract methods:

- **`create_config()`**: Parses JSON configuration into `NvidiaRivaTTSConfig`
- **`create_client()`**: Instantiates `NvidiaRivaTTSClient` with configuration
- **`vendor()`**: Returns "nvidia_riva" as the vendor identifier
- **`synthesize_audio_sample_rate()`**: Returns the configured sample rate

### 2. Client Implementation (`riva_tts.py`)

The `NvidiaRivaTTSClient` class handles the actual TTS synthesis:

#### Initialization
- Creates Riva Auth object with server URI and SSL settings
- Initializes `SpeechSynthesisService` for TTS operations
- Validates server connectivity

#### Synthesis Method
```python
async def synthesize(self, text: str, request_id: str) -> AsyncIterator[bytes]
```

**Flow:**
1. Validates input text (non-empty)
2. Calls `tts_service.synthesize_online()` for streaming synthesis
3. Iterates through audio chunks from Riva
4. Converts audio data to PCM bytes
5. Yields audio chunks for streaming playback
6. Handles cancellation requests

**Key Features:**
- Streaming synthesis for low latency
- Cancellation support via `_is_cancelled` flag
- Comprehensive logging at each step
- Error handling with detailed messages

### 3. Configuration (`config.py`)

The `NvidiaRivaTTSConfig` class extends `AsyncTTSConfig`:

**Required Parameters:**
- `server`: Riva server address (host:port)
- `language_code`: Language identifier (e.g., "en-US")
- `voice_name`: Voice identifier (e.g., "English-US.Female-1")

**Optional Parameters:**
- `sample_rate`: Audio sample rate in Hz (default: 16000)
- `use_ssl`: Enable SSL for gRPC (default: false)

**Validation:**
- Ensures all required parameters are present
- Validates parameter types and formats

### 4. Addon Registration (`addon.py`)

Registers the extension with TEN Framework using the `@register_addon_as_extension` decorator.

## Integration with TEN Framework

### TTS Interface Compliance

The extension implements the standard TEN Framework TTS interface defined in `ten_ai_base/api/tts-interface.json`:

- **Input**: Text data via TEN data messages
- **Output**: PCM audio frames via TEN audio frame messages
- **Control**: Start/stop/cancel commands via TEN commands

### Message Flow

```
1. Text Input → Extension receives text data
2. Configuration → Loads voice, language, sample rate
3. Synthesis → Calls Riva API with streaming
4. Audio Output → Yields PCM audio chunks
5. Completion → Signals end of synthesis
```

## NVIDIA Riva Integration

### gRPC API Usage

The extension uses the official `nvidia-riva-client` Python package which provides:

- **Auth**: Authentication and connection management
- **SpeechSynthesisService**: TTS API wrapper
- **AudioEncoding**: Audio format specifications

### Streaming vs Batch

The implementation uses **streaming synthesis** (`synthesize_online`) for:
- Lower latency (first audio chunk arrives quickly)
- Better user experience in real-time applications
- Efficient memory usage

Alternative batch mode (`synthesize`) is available but not used by default.

### Audio Format

- **Encoding**: LINEAR_PCM (16-bit signed integer)
- **Sample Rate**: Configurable (default 16000 Hz)
- **Channels**: Mono
- **Byte Order**: Little-endian

## Error Handling

### Initialization Errors
- Server unreachable → RuntimeError with connection details
- Invalid credentials → Authentication error
- Missing dependencies → Import error

### Runtime Errors
- Empty text → Warning logged, no synthesis
- Synthesis failure → RuntimeError with Riva error message
- Cancellation → Graceful stop, log cancellation

### Logging Strategy

- **INFO**: Initialization, configuration
- **DEBUG**: Synthesis progress, chunk details
- **WARN**: Empty text, unusual conditions
- **ERROR**: Failures, exceptions

## Testing

### Test Coverage

1. **Configuration Tests** (`test_config.py`)
- Valid configuration creation
- Missing required parameters
- Default values
- Validation logic

2. **Extension Tests** (`test_extension.py`)
- Extension initialization
- Config creation from JSON
- Sample rate retrieval
- Client creation

3. **Client Tests** (`test_extension.py`)
- Client initialization with mocked Riva
- Cancellation handling
- Empty text handling
- Synthesis with mocked responses

### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all tests
pytest nvidia_riva_tts_python/tests/ -v

# Run with coverage
pytest nvidia_riva_tts_python/tests/ --cov=nvidia_riva_tts_python
```

## Performance Considerations

### Latency
- **First chunk**: ~100-200ms (depends on text length and server)
- **Streaming**: Continuous audio delivery
- **GPU acceleration**: Significantly faster than CPU-only TTS

### Resource Usage
- **Memory**: Minimal (streaming mode)
- **Network**: gRPC connection to Riva server
- **CPU**: Low (Riva does GPU processing)

### Optimization Tips
1. Use streaming mode for real-time applications
2. Keep Riva server close to application (low network latency)
3. Reuse client connections (handled by extension)
4. Configure appropriate sample rate for use case

## Deployment

### Prerequisites
1. NVIDIA Riva server running (see README.md for setup)
2. Network connectivity to Riva server
3. Python 3.8+ with nvidia-riva-client

### Configuration Example

```json
{
"params": {
"server": "riva-server.example.com:50051",
"language_code": "en-US",
"voice_name": "English-US.Female-1",
"sample_rate": 22050,
"use_ssl": true
}
}
```

### Environment Variables

```bash
export NVIDIA_RIVA_SERVER=localhost:50051
```

## Future Enhancements

Potential improvements for future versions:

1. **SSML Support**: Full SSML tag support for advanced speech control
2. **Voice Cloning**: Custom voice model support
3. **Multi-language**: Automatic language detection
4. **Caching**: Cache frequently synthesized phrases
5. **Metrics**: Detailed performance metrics and monitoring
6. **Fallback**: Automatic fallback to alternative TTS if Riva unavailable

## References

- [NVIDIA Riva Documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)
- [Riva Python Client](https://pypi.org/project/nvidia-riva-client/)
- [TEN Framework TTS Interface](https://github.com/TEN-framework/ten-framework)
- [gRPC Python](https://grpc.io/docs/languages/python/)

## License

Apache 2.0 - See LICENSE file in the TEN Framework repository.

## Contributing

Contributions are welcome! Please:
1. Follow the existing code style
2. Add tests for new features
3. Update documentation
4. Submit PR to TEN Framework repository

93 changes: 93 additions & 0 deletions agents/ten_packages/extension/nvidia_riva_tts_python/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# NVIDIA Riva TTS Python Extension

This extension provides text-to-speech functionality using NVIDIA Riva Speech Skills.

## Features

- High-quality speech synthesis using NVIDIA Riva
- Support for multiple languages and voices
- Streaming and batch synthesis modes
- SSML support for advanced speech control
- GPU-accelerated inference for low latency

## Prerequisites

- NVIDIA Riva server running and accessible
- Python 3.8+
- nvidia-riva-client package

## Configuration

The extension can be configured through your property.json:

```json
{
"params": {
"server": "localhost:50051",
"language_code": "en-US",
"voice_name": "English-US.Female-1",
"sample_rate": 16000,
"use_ssl": false
}
}
```

### Configuration Options

**Parameters inside `params` object:**
- `server` (required): Riva server address (format: "host:port")
- `language_code` (required): Language code (e.g., "en-US", "es-ES")
- `voice_name` (required): Voice identifier (e.g., "English-US.Female-1")
- `sample_rate` (optional): Audio sample rate in Hz (default: 16000)
- `use_ssl` (optional): Use SSL for gRPC connection (default: false)

### Available Voices

Common voice names include:
- `English-US.Female-1`
- `English-US.Male-1`
- `English-GB.Female-1`
- `Spanish-US.Female-1`

Check your Riva server documentation for the full list of available voices.

## Setting up NVIDIA Riva Server

Follow the [NVIDIA Riva Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html) to set up a Riva server.

Quick setup with Docker:

```bash
# Download Riva Quick Start scripts
ngc registry resource download-version nvidia/riva/riva_quickstart:2.17.0

# Initialize and start Riva
cd riva_quickstart_v2.17.0
bash riva_init.sh
bash riva_start.sh
```

## Environment Variables

Set the Riva server address via environment variable:

```bash
export NVIDIA_RIVA_SERVER=localhost:50051
```

## Architecture

This extension follows the TEN Framework TTS extension pattern:

- `extension.py`: Main extension class
- `riva_tts.py`: Client implementation with Riva SDK integration
- `config.py`: Configuration model
- `addon.py`: Extension addon registration

## License

Apache 2.0

## Contributing

Contributions are welcome! Please submit issues and pull requests to the TEN Framework repository.
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#
# This file is part of TEN Framework, an open source project.
# Licensed under the Apache License, Version 2.0.
#
from . import addon

__all__ = ["addon"]
18 changes: 18 additions & 0 deletions agents/ten_packages/extension/nvidia_riva_tts_python/addon.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#
# This file is part of TEN Framework, an open source project.
# Licensed under the Apache License, Version 2.0.
#
from ten_runtime import (
Addon,
register_addon_as_extension,
TenEnv,
)


@register_addon_as_extension("nvidia_riva_tts_python")
class NvidiaRivaTTSExtensionAddon(Addon):
def on_create_instance(self, ten_env: TenEnv, name: str, context) -> None:
from .extension import NvidiaRivaTTSExtension

ten_env.log_info("NvidiaRivaTTSExtensionAddon on_create_instance")
ten_env.on_create_instance_done(NvidiaRivaTTSExtension(name), context)
Loading