# Speech-to-Speech (STS) with RAG

Execute a complete speech-to-speech workflow with knowledge base retrieval.

## Endpoint

```
POST /llm/sts
```

## Flow

```
Voice Input → STT (auto language detection) → RAG (Knowledge Base) → TTS → Voice Output
```

## Input

- **Voice note**: WhatsApp-compatible audio format (required)
- **Knowledge base IDs**: One or more knowledge bases for RAG (required)
- **Languages**: Input and output languages (optional; both default to Hindi)
- **Models**: STT, LLM, and TTS model selection (optional; STT and TTS default to Sarvam models, LLM to GPT-4o)

## Output

You will receive **3 callbacks** to your webhook URL:

1. **STT Callback** (intermediate): Transcribed text from the audio
2. **LLM Callback** (intermediate): RAG-enhanced response text
3. **TTS Callback** (final): Audio output plus response text

Each callback includes:
- The output from that step
- Token usage
- Latency information (check timestamps)

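The three callbacks can be told apart by their position and output type. A minimal Python dispatcher, assuming the field names shown in the example callbacks later in this document (the `handle_callback` helper and its routing labels are illustrative, not part of the API):

```python
def handle_callback(payload: dict) -> str:
    """Route an incoming STS webhook callback to a handling label."""
    if not payload.get("success"):
        return "error"
    data = payload.get("data", {})
    output = data.get("response", {}).get("output", {})
    # Only the final TTS callback carries audio output.
    if output.get("type") == "audio":
        return "final"
    # Intermediate callbacks are ordered: block 1 is STT, block 2 is the LLM.
    if data.get("block_index") == 1:
        return "stt"
    return "llm"
```

A real receiver would branch on these labels to, for example, log the transcription, stream the text reply, and deliver the final audio.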
## Supported Languages

### Primary Indian Languages
- English, Hindi, Hinglish (code-switching)
- Bengali, Kannada, Malayalam, Marathi
- Odia, Punjabi, Tamil, Telugu, Gujarati

### Additional Languages (Sarvam Saaras V3)
- Assamese, Urdu, Nepali
- Konkani, Kashmiri, Sindhi
- Sanskrit, Santali, Manipuri
- Bodo, Maithili, Dogri

**Total: 25 languages** with automatic language detection

## Available Models

### STT (Speech-to-Text)
- `saaras:v3` - Sarvam Saaras V3 (**default**; fast, automatic language detection, optimized for Indian languages)
- `gemini-2.5-pro` - Google Gemini 2.5 Pro

**Note:** Sarvam STT uses automatic language detection, so there is no need to specify the input language.

### LLM (RAG)
- `gpt-4o` - OpenAI GPT-4o (**default**; best quality)
- `gpt-4o-mini` - OpenAI GPT-4o Mini (faster, lower cost)

### TTS (Text-to-Speech)
- `bulbul-v3` - Sarvam Bulbul V3 (**default**; natural Indian voices, MP3 output)
- `gemini-2.5-pro-preview-tts` - Google Gemini 2.5 Pro (OGG OPUS output)

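Because every model field is optional, a request body only needs to carry the models you override. A sketch of a payload builder under that assumption (`build_sts_payload` is an illustrative helper; field names follow the example request in this document):

```python
def build_sts_payload(audio_b64, kb_ids, *, input_language="hindi",
                      output_language="hindi", stt_model=None,
                      llm_model=None, tts_model=None,
                      callback_url="https://your-app.com/webhook"):
    """Assemble a /llm/sts request body, omitting model fields so the
    server defaults (saaras:v3 / gpt-4o / bulbul-v3) apply."""
    payload = {
        "audio": {
            "type": "audio",
            "content": {"format": "base64", "value": audio_b64,
                        "mime_type": "audio/ogg"},
        },
        "knowledge_base_ids": kb_ids,
        "input_language": input_language,
        "output_language": output_language,
        "callback_url": callback_url,
    }
    # Only include model overrides that were explicitly chosen.
    for key, val in (("stt_model", stt_model), ("llm_model", llm_model),
                     ("tts_model", tts_model)):
        if val is not None:
            payload[key] = val
    return payload
```

With no overrides the payload matches the curl example below; passing, say, `tts_model="bulbul-v3"` simply adds that one field.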
## Edge Cases & Error Handling

### Empty STT Output
If speech-to-text returns an empty or blank transcription:
- The chain fails immediately
- Error message: "STT returned no transcription"
- No subsequent blocks are executed

### Audio Size Limit
WhatsApp's audio limit is 16 MB:
- TTS providers may fail if the output exceeds this limit
- The error is caught and reported in the callback
- Consider shorter responses or audio compression

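Since audio travels as base64, the decoded size is what counts against the limit. A small check you can run on either side of the API, assuming the 16 MB figure above (the helper names and constant are illustrative):

```python
import base64

WHATSAPP_AUDIO_LIMIT = 16 * 1024 * 1024  # 16 MB, per the limit stated above


def decoded_audio_size(b64_value: str) -> int:
    """Byte size of a base64-encoded audio payload once decoded."""
    return len(base64.b64decode(b64_value))


def within_whatsapp_limit(b64_value: str) -> bool:
    """True if the decoded audio fits under the WhatsApp size limit."""
    return decoded_audio_size(b64_value) <= WHATSAPP_AUDIO_LIMIT
```

Checking the decoded size before sending avoids a round trip that would only surface the failure in a callback.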
### Invalid Audio Format
If the input audio format is unsupported:
- The STT provider fails with a format error
- The error is reported in the callback
- Supported formats: MP3, WAV, OGG, OPUS, M4A

### Provider Failures
Each block has independent error handling:
- STT fails → chain stops, STT error reported
- LLM fails → chain stops, RAG error reported
- TTS fails → chain stops, TTS error reported

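Because the chain stops at the first failed block, a receiver can derive the overall state from the callbacks seen so far. A sketch under the field names used in the example callbacks (`chain_status` and its labels are illustrative):

```python
def chain_status(callbacks):
    """Report chain state from the ordered callbacks received so far.
    One success=false callback is terminal: the chain has stopped."""
    for cb in callbacks:
        if not cb.get("success", False):
            return "failed"
    output_types = [
        cb.get("data", {}).get("response", {}).get("output", {}).get("type")
        for cb in callbacks
    ]
    if "audio" in output_types:
        return "complete"  # the final TTS callback carries audio output
    return "in_progress"
```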
## Example Request

```bash
curl -X POST https://api.kaapi.ai/llm/sts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "audio": {
    "type": "audio",
    "content": {
      "format": "base64",
      "value": "base64_encoded_audio_data",
      "mime_type": "audio/ogg"
    }
  },
  "knowledge_base_ids": ["kb_abc123"],
  "input_language": "hindi",
  "output_language": "english",
  "callback_url": "https://your-app.com/webhook"
}
EOF
```

**Note:** `stt_model`, `llm_model`, and `tts_model` are optional and fall back to the defaults listed above when not specified.

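The same request in Python, using only the standard library. The body mirrors the curl example above; `encode_voice_note` and `send_sts_request` are illustrative helper names, not part of any SDK:

```python
import base64
import json
import urllib.request


def encode_voice_note(path: str) -> str:
    """Base64-encode a local audio file for the request body."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def send_sts_request(api_key: str, audio_b64: str, kb_ids: list,
                     callback_url: str) -> dict:
    """POST the same body as the curl example (error handling kept minimal)."""
    body = json.dumps({
        "audio": {"type": "audio",
                  "content": {"format": "base64", "value": audio_b64,
                              "mime_type": "audio/ogg"}},
        "knowledge_base_ids": kb_ids,
        "input_language": "hindi",
        "output_language": "english",
        "callback_url": callback_url,
    }).encode()
    req = urllib.request.Request(
        "https://api.kaapi.ai/llm/sts", data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Typical usage: `send_sts_request(key, encode_voice_note("note.ogg"), ["kb_abc123"], "https://your-app.com/webhook")`.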
## Example Callbacks

### Callback 1: STT Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 1,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "stt_xyz789",
      "provider": "sarvamai-native",
      "model": "saarika:v1",
      "output": {
        "type": "text",
        "content": {
          "value": "नमस्ते, मुझे अपने अकाउंट के बारे में जानकारी चाहिए"
        }
      }
    },
    "usage": {
      "input_tokens": 0,
      "output_tokens": 12,
      "total_tokens": 12
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "input_language": "hi-IN"
  }
}
```

### Callback 2: LLM Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 2,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "chatcmpl_abc123",
      "provider": "openai",
      "model": "gpt-4o",
      "output": {
        "type": "text",
        "content": {
          "value": "आपके अकाउंट में कुल बैलेंस ₹5,000 है। पिछले महीने में 3 ट्रांजैक्शन हुए हैं।"
        }
      }
    },
    "usage": {
      "input_tokens": 150,
      "output_tokens": 45,
      "total_tokens": 195
    }
  },
  "metadata": {
    "speech_to_speech": true
  }
}
```

### Callback 3: TTS Output (Final)
```json
{
  "success": true,
  "data": {
    "response": {
      "provider_response_id": "tts_def456",
      "provider": "sarvamai-native",
      "model": "bulbul:v1",
      "output": {
        "type": "audio",
        "content": {
          "format": "base64",
          "value": "base64_encoded_audio_output",
          "mime_type": "audio/ogg"
        }
      }
    },
    "usage": {
      "input_tokens": 15,
      "output_tokens": 0,
      "total_tokens": 15
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "output_language": "hi-IN"
  }
}
```

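To play or forward the final voice reply, decode the base64 audio from the TTS callback. A sketch following the field names in Callback 3 above (`save_final_audio` is an illustrative helper):

```python
import base64


def save_final_audio(callback: dict, path: str = "reply.ogg") -> str:
    """Decode the audio from the final TTS callback and write it to disk."""
    content = callback["data"]["response"]["output"]["content"]
    if content["format"] != "base64":
        raise ValueError("expected base64-encoded audio content")
    with open(path, "wb") as f:
        f.write(base64.b64decode(content["value"]))
    return path
```

The `mime_type` field (e.g. `audio/ogg` for Bulbul, OGG OPUS for Gemini) tells you which file extension and player to use.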
## Latency Tracking

Calculate latency from the callback timestamps:
- **STT latency**: Time from request to the first callback
- **LLM latency**: Time between the first and second callbacks
- **TTS latency**: Time between the second and third callbacks
- **Total latency**: Time from request to the final callback

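The four figures above are simple differences of four instants. A sketch that uses locally recorded receipt times for each callback (an approximation; the callback timestamps themselves would exclude delivery delay):

```python
def compute_latencies(t_request, t_stt, t_llm, t_tts):
    """Per-stage latencies from the request time and the arrival time of
    each of the three callbacks (all in seconds, e.g. from time.monotonic())."""
    return {
        "stt": t_stt - t_request,    # request -> first callback
        "llm": t_llm - t_stt,        # first -> second callback
        "tts": t_tts - t_llm,        # second -> third callback
        "total": t_tts - t_request,  # request -> final callback
    }
```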
## Best Practices

1. **Language Consistency**: If you are not translating, keep `input_language` equal to `output_language`
2. **Model Selection**: Prefer Sarvam models for Indian languages (faster, better quality)
3. **Knowledge Base**: Ensure the KB is properly indexed and relevant to the expected queries
4. **Error Handling**: Implement retry logic for transient provider failures
5. **Webhook Security**: Validate webhook signatures and use HTTPS
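
The retry advice in point 4 can be as simple as exponential backoff around the request call. A generic sketch (what counts as "transient" depends on the actual provider errors you observe; `with_retries` is an illustrative helper):

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...)."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** i))
```

For example, `with_retries(lambda: send_sts_request(...))` would absorb a single transient failure without bothering the caller.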