
Commit 5b9a4e9

feat: basic speech-to-speech impl on top of llm_chain

1 parent 4624f55 commit 5b9a4e9

File tree

6 files changed: +1127 −0 lines changed
Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
# Speech-to-Speech (STS) with RAG

Execute a complete speech-to-speech workflow with knowledge base retrieval.

## Endpoint

```
POST /llm/sts
```

## Flow

```
Voice Input → STT (auto language) → RAG (Knowledge Base) → TTS → Voice Output
```

## Input

- **Voice note**: WhatsApp-compatible audio format (required)
- **Knowledge base IDs**: One or more knowledge bases for RAG (required)
- **Languages**: Input and output languages (optional; defaults to Hindi)
- **Models**: STT, LLM, and TTS model selection (optional; defaults to Sarvam)

## Output

You will receive **3 callbacks** to your webhook URL:

1. **STT Callback** (Intermediate): Transcribed text from the audio
2. **LLM Callback** (Intermediate): RAG-enhanced response text
3. **TTS Callback** (Final): Audio output plus response text

Each callback includes:
- The output from that step
- Token usage
- Latency information (check the timestamps)

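A minimal webhook receiver can route the three callbacks by stage. This sketch is illustrative, not part of the API; the field names (`data.block_index`) follow the example callback payloads in this document, and the final TTS callback is the one that carries no `block_index`:

```python
# Route an STS callback payload to the stage it belongs to.
# Field names follow the example callbacks in this document;
# the dispatcher itself is an illustrative sketch.

def route_sts_callback(payload: dict) -> str:
    """Return which stage a callback belongs to: 'stt', 'llm', or 'tts'."""
    data = payload.get("data", {})
    block_index = data.get("block_index")
    if block_index == 1:
        return "stt"  # intermediate: transcribed text
    if block_index == 2:
        return "llm"  # intermediate: RAG response text
    return "tts"      # final callback carries the audio output
```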
## Supported Languages

### Primary Indian Languages
- English, Hindi, Hinglish (code-switching)
- Bengali, Kannada, Malayalam, Marathi
- Odia, Punjabi, Tamil, Telugu, Gujarati

### Additional Languages (Sarvam Saaras V3)
- Assamese, Urdu, Nepali
- Konkani, Kashmiri, Sindhi
- Sanskrit, Santali, Manipuri
- Bodo, Maithili, Dogri

**Total: 25 languages** with automatic language detection

## Available Models

### STT (Speech-to-Text)
- `saaras:v3` - Sarvam Saaras V3 (**default**, fast, auto language detection, optimized for Indian languages)
- `gemini-2.5-pro` - Google Gemini 2.5 Pro

**Note:** Sarvam STT uses automatic language detection, so there is no need to specify an input language.

### LLM (RAG)
- `gpt-4o` - OpenAI GPT-4o (**default**, best quality)
- `gpt-4o-mini` - OpenAI GPT-4o Mini (faster, lower cost)

### TTS (Text-to-Speech)
- `bulbul-v3` - Sarvam Bulbul V3 (**default**, natural Indian voices, MP3 output)
- `gemini-2.5-pro-preview-tts` - Google Gemini 2.5 Pro (OGG OPUS output)

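Model choices can be overridden per request. The field names (`stt_model`, `llm_model`, `tts_model`) match the optional request fields noted in the example request, and the values come from the lists above:

```json
{
  "stt_model": "saaras:v3",
  "llm_model": "gpt-4o-mini",
  "tts_model": "bulbul-v3"
}
```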
## Edge Cases & Error Handling

### Empty STT Output
If speech-to-text returns an empty transcription:
- The chain fails immediately
- Error message: "STT returned no transcription"
- No subsequent blocks are executed

### Audio Size Limit
WhatsApp caps audio attachments at 16MB:
- TTS providers may fail if the output exceeds this limit
- The error is caught and reported in the callback
- Consider shorter responses or audio compression

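A client can estimate the decoded size of a base64 audio payload before sending it, to stay under the 16MB cap. This check is a client-side suggestion, not part of the API:

```python
import base64

# WhatsApp's audio attachment cap, per the limit described above.
MAX_AUDIO_BYTES = 16 * 1024 * 1024

def fits_whatsapp_limit(b64_audio: str) -> bool:
    """Check whether a base64-encoded audio payload decodes to <= 16MB."""
    decoded_size = len(base64.b64decode(b64_audio))
    return decoded_size <= MAX_AUDIO_BYTES
```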
### Invalid Audio Format
If the input audio format is unsupported:
- The STT provider fails with a format error
- The error is reported in the callback
- Supported formats: MP3, WAV, OGG, OPUS, M4A

### Provider Failures
Each block has independent error handling:
- STT fails → chain stops, STT error reported
- LLM fails → chain stops, RAG error reported
- TTS fails → chain stops, TTS error reported

## Example Request

```bash
curl -X POST https://api.kaapi.ai/llm/sts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "audio": {
    "type": "audio",
    "content": {
      "format": "base64",
      "value": "base64_encoded_audio_data",
      "mime_type": "audio/ogg"
    }
  },
  "knowledge_base_ids": ["kb_abc123"],
  "input_language": "hindi",
  "output_language": "english",
  "callback_url": "https://your-app.com/webhook"
}
EOF
```

**Note:** `stt_model`, `llm_model`, and `tts_model` are optional and will use defaults if not specified.
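
The same request can be issued from Python using only the standard library. This sketch mirrors the curl payload above; the audio bytes and API key are placeholders, and the final `urlopen` call is left commented so the snippet performs no network I/O as written:

```python
import base64
import json
import urllib.request

# In practice: audio_bytes = open("voice_note.ogg", "rb").read()
audio_bytes = b"placeholder-audio"
audio_b64 = base64.b64encode(audio_bytes).decode("ascii")

payload = {
    "audio": {
        "type": "audio",
        "content": {
            "format": "base64",
            "value": audio_b64,
            "mime_type": "audio/ogg",
        },
    },
    "knowledge_base_ids": ["kb_abc123"],
    "input_language": "hindi",
    "output_language": "english",
    "callback_url": "https://your-app.com/webhook",
}

req = urllib.request.Request(
    "https://api.kaapi.ai/llm/sts",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would fire the request.
```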

## Example Callbacks

### Callback 1: STT Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 1,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "stt_xyz789",
      "provider": "sarvamai-native",
      "model": "saarika:v1",
      "output": {
        "type": "text",
        "content": {
          "value": "नमस्ते, मुझे अपने अकाउंट के बारे में जानकारी चाहिए"
        }
      }
    },
    "usage": {
      "input_tokens": 0,
      "output_tokens": 12,
      "total_tokens": 12
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "input_language": "hi-IN"
  }
}
```

### Callback 2: LLM Output (Intermediate)
```json
{
  "success": true,
  "data": {
    "block_index": 2,
    "total_blocks": 3,
    "response": {
      "provider_response_id": "chatcmpl_abc123",
      "provider": "openai",
      "model": "gpt-4o",
      "output": {
        "type": "text",
        "content": {
          "value": "आपके अकाउंट में कुल बैलेंस ₹5,000 है। पिछले महीने में 3 ट्रांजैक्शन हुए हैं।"
        }
      }
    },
    "usage": {
      "input_tokens": 150,
      "output_tokens": 45,
      "total_tokens": 195
    }
  },
  "metadata": {
    "speech_to_speech": true
  }
}
```

### Callback 3: TTS Output (Final)
```json
{
  "success": true,
  "data": {
    "response": {
      "provider_response_id": "tts_def456",
      "provider": "sarvamai-native",
      "model": "bulbul:v1",
      "output": {
        "type": "audio",
        "content": {
          "format": "base64",
          "value": "base64_encoded_audio_output",
          "mime_type": "audio/ogg"
        }
      }
    },
    "usage": {
      "input_tokens": 15,
      "output_tokens": 0,
      "total_tokens": 15
    }
  },
  "metadata": {
    "speech_to_speech": true,
    "output_language": "hi-IN"
  }
}
```
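On the final TTS callback the audio arrives base64-encoded. A receiver can decode and persist it as shown in this sketch; the field names follow the Callback 3 example, while the function name and output path are placeholders of my own:

```python
import base64

def save_tts_audio(callback: dict, path: str = "reply.ogg") -> int:
    """Decode the base64 audio from a final TTS callback and write it to disk.

    Returns the number of bytes written. Field names follow the example
    callback payloads in this document.
    """
    content = callback["data"]["response"]["output"]["content"]
    audio_bytes = base64.b64decode(content["value"])
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)
```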

## Latency Tracking

Calculate latency from the times at which the callbacks arrive:
- **STT latency**: Time from request to first callback
- **LLM latency**: Time between first and second callback
- **TTS latency**: Time between second and third callback
- **Total latency**: Time from request to final callback

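The breakdown above can be computed from receiver-recorded arrival times; the API does not mandate a particular clock, so this sketch simply assumes you record `time.time()` when the request is sent and when each callback lands:

```python
# Derive per-stage latencies from receiver-recorded arrival times
# (seconds since epoch), per the breakdown above.

def sts_latencies(t_request: float, t_stt: float, t_llm: float, t_tts: float) -> dict:
    """Per-stage and total latency, in seconds."""
    return {
        "stt": t_stt - t_request,    # request → first callback
        "llm": t_llm - t_stt,        # first → second callback
        "tts": t_tts - t_llm,        # second → third callback
        "total": t_tts - t_request,  # request → final callback
    }
```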

## Best Practices

1. **Language Consistency**: If you are not translating, keep `input_language` equal to `output_language`
2. **Model Selection**: Prefer the Sarvam models for Indian languages (faster, better quality)
3. **Knowledge Base**: Ensure the knowledge base is properly indexed and relevant to the expected queries
4. **Error Handling**: Implement retry logic for transient provider failures
5. **Webhook Security**: Validate webhook signatures and use HTTPS
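
Practice 5 can be sketched as follows. The signing scheme shown here, HMAC-SHA256 over the raw body with a shared secret, is an assumption for illustration; this document does not specify the actual signature scheme, so use whatever your webhook sender documents:

```python
import hashlib
import hmac

# Assumed scheme for illustration: HMAC-SHA256 of the raw request body,
# hex-encoded, compared against a signature header from the sender.

def verify_signature(raw_body: bytes, received_sig: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels in the comparison
    return hmac.compare_digest(expected, received_sig)
```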

backend/app/api/main.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,6 +11,7 @@
     languages,
     llm,
     llm_chain,
+    llm_speech,
     organization,
     openai_conversation,
     project,
@@ -43,6 +44,7 @@
 api_router.include_router(languages.router)
 api_router.include_router(llm.router)
 api_router.include_router(llm_chain.router)
+api_router.include_router(llm_speech.router)
 api_router.include_router(login.router)
 api_router.include_router(onboarding.router)
 api_router.include_router(openai_conversation.router)
```
Lines changed: 141 additions & 0 deletions
```python
"""Speech-to-Speech (STS) API endpoint with RAG."""

import logging

from fastapi import APIRouter, Depends, HTTPException

from app.api.deps import AuthContextDep, SessionDep
from app.api.permissions import Permission, require_permission
from app.models import Message
from app.models.llm.request import (
    LLMChainRequest,
    QueryParams,
    SpeechToSpeechRequest,
)
from app.services.llm.chain.utils import (
    LANGUAGE_CODES,
    build_rag_block,
    build_stt_block,
    build_tts_block,
    get_language_code,
)
from app.services.llm.jobs import start_chain_job
from app.utils import APIResponse, load_description, validate_callback_url

logger = logging.getLogger(__name__)

router = APIRouter(tags=["LLM"])


@router.post(
    "/llm/sts",
    description=load_description("llm/speech_to_speech.md"),
    response_model=APIResponse[Message],
    dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
)
def speech_to_speech(
    _current_user: AuthContextDep,
    _session: SessionDep,
    request: SpeechToSpeechRequest,
):
    """
    Speech-to-speech (STS) endpoint with RAG.

    Executes a 3-block chain:
    1. STT (Speech-to-Text) - Transcribes audio to text (auto-detects language for Sarvam)
    2. RAG (Retrieval-Augmented Generation) - Processes text with knowledge base
    3. TTS (Text-to-Speech) - Converts response back to audio

    Input: Voice note (WhatsApp compatible)
    Output: Voice note + text (via callback)

    Edge cases:
    - Empty STT output: Chain fails with clear error
    - Audio > 16MB: TTS provider will fail (caught and reported)
    - Invalid audio format: STT provider will fail (caught and reported)
    """
    project_id = _current_user.project_.id
    organization_id = _current_user.organization_.id

    # Validate callback URL
    if request.callback_url:
        validate_callback_url(str(request.callback_url))

    # Validate and determine languages
    if request.input_language and request.input_language != "auto":
        if request.input_language not in LANGUAGE_CODES:
            raise HTTPException(
                status_code=400,
                detail=(
                    f"Unsupported input language: {request.input_language}. "
                    f"Supported: {', '.join(LANGUAGE_CODES.keys())}"
                ),
            )

    if request.output_language and request.output_language not in LANGUAGE_CODES:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Unsupported output language: {request.output_language}. "
                f"Supported: {', '.join(LANGUAGE_CODES.keys())}"
            ),
        )

    input_lang_code = get_language_code(request.input_language)
    output_lang_code = get_language_code(
        request.output_language, default=request.input_language or "auto"
    )

    logger.info(
        f"[speech_to_speech] Starting STS chain | "
        f"project_id={project_id}, "
        f"input_lang={input_lang_code}, "
        f"output_lang={output_lang_code}, "
        f"stt_model={request.stt_model.value}, "
        f"llm_model={request.llm_model.value}, "
        f"tts_model={request.tts_model.value}"
    )

    # Build 3-block chain: STT → RAG → TTS
    blocks = [
        build_stt_block(request.stt_model, input_lang_code),
        build_rag_block(request.llm_model, request.knowledge_base_ids),
        build_tts_block(request.tts_model, output_lang_code),
    ]

    # Add metadata to track STS-specific info
    metadata = request.request_metadata or {}
    metadata.update(
        {
            "speech_to_speech": True,
            "input_language": input_lang_code,
            "output_language": output_lang_code,
            "stt_model": request.stt_model.value,
            "llm_model": request.llm_model.value,
            "tts_model": request.tts_model.value,
        }
    )

    # Create chain request
    chain_request = LLMChainRequest(
        query=QueryParams(input=request.audio),
        blocks=blocks,
        callback_url=request.callback_url,
        request_metadata=metadata,
    )

    # Start async chain job
    start_chain_job(
        db=_session,
        request=chain_request,
        project_id=project_id,
        organization_id=organization_id,
    )

    return APIResponse.success_response(
        data=Message(
            message=(
                "Speech-to-speech processing initiated. "
                "You will receive intermediate callbacks for STT and LLM outputs, "
                "followed by the final callback with audio and text."
            )
        )
    )
```

0 commit comments