
Commit f15ba37

Merge pull request #8460 from ovh/mb-ai-endpints-replace-riva
[AI Endpoints] - Replace RIVA by Whisper Model
2 parents 25d679a + ddb1282 commit f15ba37

File tree

15 files changed: +1934 −884 lines

pages/public_cloud/ai_machine_learning/endpoints_tuto_02_voice_virtual_assistant/guide.de-de.md

Lines changed: 129 additions & 59 deletions
@@ -1,7 +1,7 @@
---
title: AI Endpoints - Create your own voice assistant
excerpt: Create a voice-enabled chatbot using ASR, LLM, and TTS endpoints in under 100 lines of code
-updated: 2025-07-31
+updated: 2025-10-01
---

> [!primary]
@@ -44,14 +44,16 @@ All of this is done by connecting **AI Endpoints** like puzzle pieces—allowing
In order to use AI Endpoints APIs easily, create a `.env` file to store environment variables:

```bash
-ASR_GRPC_ENDPOINT=nvr-asr-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
+ASR_AI_ENDPOINT=https://whisper-large-v3.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1
TTS_GRPC_ENDPOINT=nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443
LLM_AI_ENDPOINT=https://mixtral-8x7b-instruct-v01.endpoints.kepler.ai.cloud.ovh.net/api/openai_compat/v1
OVH_AI_ENDPOINTS_ACCESS_TOKEN=<ai-endpoints-api-token>
```

**Make sure to replace the token value (`OVH_AI_ENDPOINTS_ACCESS_TOKEN`) with yours.** If you do not have one yet, follow the instructions in the [AI Endpoints - Getting Started](/pages/public_cloud/ai_machine_learning/endpoints_guide_01_getting_started) guide.

+In this tutorial, we will be using the `Whisper-Large-V3` and `Mixtral-8x7b-Instruct-V01` models. Feel free to choose alternative models available on the [AI Endpoints catalog](https://catalog.endpoints.ai.ovh.net/).
+
Then, create a `requirements.txt` file with the following libraries:

```bash
@@ -89,62 +91,89 @@ After these lines, load and access the environment variables of your `.env` file
```python
# access the environment variables from the .env file
load_dotenv()
+
+ASR_AI_ENDPOINT = os.environ.get('ASR_AI_ENDPOINT')
+TTS_GRPC_ENDPOINT = os.environ.get('TTS_GRPC_ENDPOINT')
+LLM_AI_ENDPOINT = os.environ.get('LLM_AI_ENDPOINT')
+OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')
+```
+
+Next, define the clients that will be used to interact with the models:
+
+```python
+llm_client = OpenAI(
+    base_url=LLM_AI_ENDPOINT,
+    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN
+)
+
+tts_client = riva.client.SpeechSynthesisService(
+    riva.client.Auth(
+        uri=TTS_GRPC_ENDPOINT,
+        use_ssl=True,
+        metadata_args=[["authorization", f"bearer {OVH_AI_ENDPOINTS_ACCESS_TOKEN}"]]
+    )
+)
+
+asr_client = OpenAI(
+    base_url=ASR_AI_ENDPOINT,
+    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN
+)
```

💡 You are now ready to start coding your web app!

### Transcribe input question with ASR

-First, create the **Automatic Speech Recognition (ASR)** function in order to transcribe microphone input into text.
+First, create the **Automatic Speech Recognition (ASR)** function in order to transcribe microphone input into text:

```python
-def asr_transcription(question):
+def asr_transcription(question, asr_client):
+    return asr_client.audio.transcriptions.create(
+        model="whisper-large-v3",
+        file=question
+    ).text
+```

-    asr_service = riva.client.ASRService(
-        riva.client.Auth(uri=os.environ.get('ASR_GRPC_ENDPOINT'), use_ssl=True, metadata_args=[["authorization", f"bearer {os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}"]])
-    )
-
-    # set up config
-    asr_config = riva.client.RecognitionConfig(
-        language_code="en-US", # languages: en-US
-        max_alternatives=1,
-        enable_automatic_punctuation=True,
-        audio_channel_count = 1,
-    )
+**In this function:**
+
+- The audio input is sent from the microphone recording, as `question`.
+- An API call is made to the ASR AI Endpoint named `whisper-large-v3`.
+- The text from the transcript response is returned by the function.
+
+🎉 Now that you have this function, you are ready to transcribe audio files.
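
As a quick aside (not part of this commit): a minimal standalone smoke test of the Whisper-based function added above could look like the sketch below. It assumes the same `.env` file described earlier and a local recording saved as `question.wav`, which is a hypothetical file name.

```python
# Illustrative sketch only, not part of the committed guide.
# Assumes the .env file from the guide and a hypothetical local file question.wav.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Same Whisper endpoint and token as declared in the guide's .env file
asr_client = OpenAI(
    base_url=os.environ.get("ASR_AI_ENDPOINT"),
    api_key=os.environ.get("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
)

with open("question.wav", "rb") as audio_file:
    transcription = asr_client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```
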
+
+### Generate LLM response to input question

-    # get asr model response
-    response = asr_service.offline_recognize(question, asr_config)
+Now, create a function that calls the LLM client to provide responses to questions:

-    return response.results[0].alternatives[0].transcript
+```python
+def llm_answer(input, llm_client):
+    response = llm_client.chat.completions.create(
+        model="Mixtral-8x7B-Instruct-v0.1",
+        messages=input,
+        temperature=0,
+        max_tokens=1024,
+    )
+    msg = response.choices[0].message.content
+
+    return msg
```

**In this function:**

-- The audio input is sent from microphone recording
-- An API call is made to the ASR AI Endpoint named `nvr-asr-en-gb`
-- The full response is stored in `resp` variable and returned by the function
+- The conversation/messages are retrieved as parameters.
+- A call is made to the chat completion LLM endpoint, using the `Mixtral8x7B` model.
+- The model's response is extracted and the final message text is returned.

-🎉 Now that you have this function, you are ready to transcribe audio files.
-
-Now it’s time to implement the TTS to transform the LLM response into spoken words.
+⏳ Almost there! All that remains is to implement the TTS to transform the LLM response into spoken words.

### Return the response using TTS

Then, build the **Text To Speech (TTS)** function in order to transform the written answer into an oral reply:

-**What to do?**
-
-- The LLM response is retrieved
-- A call is made to the TTS AI endpoint named `nvr-tts-en-us`
-- The audio sample and the sample rate are returned to play the audio automatically
-
```python
-def tts_synthesis(response):
-
-    tts_service = riva.client.SpeechSynthesisService(
-        riva.client.Auth(uri=os.environ.get('TTS_GRPC_ENDPOINT'), use_ssl=True, metadata_args=[["authorization", f"bearer {os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}"]])
-    )
-
+def tts_synthesis(response, tts_client):
+
    # set up config
    sample_rate_hz = 48000
    req = {
@@ -153,31 +182,37 @@ def tts_synthesis(response):
        "sample_rate_hz" : sample_rate_hz, # sample rate: 48KHz audio
        "voice_name" : "English-US.Female-1" # voices: `English-US.Female-1`, `English-US.Male-1`
    }
-
+
    # return response
    req["text"] = response
-    response = tts_service.synthesize(**req)
-
-    return np.frombuffer(response.audio, dtype=np.int16), sample_rate_hz
+    synthesized_response = tts_client.synthesize(**req)
+
+    return np.frombuffer(synthesized_response.audio, dtype=np.int16), sample_rate_hz
```

+**In this function:**
+
+- The LLM response is retrieved.
+- A call is made to the TTS AI endpoint named `nvr-tts-en-us`.
+- The audio sample and the sample rate are returned to play the audio automatically.
+
⚡️ You're almost there! The final step is to build your web app, making your solution easy to use with just a few lines of code.

### Build the LLM chat app with Streamlit

-In this last step, create the chatbot app using [Mixtral8x7B](https://endpoints.ai.cloud.ovh.net/models/e2ecb4a7-98d5-420d-9789-e0aa6ddf0ffc) endpoint (or any other model) and [Streamlit](https://streamlit.io/), an open-source Python library that allows to quickly create user interfaces for Machine Learning models and demos. Here is a working code example:
+In this last step, create the chatbot app using [Streamlit](https://streamlit.io/), an open-source Python library that allows you to quickly create user interfaces for Machine Learning models and demos. Here is a working code example:

```python
# streamlit interface
with st.container():
    st.title("💬 Audio Virtual Assistant Chatbot")
-
+
    with st.container(height=600):
        messages = st.container()
-
+
        if "messages" not in st.session_state:
            st.session_state["messages"] = [{"role": "system", "content": "Hello, I'm AVA!", "avatar":"🤖"}]
-
+
        for msg in st.session_state.messages:
            messages.chat_message(msg["role"], avatar=msg["avatar"]).write(msg["content"])

@@ -191,26 +226,19 @@ with st.container():
            use_container_width=True,
            key='recorder'
        )
-
+
        if recording:
-            user_question = asr_transcription(recording['bytes'])
-
+            user_question = asr_transcription(recording['bytes'], asr_client)
+
            if prompt := user_question:
-                client = OpenAI(base_url=os.getenv("LLM_AI_ENDPOINT"), api_key=os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN'))
                st.session_state.messages.append({"role": "user", "content": prompt, "avatar":"👤"})
                messages.chat_message("user", avatar="👤").write(prompt)
-                response = client.chat.completions.create(
-                    model="Mixtral-8x7B-Instruct-v0.1",
-                    messages=st.session_state.messages,
-                    temperature=0,
-                    max_tokens=1024,
-                )
-                msg = response.choices[0].message.content
-                st.session_state.messages.append({"role": "system", "content": msg, "avatar": "🤖"})
+                msg = llm_answer(st.session_state.messages, llm_client)
+                st.session_state.messages.append({"role": "assistant", "content": msg, "avatar": "🤖"})
                messages.chat_message("system", avatar="🤖").write(msg)

                if msg is not None:
-                    audio_samples, sample_rate_hz = tts_synthesis(msg)
+                    audio_samples, sample_rate_hz = tts_synthesis(msg, tts_client)
                    placeholder.audio(audio_samples, sample_rate=sample_rate_hz, autoplay=True)
```

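Another illustrative aside (not part of this commit): the same three helpers can be exercised without the Streamlit UI. The sketch below assumes the clients and the `asr_transcription`, `llm_answer`, and `tts_synthesis` functions defined in the guide are already in scope, and uses a hypothetical local recording named `question.wav`.

```python
# Illustrative sketch, not part of the committed guide: run the ASR -> LLM -> TTS
# pipeline once, outside Streamlit. Assumes asr_client, llm_client, tts_client and
# the asr_transcription / llm_answer / tts_synthesis functions defined above.
with open("question.wav", "rb") as audio_file:  # hypothetical local recording
    question_text = asr_transcription(audio_file, asr_client)

conversation = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": question_text},
]
answer = llm_answer(conversation, llm_client)
print(answer)

# tts_synthesis returns int16 PCM samples and their sample rate, ready to be
# played (for example with st.audio) or written to a WAV file.
audio_samples, sample_rate_hz = tts_synthesis(answer, tts_client)
```
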
@@ -219,11 +247,53 @@ with st.container():
🚀 That’s it! Now your web app is ready to be used! You can start this Streamlit app locally by launching the following command:

```bash
-streamlit run audio-virtual-assistant.py
+streamlit run audio-virtual-assistant-app.py
```

![app-overview](images/app_overview.png)

+### Improvements
+
+By default, the `nvr-tts-en-us` model supports only a limited number of characters per request when generating audio. If you exceed this limit, you will encounter errors in your application.
+
+To work around this limitation, you can replace the existing `tts_synthesis` function with the following implementation, which processes text in chunks:
+
+```python
+def tts_synthesis(response, tts_client):
+    # Split response into chunks of max 1000 characters
+    max_chunk_length = 1000
+    words = response.split()
+    chunks = []
+    current_chunk = ""
+
+    for word in words:
+        if len(current_chunk) + len(word) + 1 <= max_chunk_length:
+            current_chunk += " " + word if current_chunk else word
+        else:
+            chunks.append(current_chunk)
+            current_chunk = word
+    if current_chunk:
+        chunks.append(current_chunk)
+
+    all_audio = np.array([], dtype=np.int16)
+    sample_rate_hz = 16000
+
+    # Process each chunk and concatenate the resulting audio
+    for text in chunks:
+        req = {
+            "language_code": "en-US",
+            "encoding": riva.client.AudioEncoding.LINEAR_PCM,
+            "sample_rate_hz": sample_rate_hz,
+            "voice_name": "English-US.Female-1",
+            "text": text.strip(),
+        }
+        synthesized = tts_client.synthesize(**req)
+        audio_segment = np.frombuffer(synthesized.audio, dtype=np.int16)
+        all_audio = np.concatenate((all_audio, audio_segment))
+
+    return all_audio, sample_rate_hz
+```
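
One more illustrative aside (not part of this commit): since the chunked `tts_synthesis` above returns raw int16 PCM samples, writing them to a WAV file with Python's built-in `wave` module is a simple way to check the concatenated audio outside Streamlit. The `save_wav` helper name and the `answer.wav` path are made up for this example.

```python
# Illustrative helper, not part of the committed guide: write the int16 samples
# returned by tts_synthesis() to a mono 16-bit WAV file for inspection.
import wave

import numpy as np


def save_wav(path, samples, sample_rate_hz):
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)    # the guide synthesizes a single audio channel
        wav_file.setsampwidth(2)    # LINEAR_PCM, 16-bit -> 2 bytes per sample
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(np.asarray(samples, dtype=np.int16).tobytes())


# Example usage with the values returned by the chunked function:
# audio_samples, sample_rate_hz = tts_synthesis(msg, tts_client)
# save_wav("answer.wav", audio_samples, sample_rate_hz)
```
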
+
## Conclusion

You’ve just created an Audio Virtual Assistant capable of natural conversation using voice, powered by ASR, LLM, and TTS endpoints.
@@ -245,4 +315,4 @@ If you need training or technical assistance to implement our solutions, contact

Please feel free to send us your questions, feedback, and suggestions regarding AI Endpoints and its features:

-- In the #ai-endpoints channel of the OVHcloud [Discord server](https://discord.gg/ovhcloud), where you can engage with the community and OVHcloud team members.
+- In the #ai-endpoints channel of the OVHcloud [Discord server](https://discord.gg/ovhcloud), where you can engage with the community and OVHcloud team members.
