Commit 7ffcb3b

sowu880 and pamelafox authored
Add speech recognizer and synthesis on browser interface (#113)
* update * edit website * update * update * update * update * update * fix bug * fix bug * update * Update app.py * fix bug * fix bug * update * update * update * update * merge * update * update * update * Add documentation * Skip types for speech * Optionality * Make test more flexible * Update e2e * Rm test results * Improve speech check * Update speech tests * fix e2e * Split input/output * More precise env vars, tests * full test coverage * More consistency between Chat/Ask * Revert unneeded changes * Add link to AAD token docs * Add more parameters to be able to reuse existing resources * Revert unneeded change

---------

Co-authored-by: Pamela Fox <[email protected]>
Co-authored-by: Pamela Fox <[email protected]>
1 parent 69b6e8a · commit 7ffcb3b

33 files changed (+706 −34 lines)

.azdo/pipelines/azure-dev.yml

Lines changed: 6 additions & 0 deletions

@@ -86,6 +86,12 @@ steps:
          AZURE_COMPUTER_VISION_RESOURCE_GROUP: $(AZURE_COMPUTER_VISION_RESOURCE_GROUP)
          AZURE_COMPUTER_VISION_LOCATION: $(AZURE_COMPUTER_VISION_LOCATION)
          AZURE_COMPUTER_VISION_SKU: $(AZURE_COMPUTER_VISION_SKU)
+         USE_SPEECH_INPUT_BROWSER: $(USE_SPEECH_INPUT_BROWSER)
+         USE_SPEECH_OUTPUT_AZURE: $(USE_SPEECH_OUTPUT_AZURE)
+         AZURE_SPEECH_SERVICE: $(AZURE_SPEECH_SERVICE)
+         AZURE_SPEECH_SERVICE_RESOURCE_GROUP: $(AZURE_SPEECH_SERVICE_RESOURCE_GROUP)
+         AZURE_SPEECH_SERVICE_LOCATION: $(AZURE_SPEECH_SERVICE_LOCATION)
+         AZURE_SPEECH_SERVICE_SKU: $(AZURE_SPEECH_SERVICE_SKU)
          AZURE_KEY_VAULT_NAME: $(AZURE_KEY_VAULT_NAME)
          AZURE_USE_AUTHENTICATION: $(AZURE_USE_AUTHENTICATION)
          AZURE_ENFORCE_ACCESS_CONTROL: $(AZURE_ENFORCE_ACCESS_CONTROL)

.github/workflows/azure-dev.yml

Lines changed: 6 additions & 0 deletions

@@ -73,6 +73,12 @@ jobs:
          USE_GPT4V: ${{ vars.USE_GPT4V }}
          AZURE_VISION_ENDPOINT: ${{ vars.AZURE_VISION_ENDPOINT }}
          VISION_SECRET_NAME: ${{ vars.VISION_SECRET_NAME }}
+         USE_SPEECH_INPUT_BROWSER: ${{ vars.USE_SPEECH_INPUT_BROWSER }}
+         USE_SPEECH_OUTPUT_AZURE: ${{ vars.USE_SPEECH_OUTPUT_AZURE }}
+         AZURE_SPEECH_SERVICE: ${{ vars.AZURE_SPEECH_SERVICE }}
+         AZURE_SPEECH_SERVICE_RESOURCE_GROUP: ${{ vars.AZURE_SPEECH_RESOURCE_GROUP }}
+         AZURE_SPEECH_SERVICE_LOCATION: ${{ vars.AZURE_SPEECH_SERVICE_LOCATION }}
+         AZURE_SPEECH_SERVICE_SKU: ${{ vars.AZURE_SPEECH_SERVICE_SKU }}
          AZURE_KEY_VAULT_NAME: ${{ vars.AZURE_KEY_VAULT_NAME }}
          AZURE_USE_AUTHENTICATION: ${{ vars.AZURE_USE_AUTHENTICATION }}
          AZURE_ENFORCE_ACCESS_CONTROL: ${{ vars.AZURE_ENFORCE_ACCESS_CONTROL }}

.github/workflows/python-test.yaml

Lines changed: 2 additions & 2 deletions

@@ -63,11 +63,11 @@ jobs:
        id: e2e
        if: runner.os != 'Windows'
        run: |
-         playwright install --with-deps
+         playwright install chromium --with-deps
          python3 -m pytest tests/e2e.py --tracing=retain-on-failure
      - name: Upload test artifacts
        if: ${{ failure() && steps.e2e.conclusion == 'failure' }}
        uses: actions/upload-artifact@v4
        with:
-         name: playwright-traces
+         name: playwright-traces${{ matrix.python_version }}
          path: test-results

README.md

Lines changed: 8 additions & 6 deletions

@@ -27,7 +27,6 @@
 - [Troubleshooting](#troubleshooting)
 - [Resources](#resources)

-
 [![Open in GitHub Codespaces](https://img.shields.io/static/v1?style=for-the-badge&label=GitHub+Codespaces&message=Open&color=brightgreen&logo=github)](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=599293758&machine=standardLinux32gb&devcontainer_path=.devcontainer%2Fdevcontainer.json&location=WestUs2)
 [![Open in Dev Containers](https://img.shields.io/static/v1?style=for-the-badge&label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/azure-samples/azure-search-openai-demo)

@@ -39,11 +38,14 @@ The repo includes sample data so it's ready to try end to end. In this sample ap

 ## Features

-* Chat and Q&A interfaces
-* Explores various options to help users evaluate the trustworthiness of responses with citations, tracking of source content, etc.
-* Shows possible approaches for data preparation, prompt construction, and orchestration of interaction between model (OpenAI) and retriever (AI Search)
-* Settings directly in the UX to tweak the behavior and experiment with options
-* Performance tracing and monitoring with Application Insights
+- Chat (multi-turn) and Q&A (single turn) interfaces
+- Renders citations and thought process for each answer
+- Includes settings directly in the UI to tweak the behavior and experiment with options
+- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [integrated vectorization](/docs/data_ingestion.md#overview-of-integrated-vectorization)
+- Optional usage of [GPT-4 with vision](/docs/gpt4vision.md) to reason over image-heavy documents
+- Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
+- Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra
+- Performance tracing and monitoring with Application Insights

 ![Chat screen](docs/images/chatscreen.png)

app/backend/app.py

Lines changed: 81 additions & 0 deletions

@@ -4,9 +4,17 @@
 import logging
 import mimetypes
 import os
+import time
 from pathlib import Path
 from typing import Any, AsyncGenerator, Dict, Union, cast

+from azure.cognitiveservices.speech import (
+    ResultReason,
+    SpeechConfig,
+    SpeechSynthesisOutputFormat,
+    SpeechSynthesisResult,
+    SpeechSynthesizer,
+)
 from azure.core.exceptions import ResourceNotFoundError
 from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider
 from azure.monitor.opentelemetry import configure_azure_monitor

@@ -48,11 +56,18 @@
     CONFIG_BLOB_CONTAINER_CLIENT,
     CONFIG_CHAT_APPROACH,
     CONFIG_CHAT_VISION_APPROACH,
+    CONFIG_CREDENTIAL,
     CONFIG_GPT4V_DEPLOYED,
     CONFIG_INGESTER,
     CONFIG_OPENAI_CLIENT,
     CONFIG_SEARCH_CLIENT,
     CONFIG_SEMANTIC_RANKER_DEPLOYED,
+    CONFIG_SPEECH_INPUT_ENABLED,
+    CONFIG_SPEECH_OUTPUT_ENABLED,
+    CONFIG_SPEECH_SERVICE_ID,
+    CONFIG_SPEECH_SERVICE_LOCATION,
+    CONFIG_SPEECH_SERVICE_TOKEN,
+    CONFIG_SPEECH_SERVICE_VOICE,
     CONFIG_USER_BLOB_CONTAINER_CLIENT,
     CONFIG_USER_UPLOAD_ENABLED,
     CONFIG_VECTOR_SEARCH_ENABLED,

@@ -229,10 +244,56 @@ def config():
             "showSemanticRankerOption": current_app.config[CONFIG_SEMANTIC_RANKER_DEPLOYED],
             "showVectorOption": current_app.config[CONFIG_VECTOR_SEARCH_ENABLED],
             "showUserUpload": current_app.config[CONFIG_USER_UPLOAD_ENABLED],
+            "showSpeechInput": current_app.config[CONFIG_SPEECH_INPUT_ENABLED],
+            "showSpeechOutput": current_app.config[CONFIG_SPEECH_OUTPUT_ENABLED],
         }
     )


+@bp.route("/speech", methods=["POST"])
+async def speech():
+    if not request.is_json:
+        return jsonify({"error": "request must be json"}), 415
+
+    speech_token = current_app.config.get(CONFIG_SPEECH_SERVICE_TOKEN)
+    if speech_token is None or speech_token.expires_on < time.time() + 60:
+        speech_token = await current_app.config[CONFIG_CREDENTIAL].get_token(
+            "https://cognitiveservices.azure.com/.default"
+        )
+        current_app.config[CONFIG_SPEECH_SERVICE_TOKEN] = speech_token
+
+    request_json = await request.get_json()
+    text = request_json["text"]
+    try:
+        # Construct a token as described in documentation:
+        # https://learn.microsoft.com/azure/ai-services/speech-service/how-to-configure-azure-ad-auth?pivots=programming-language-python
+        auth_token = (
+            "aad#"
+            + current_app.config[CONFIG_SPEECH_SERVICE_ID]
+            + "#"
+            + current_app.config[CONFIG_SPEECH_SERVICE_TOKEN].token
+        )
+        speech_config = SpeechConfig(auth_token=auth_token, region=current_app.config[CONFIG_SPEECH_SERVICE_LOCATION])
+        speech_config.speech_synthesis_voice_name = current_app.config[CONFIG_SPEECH_SERVICE_VOICE]
+        speech_config.speech_synthesis_output_format = SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
+        synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)
+        result: SpeechSynthesisResult = synthesizer.speak_text_async(text).get()
+        if result.reason == ResultReason.SynthesizingAudioCompleted:
+            return result.audio_data, 200, {"Content-Type": "audio/mp3"}
+        elif result.reason == ResultReason.Canceled:
+            cancellation_details = result.cancellation_details
+            current_app.logger.error(
+                "Speech synthesis canceled: %s %s", cancellation_details.reason, cancellation_details.error_details
+            )
+            raise Exception("Speech synthesis canceled. Check logs for details.")
+        else:
+            current_app.logger.error("Unexpected result reason: %s", result.reason)
+            raise Exception("Speech synthesis failed. Check logs for details.")
+    except Exception as e:
+        logging.exception("Exception in /speech")
+        return jsonify({"error": str(e)}), 500
+
+
 @bp.post("/upload")
 @authenticated
 async def upload(auth_claims: dict[str, Any]):

@@ -337,8 +398,14 @@ async def setup_clients():
     AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER", "lexicon")
     AZURE_SEARCH_SEMANTIC_RANKER = os.getenv("AZURE_SEARCH_SEMANTIC_RANKER", "free").lower()

+    AZURE_SPEECH_SERVICE_ID = os.getenv("AZURE_SPEECH_SERVICE_ID")
+    AZURE_SPEECH_SERVICE_LOCATION = os.getenv("AZURE_SPEECH_SERVICE_LOCATION")
+    AZURE_SPEECH_VOICE = os.getenv("AZURE_SPEECH_VOICE", "en-US-AndrewMultilingualNeural")
+
     USE_GPT4V = os.getenv("USE_GPT4V", "").lower() == "true"
     USE_USER_UPLOAD = os.getenv("USE_USER_UPLOAD", "").lower() == "true"
+    USE_SPEECH_INPUT_BROWSER = os.getenv("USE_SPEECH_INPUT_BROWSER", "").lower() == "true"
+    USE_SPEECH_OUTPUT_AZURE = os.getenv("USE_SPEECH_OUTPUT_AZURE", "").lower() == "true"

     # Use the current user identity to authenticate with Azure OpenAI, AI Search and Blob Storage (no secrets needed,
     # just use 'az login' locally, and managed identity when deployed on Azure). If you need to use keys, use separate AzureKeyCredential instances with the

@@ -421,6 +488,18 @@ async def setup_clients():
     # Used by the OpenAI SDK
     openai_client: AsyncOpenAI

+    if USE_SPEECH_OUTPUT_AZURE:
+        if not AZURE_SPEECH_SERVICE_ID or AZURE_SPEECH_SERVICE_ID == "":
+            raise ValueError("Azure speech resource not configured correctly, missing AZURE_SPEECH_SERVICE_ID")
+        if not AZURE_SPEECH_SERVICE_LOCATION or AZURE_SPEECH_SERVICE_LOCATION == "":
+            raise ValueError("Azure speech resource not configured correctly, missing AZURE_SPEECH_SERVICE_LOCATION")
+        current_app.config[CONFIG_SPEECH_SERVICE_ID] = AZURE_SPEECH_SERVICE_ID
+        current_app.config[CONFIG_SPEECH_SERVICE_LOCATION] = AZURE_SPEECH_SERVICE_LOCATION
+        current_app.config[CONFIG_SPEECH_SERVICE_VOICE] = AZURE_SPEECH_VOICE
+        # Wait until token is needed to fetch for the first time
+        current_app.config[CONFIG_SPEECH_SERVICE_TOKEN] = None
+        current_app.config[CONFIG_CREDENTIAL] = azure_credential
+
     if OPENAI_HOST.startswith("azure"):
         token_provider = get_bearer_token_provider(azure_credential, "https://cognitiveservices.azure.com/.default")

@@ -456,6 +535,8 @@ async def setup_clients():
     current_app.config[CONFIG_SEMANTIC_RANKER_DEPLOYED] = AZURE_SEARCH_SEMANTIC_RANKER != "disabled"
     current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = os.getenv("USE_VECTORS", "").lower() != "false"
     current_app.config[CONFIG_USER_UPLOAD_ENABLED] = bool(USE_USER_UPLOAD)
+    current_app.config[CONFIG_SPEECH_INPUT_ENABLED] = USE_SPEECH_INPUT_BROWSER
+    current_app.config[CONFIG_SPEECH_OUTPUT_ENABLED] = USE_SPEECH_OUTPUT_AZURE

     # Various approaches to integrate GPT and external knowledge, most applications will use a single one of these patterns
     # or some derivative, here we include several for exploration purposes
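
For reference, the new `/speech` route accepts a JSON body with a `text` field and returns raw MP3 bytes on success (415 for non-JSON requests, 500 with a JSON error otherwise). Below is a minimal sketch, not code from this commit, of exercising the endpoint from a script; it uses `requests` (not a dependency of this repo), the local base URL/port is an assumption, and the call only succeeds when `USE_SPEECH_OUTPUT_AZURE` and the speech service settings above are configured.

```python
# Minimal sketch of calling the new /speech endpoint (not part of this commit).
# The host/port below is an assumption for a locally running backend.
import requests

BASE_URL = "http://localhost:50505"  # assumed local dev address of the Quart app

resp = requests.post(f"{BASE_URL}/speech", json={"text": "Hello, this answer is spoken aloud."})
resp.raise_for_status()

# On success the handler returns synthesized MP3 audio with Content-Type audio/mp3
assert resp.headers["Content-Type"].startswith("audio/mp3")
with open("answer.mp3", "wb") as audio_file:
    audio_file.write(resp.content)
```

The browser UI presumably plays these bytes directly rather than saving them, but the request/response contract is the same.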

app/backend/config.py

Lines changed: 6 additions & 0 deletions

@@ -14,3 +14,9 @@
 CONFIG_SEARCH_CLIENT = "search_client"
 CONFIG_OPENAI_CLIENT = "openai_client"
 CONFIG_INGESTER = "ingester"
+CONFIG_SPEECH_INPUT_ENABLED = "speech_input_enabled"
+CONFIG_SPEECH_OUTPUT_ENABLED = "speech_output_enabled"
+CONFIG_SPEECH_SERVICE_ID = "speech_service_id"
+CONFIG_SPEECH_SERVICE_LOCATION = "speech_service_location"
+CONFIG_SPEECH_SERVICE_TOKEN = "speech_service_token"
+CONFIG_SPEECH_SERVICE_VOICE = "speech_service_voice"
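
These constants are plain string keys used to stash per-app state on Quart's `app.config` at startup and read it back inside request handlers via `current_app`, as the `app.py` changes above do. A standalone sketch of that pattern (illustrative only, not code from this repo):

```python
# Illustrative sketch of the config-key pattern (not code from this commit).
from quart import Quart, current_app

CONFIG_SPEECH_SERVICE_VOICE = "speech_service_voice"

app = Quart(__name__)
# Set once at startup...
app.config[CONFIG_SPEECH_SERVICE_VOICE] = "en-US-AndrewMultilingualNeural"


@app.route("/voice")
async def voice():
    # ...and read back inside any request handler via current_app
    return {"voice": current_app.config[CONFIG_SPEECH_SERVICE_VOICE]}
```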

app/backend/requirements.in

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@ openai[datalib]>=1.3.7
 tiktoken
 tenacity
 azure-ai-documentintelligence
+azure-cognitiveservices-speech
 azure-search-documents==11.6.0b1
 azure-storage-blob
 azure-storage-file-datalake

app/backend/requirements.txt

Lines changed: 2 additions & 0 deletions

@@ -24,6 +24,8 @@ attrs==23.2.0
     # via aiohttp
 azure-ai-documentintelligence==1.0.0b3
     # via -r requirements.in
+azure-cognitiveservices-speech==1.37.0
+    # via -r requirements.in
 azure-common==1.1.28
     # via azure-search-documents
 azure-core==1.30.1

app/frontend/package-lock.json

Lines changed: 13 additions & 0 deletions
Generated file; diff not rendered.

app/frontend/package.json

Lines changed: 1 addition & 0 deletions

@@ -35,6 +35,7 @@
     "prettier": "^3.0.3",
     "typescript": "^5.2.2",
     "@types/react-syntax-highlighter": "^15.5.7",
+    "@types/dom-speech-recognition": "^0.0.4",
     "vite": "^4.5.3"
   }
 }
