feat: Add OpenAI-compatible audio endpoints for frontend device integration#911
Add three new audio endpoints for frontend device integration:

- `POST /v1/audio/transcriptions` - Speech-to-text (OpenAI-compatible)
- `POST /v1/audio/speech` - Text-to-speech (OpenAI-compatible)
- `POST /v1/audio/chat` - Combined audio-in -> agent -> audio-out

Features:

- Uses configured STT provider (OpenAI Whisper or Groq)
- Uses configured TTS provider (OpenAI, Groq, or Gemini)
- Supports streaming mode with SSE for real-time updates
- Conversation continuity via `conversation_id`
- 25MB file size limit (matches OpenAI)
- Supports mp3, mp4, m4a, wav, webm, ogg, flac formats
🤖 Augment PR Summary

Summary: Adds OpenAI-compatible audio endpoints to the desktop remote server so frontend devices can send audio for STT and receive synthesized audio for TTS.

Changes:

Technical Notes: Endpoints reuse the existing remote-server Bearer token auth and respect configured provider base URLs and per-request overrides (model/voice/etc.).
```typescript
const model = options?.model ||
  (config.sttProviderId === "groq" ? "whisper-large-v3-turbo" : "whisper-1")
form.append("model", model)
form.append("response_format", options?.response_format || "json")
```
transcribeAudio() forwards response_format to the upstream provider, but the helper always parses the response via transcriptResponse.json(). If a client requests response_format=text (which this PR advertises), the upstream response won’t be JSON and this will likely throw/500.
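One way to address this, sketched below under assumptions (the `SttFormat` type and the `isJsonFormat`/`readTranscript` helpers are hypothetical names, not identifiers from this PR), is to branch on the requested format before parsing the upstream response:

```typescript
// Sketch of a fix: parse the upstream STT response according to the
// requested response_format instead of always calling .json().
// SttFormat, isJsonFormat, and readTranscript are hypothetical names.
type SttFormat = "json" | "verbose_json" | "text";

// Only the JSON variants should go through .json(); "text" is plain text.
function isJsonFormat(format: SttFormat): boolean {
  return format !== "text";
}

// The Response-like parameter shape stands in for the real fetch Response.
async function readTranscript(
  res: { json(): Promise<unknown>; text(): Promise<string> },
  format: SttFormat
): Promise<unknown> {
  return isJsonFormat(format) ? res.json() : res.text();
}
```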
```typescript
{
  method: "POST",
  headers: {
    Authorization: `Bearer ${config.sttProviderId === "groq" ? config.groqApiKey : config.openaiApiKey}`,
```
Summary
Adds OpenAI-compatible audio endpoints to the remote server, enabling frontend devices to send and receive audio instead of text.
New Endpoints
1. POST /v1/audio/transcriptions (OpenAI-compatible)

Transcribes audio to text using the configured STT provider (OpenAI Whisper or Groq).

Request: multipart/form-data

- file
- model (whisper-1, or whisper-large-v3-turbo for Groq)
- language (e.g. en, es, fr)
- prompt
- response_format: json (default), text, or verbose_json

Example:
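As a minimal client-side sketch (the host, port, and token below are placeholders, not values from this PR; the path and field names are from the description above):

```typescript
// Hypothetical client sketch: build a multipart transcription request.
// Host, port, and token are placeholders; field names follow the PR.
const form = new FormData();
// A real client would append recorded audio here; this is a stub blob.
form.append("file", new Blob([new Uint8Array(8)], { type: "audio/wav" }), "clip.wav");
form.append("model", "whisper-1");       // or "whisper-large-v3-turbo" for Groq
form.append("language", "en");           // optional language hint
form.append("response_format", "json");  // "json" (default), "text", or "verbose_json"

const url = "http://127.0.0.1:8080/v1/audio/transcriptions"; // placeholder host/port
const headers = { Authorization: "Bearer YOUR_TOKEN" };      // placeholder token
// A real client would then send:
// await fetch(url, { method: "POST", headers, body: form })
```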
Response:
```json
{ "text": "Hello, how can I help you today?" }
```

2. POST /v1/audio/speech (OpenAI-compatible)

Generates speech audio from text using the configured TTS provider (OpenAI, Groq, or Gemini).
Request:
application/json

- input
- model
- voice (e.g. alloy, echo, nova for OpenAI)
- speed
- response_format: mp3, opus, aac, flac, wav, or pcm

Example:
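A hedged sketch of the JSON body a client might send (the model name and values here are illustrative assumptions; only the field names come from this PR):

```typescript
// Hypothetical client sketch: JSON body for POST /v1/audio/speech.
// Field names follow the PR description; values are illustrative.
const speechBody = {
  input: "Hello, how can I help you today?",
  model: "tts-1",          // assumed model name, not specified in the PR
  voice: "alloy",          // e.g. alloy, echo, nova for OpenAI
  speed: 1.0,
  response_format: "mp3",  // mp3, opus, aac, flac, wav, pcm
};
const payload = JSON.stringify(speechBody);
// A real client would POST this with Content-Type: application/json:
// await fetch(url, { method: "POST", headers, body: payload })
```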
Response: Binary audio file with an appropriate Content-Type header

3. POST /v1/audio/chat (Custom Combined Endpoint)

All-in-one endpoint: Send audio → Transcribe → Run agent → Return text + optional audio response.
Request:
multipart/form-data

- file
- conversation_id
- return_audio (true/false)
- stream
- language
- stt_model
- tts_model
- voice

Example (non-streaming):
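A minimal sketch of the multipart form a client might build (the field names come from this PR; the audio blob and conversation id are stand-ins):

```typescript
// Hypothetical client sketch: multipart body for POST /v1/audio/chat.
// Field names follow the PR; the blob and ids are stub values.
const chatForm = new FormData();
chatForm.append("file", new Blob([new Uint8Array(8)], { type: "audio/webm" }), "question.webm");
chatForm.append("conversation_id", "conv_abc123"); // continue an earlier conversation
chatForm.append("return_audio", "true");           // also synthesize the agent's reply
chatForm.append("stream", "false");                // non-streaming: one JSON response
// A real client would then POST chatForm to /v1/audio/chat with the Bearer token.
```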
Response (non-streaming):
```json
{
  "transcription": "What tools do you have available?",
  "content": "I have access to several tools including...",
  "conversation_id": "conv_abc123",
  "conversation_history": [...],
  "model": "gpt-4o",
  "audio": "base64_encoded_audio_data",
  "audio_content_type": "audio/mpeg"
}
```

Example (streaming with SSE):
SSE Events:
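The exact event names and payload shapes are not reproduced in this text, so as a hedged illustration only (`parseSseChunk` is a hypothetical helper, and the payload fields in the usage note are assumptions), a client can split an SSE chunk into its `data:` payloads like this:

```typescript
// Hypothetical helper: extract JSON payloads from a raw SSE chunk.
// The server's actual event names/shapes are not shown in this PR text.
function parseSseChunk(chunk: string): unknown[] {
  return chunk
    .split("\n")
    .filter((line) => line.startsWith("data: "))
    .map((line) => JSON.parse(line.slice("data: ".length)));
}
```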
Configuration
The endpoints use the existing SpeakMCP configuration:
- sttProviderId (openai or groq)
- ttsProviderId (openai, groq, or gemini)

Limits
Authentication
All endpoints require the same Bearer token authentication as other remote server endpoints:
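For example (a sketch; the token value is a placeholder, not a real credential format from this PR):

```typescript
// Attach the remote server's Bearer token to every request.
// The token string here is a placeholder.
const authHeaders = new Headers({
  Authorization: "Bearer YOUR_REMOTE_SERVER_TOKEN",
});
// Pass authHeaders (plus any content-type) to fetch() for each endpoint.
```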
Error Handling
All endpoints return JSON errors:
```json
{ "error": "Error message here" }
```

Common HTTP status codes:

- 400 - Bad request (missing file, invalid input)
- 401 - Unauthorized (invalid/missing API key)
- 500 - Server error (transcription/TTS failed)

Testing
Pull Request opened by Augment Code with guidance from the PR author