
feat: Add OpenAI-compatible audio endpoints for frontend device integration#911

Open
aj47 wants to merge 1 commit into `main` from `feature/audio-endpoints`

Conversation

@aj47 (Owner) commented Jan 10, 2026

Summary

Adds OpenAI-compatible audio endpoints to the remote server, enabling frontend devices to send and receive audio instead of text.

New Endpoints

1. POST /v1/audio/transcriptions (OpenAI-compatible)

Transcribes audio to text using the configured STT provider (OpenAI Whisper or Groq).

Request: multipart/form-data

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | File | Yes | Audio file (mp3, mp4, m4a, wav, webm, ogg, flac) |
| `model` | String | No | Model to use (default: `whisper-1` for OpenAI, `whisper-large-v3-turbo` for Groq) |
| `language` | String | No | ISO-639-1 language code (e.g., `en`, `es`, `fr`) |
| `prompt` | String | No | Optional context/prompt to guide transcription |
| `response_format` | String | No | `json` (default), `text`, or `verbose_json` |

Example:

```bash
curl -X POST http://localhost:3210/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=en"
```

Response:

```json
{ "text": "Hello, how can I help you today?" }
```
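For devices without a curl-style client, the request above can be sketched in TypeScript. This is an illustrative helper, not code from this PR: it hand-assembles the `multipart/form-data` body the endpoint expects (runtimes with a built-in `FormData` can use that instead).

```typescript
// Illustrative sketch (not from this PR): build a multipart/form-data body
// for POST /v1/audio/transcriptions by hand, to make the wire format explicit.
function buildMultipartBody(
  boundary: string,
  audio: Buffer,
  filename: string,
  fields: Record<string, string>,
): Buffer {
  const parts: Buffer[] = [];
  // Plain text fields such as model / language / prompt / response_format.
  for (const [name, value] of Object.entries(fields)) {
    parts.push(Buffer.from(
      `--${boundary}\r\nContent-Disposition: form-data; name="${name}"\r\n\r\n${value}\r\n`,
    ));
  }
  // The audio itself goes under the "file" field, per the table above.
  parts.push(Buffer.from(
    `--${boundary}\r\nContent-Disposition: form-data; name="file"; filename="${filename}"\r\n` +
    `Content-Type: application/octet-stream\r\n\r\n`,
  ));
  parts.push(audio, Buffer.from(`\r\n--${boundary}--\r\n`));
  return Buffer.concat(parts);
}
```

Sending it is then a single POST with `Content-Type: multipart/form-data; boundary=...` plus the usual Bearer header.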

2. POST /v1/audio/speech (OpenAI-compatible)

Generates speech audio from text using the configured TTS provider (OpenAI, Groq, or Gemini).

Request: application/json

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `input` | String | Yes | Text to convert to speech (max ~4096 chars recommended) |
| `model` | String | No | TTS model (default from config) |
| `voice` | String | No | Voice ID (e.g., `alloy`, `echo`, `nova` for OpenAI) |
| `speed` | Number | No | Speed multiplier, 0.25–4.0 (OpenAI only) |
| `response_format` | String | No | `mp3`, `opus`, `aac`, `flac`, `wav`, or `pcm` |

Example:

```bash
curl -X POST http://localhost:3210/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world", "voice": "alloy"}' \
  --output speech.mp3
```

Response: Binary audio file with appropriate Content-Type header
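Before POSTing, a client can validate the documented constraints locally. A minimal sketch, assuming only the field rules in the table above (the function name is hypothetical, not part of this PR):

```typescript
// Illustrative request builder for /v1/audio/speech (not from this PR).
// Enforces the documented rules: input is required, speed must be 0.25-4.0.
interface SpeechRequest {
  input: string;
  model?: string;
  voice?: string;
  speed?: number;
  response_format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
}

function buildSpeechRequest(req: SpeechRequest): string {
  if (!req.input) throw new Error("input is required");
  if (req.speed !== undefined && (req.speed < 0.25 || req.speed > 4.0)) {
    throw new Error("speed must be between 0.25 and 4.0");
  }
  // Serialized body for the POST; send with Content-Type: application/json.
  return JSON.stringify(req);
}
```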


3. POST /v1/audio/chat (Custom Combined Endpoint)

All-in-one endpoint: Send audio → Transcribe → Run agent → Return text + optional audio response.

Request: multipart/form-data

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | File | Yes | Audio file to transcribe and process |
| `conversation_id` | String | No | Continue an existing conversation |
| `return_audio` | Boolean | No | Include TTS audio in the response (`true`/`false`) |
| `stream` | Boolean | No | Use SSE streaming for real-time updates |
| `language` | String | No | STT language code |
| `stt_model` | String | No | Override the STT model |
| `tts_model` | String | No | Override the TTS model |
| `voice` | String | No | TTS voice for the audio response |

Example (non-streaming):

```bash
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@question.mp3" \
  -F "return_audio=true" \
  -F "conversation_id=conv_abc123"
```

Response (non-streaming):

```json
{
  "transcription": "What tools do you have available?",
  "content": "I have access to several tools including...",
  "conversation_id": "conv_abc123",
  "conversation_history": [...],
  "model": "gpt-4o",
  "audio": "base64_encoded_audio_data",
  "audio_content_type": "audio/mpeg"
}
```
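When `return_audio=true`, the `audio` field is base64-encoded. A small decoding sketch (field names follow the example response above; the helper itself is hypothetical):

```typescript
// Illustrative decoder for the non-streaming /v1/audio/chat response.
// Shape mirrors the example above; conversation_history/model omitted for brevity.
interface AudioChatResponse {
  transcription: string;
  content: string;
  conversation_id: string;
  audio?: string;              // base64-encoded when return_audio=true
  audio_content_type?: string; // e.g. "audio/mpeg"
}

function decodeAudio(res: AudioChatResponse): Buffer | null {
  if (!res.audio) return null;
  // Node's Buffer decodes base64 directly; write the result straight to disk
  // or hand it to an audio player on the device.
  return Buffer.from(res.audio, "base64");
}

// e.g. fs.writeFileSync("reply.mp3", decodeAudio(res)!);
```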

Example (streaming with SSE):

```bash
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@question.mp3" \
  -F "stream=true" \
  -F "return_audio=true"
```

SSE Events:

```text
data: {"type": "transcription", "data": {"text": "What tools do you have?"}}

data: {"type": "progress", "data": {"step": "thinking", "message": "Processing..."}}

data: {"type": "progress", "data": {"step": "tool_call", "toolName": "list_tools"}}

data: {"type": "done", "data": {"transcription": "...", "content": "...", "conversation_id": "...", "audio": "base64...", "audio_content_type": "audio/mpeg"}}
```
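A client consuming this stream needs to split on `data:` lines and decode each JSON payload. A minimal parser sketch, assuming one complete `data:` line per event as in the examples above:

```typescript
// Illustrative SSE parser (not from this PR). Event type names are taken
// from the example events above; unknown types pass through as strings.
interface SseEvent {
  type: "transcription" | "progress" | "done" | string;
  data: Record<string, unknown>;
}

function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const line of chunk.split("\n")) {
    // SSE payload lines are prefixed "data: "; blank lines separate events.
    if (line.startsWith("data: ")) {
      events.push(JSON.parse(line.slice(6)) as SseEvent);
    }
  }
  return events;
}
```

A real client would buffer partial chunks until a terminating blank line arrives; this sketch assumes whole lines.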

Configuration

The endpoints use the existing SpeakMCP configuration:

  • STT Provider: sttProviderId (openai or groq)
  • TTS Provider: ttsProviderId (openai, groq, or gemini)
  • API Keys: Uses configured keys for each provider
  • Base URLs: Respects custom base URLs if configured

Limits

  • Max file size: 25MB (matches OpenAI)
  • Supported formats: mp3, mp4, m4a, wav, webm, ogg, flac
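Clients can mirror these limits before uploading to avoid a round trip that is bound to fail with a 400. A sketch with the constants copied from the limits above (the helper name is illustrative):

```typescript
// Client-side pre-check mirroring the server limits (not part of this PR).
const MAX_BYTES = 25 * 1024 * 1024; // 25MB, matching the server limit
const FORMATS = new Set(["mp3", "mp4", "m4a", "wav", "webm", "ogg", "flac"]);

// Returns null when the file is acceptable, or a human-readable reason.
function validateUpload(filename: string, sizeBytes: number): string | null {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  if (!FORMATS.has(ext)) return `unsupported format: ${ext}`;
  if (sizeBytes > MAX_BYTES) return "file exceeds 25MB limit";
  return null;
}
```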

Authentication

All endpoints require the same Bearer token authentication as other remote server endpoints:

```text
Authorization: Bearer YOUR_REMOTE_SERVER_API_KEY
```

Error Handling

All endpoints return JSON errors:

```json
{
  "error": "Error message here"
}
```

Common HTTP status codes:

  • 400 - Bad request (missing file, invalid input)
  • 401 - Unauthorized (invalid/missing API key)
  • 500 - Server error (transcription/TTS failed)
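A client can fold this error shape into a single check. Sketch (illustrative names; assumes the JSON error body shown above, with a fallback for non-JSON bodies):

```typescript
// Illustrative error check (not from this PR): returns the body on success,
// otherwise throws with the server's "error" message when one is present.
function checkResponse(status: number, body: string): string {
  if (status === 200) return body;
  let message = `HTTP ${status}`;
  try {
    const parsed = JSON.parse(body);
    if (parsed.error) message = `HTTP ${status}: ${parsed.error}`;
  } catch {
    // Non-JSON error body; keep the generic message.
  }
  throw new Error(message);
}
```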

Testing

```bash
# Test transcription
curl -X POST http://localhost:3210/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test.mp3"

# Test speech generation
curl -X POST http://localhost:3210/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Testing TTS"}' \
  --output test-output.mp3

# Test full audio chat
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test.mp3" \
  -F "return_audio=true"
```

Pull Request opened by Augment Code with guidance from the PR author

Add three new audio endpoints for frontend device integration:

- POST /v1/audio/transcriptions - Speech-to-text (OpenAI-compatible)
- POST /v1/audio/speech - Text-to-speech (OpenAI-compatible)
- POST /v1/audio/chat - Combined audio-in -> agent -> audio-out

Features:
- Uses configured STT provider (OpenAI Whisper or Groq)
- Uses configured TTS provider (OpenAI, Groq, or Gemini)
- Supports streaming mode with SSE for real-time updates
- Conversation continuity via conversation_id
- 25MB file size limit (matches OpenAI)
- Supports mp3, mp4, m4a, wav, webm, ogg, flac formats

augmentcode bot commented Jan 10, 2026

🤖 Augment PR Summary

Summary: Adds OpenAI-compatible audio endpoints to the desktop remote server so frontend devices can send audio for STT and receive synthesized audio for TTS.

Changes:

  • Adds @fastify/multipart and configures a 25MB upload limit for audio files.
  • Introduces a transcription helper that forwards audio to the configured STT provider (OpenAI or Groq Whisper), including optional model/language/prompt.
  • Introduces a speech-generation helper for OpenAI/Groq/Gemini TTS, reusing existing TTS preprocessing + validation.
  • Adds POST /v1/audio/transcriptions and POST /v1/audio/speech endpoints compatible with OpenAI’s API shapes.
  • Adds a combined POST /v1/audio/chat endpoint (STT → agent → optional TTS) with optional SSE streaming and optional base64 audio payload.

Technical Notes: Endpoints reuse the existing remote-server Bearer token auth and respect configured provider base URLs and per-request overrides (model/voice/etc.).


@augmentcode bot left a comment

Review completed. 2 suggestions posted.

```typescript
const model = options?.model ||
  (config.sttProviderId === "groq" ? "whisper-large-v3-turbo" : "whisper-1")
form.append("model", model)
form.append("response_format", options?.response_format || "json")
```

`transcribeAudio()` forwards `response_format` to the upstream provider, but the helper always parses the response via `transcriptResponse.json()`. If a client requests `response_format=text` (which this PR advertises), the upstream response won't be JSON and this will likely throw/500.
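One way to address this, sketched here as a standalone function rather than taken from the codebase, is to branch on the requested format before parsing:

```typescript
// Illustrative fix sketch (not the PR's code): parse the upstream STT
// response according to the response_format the client asked for.
function parseTranscript(body: string, format: string): string {
  if (format === "text") {
    // Plain-text responses typically end with a newline; trim it.
    return body.trim();
  }
  // "json" and "verbose_json" both carry the transcript in a "text" field.
  return JSON.parse(body).text;
}
```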


```typescript
{
  method: "POST",
  headers: {
    Authorization: `Bearer ${config.sttProviderId === "groq" ? config.groqApiKey : config.openaiApiKey}`,
```

If `groqApiKey`/`openaiApiKey` is unset, this sends `Authorization: Bearer undefined` to the upstream STT provider. That can produce confusing failures (and can break OpenAI-compatible servers that expect no auth header).
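A possible shape for the fix (illustrative, not the PR's code): build the headers conditionally so `Bearer undefined` is never sent, or fail fast with a clear error before calling upstream:

```typescript
// Illustrative sketch: only attach Authorization when a key is configured.
function sttHeaders(apiKey: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  // Alternative: fail fast with a descriptive error instead of a silent omit:
  // if (!apiKey) throw new Error("STT provider API key is not configured");
  return headers;
}
```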

