
feat: Add OpenAI-compatible audio endpoints for frontend device integration#911

Open
aj47 wants to merge 1 commit into `main` from `feature/audio-endpoints`

Conversation

@aj47 (Owner) commented Jan 10, 2026

Summary

Adds OpenAI-compatible audio endpoints to the remote server, enabling frontend devices to send and receive audio instead of text.

New Endpoints

1. POST /v1/audio/transcriptions (OpenAI-compatible)

Transcribes audio to text using the configured STT provider (OpenAI Whisper or Groq).

Request: multipart/form-data

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | File | Yes | Audio file (mp3, mp4, m4a, wav, webm, ogg, flac) |
| `model` | String | No | Model to use (default: `whisper-1` for OpenAI, `whisper-large-v3-turbo` for Groq) |
| `language` | String | No | ISO-639-1 language code (e.g., `en`, `es`, `fr`) |
| `prompt` | String | No | Optional context/prompt to guide transcription |
| `response_format` | String | No | `json` (default), `text`, or `verbose_json` |

Example:

```bash
curl -X POST http://localhost:3210/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=en"
```

Response:

```json
{ "text": "Hello, how can I help you today?" }
```
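For devices without a curl-style client, the request above can be sketched in TypeScript. This is an illustrative helper, not code from this PR: it hand-assembles the `multipart/form-data` body the endpoint expects (runtimes with a built-in `FormData` can use that instead).

```typescript
// Illustrative sketch (not from this PR): build a multipart/form-data body
// for POST /v1/audio/transcriptions by hand, to make the wire format explicit.
function buildMultipartBody(
  boundary: string,
  audio: Buffer,
  filename: string,
  fields: Record<string, string>,
): Buffer {
  const parts: Buffer[] = [];
  // Plain text fields such as model / language / prompt / response_format.
  for (const [name, value] of Object.entries(fields)) {
    parts.push(Buffer.from(
      `--${boundary}\r\nContent-Disposition: form-data; name="${name}"\r\n\r\n${value}\r\n`,
    ));
  }
  // The audio itself goes under the "file" field, per the table above.
  parts.push(Buffer.from(
    `--${boundary}\r\nContent-Disposition: form-data; name="file"; filename="${filename}"\r\n` +
    `Content-Type: application/octet-stream\r\n\r\n`,
  ));
  parts.push(audio, Buffer.from(`\r\n--${boundary}--\r\n`));
  return Buffer.concat(parts);
}
```

Sending it is then a single POST with `Content-Type: multipart/form-data; boundary=...` plus the usual Bearer header.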

2. POST /v1/audio/speech (OpenAI-compatible)

Generates speech audio from text using the configured TTS provider (OpenAI, Groq, or Gemini).

Request: application/json

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `input` | String | Yes | Text to convert to speech (max ~4096 chars recommended) |
| `model` | String | No | TTS model (default from config) |
| `voice` | String | No | Voice ID (e.g., `alloy`, `echo`, `nova` for OpenAI) |
| `speed` | Number | No | Speed multiplier, 0.25–4.0 (OpenAI only) |
| `response_format` | String | No | `mp3`, `opus`, `aac`, `flac`, `wav`, or `pcm` |

Example:

```bash
curl -X POST http://localhost:3210/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world", "voice": "alloy"}' \
  --output speech.mp3
```

Response: Binary audio file with appropriate Content-Type header
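Before POSTing, a client can validate the documented constraints locally. A minimal sketch, assuming only the field rules in the table above (the function name is hypothetical, not part of this PR):

```typescript
// Illustrative request builder for /v1/audio/speech (not from this PR).
// Enforces the documented rules: input is required, speed must be 0.25-4.0.
interface SpeechRequest {
  input: string;
  model?: string;
  voice?: string;
  speed?: number;
  response_format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
}

function buildSpeechRequest(req: SpeechRequest): string {
  if (!req.input) throw new Error("input is required");
  if (req.speed !== undefined && (req.speed < 0.25 || req.speed > 4.0)) {
    throw new Error("speed must be between 0.25 and 4.0");
  }
  // Serialized body for the POST; send with Content-Type: application/json.
  return JSON.stringify(req);
}
```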


3. POST /v1/audio/chat (Custom Combined Endpoint)

All-in-one endpoint: Send audio → Transcribe → Run agent → Return text + optional audio response.

Request: multipart/form-data

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | File | Yes | Audio file to transcribe and process |
| `conversation_id` | String | No | Continue an existing conversation |
| `return_audio` | Boolean | No | Include TTS audio in the response (`true`/`false`) |
| `stream` | Boolean | No | Use SSE streaming for real-time updates |
| `language` | String | No | STT language code |
| `stt_model` | String | No | Override the STT model |
| `tts_model` | String | No | Override the TTS model |
| `voice` | String | No | TTS voice for the audio response |

Example (non-streaming):

```bash
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@question.mp3" \
  -F "return_audio=true" \
  -F "conversation_id=conv_abc123"
```

Response (non-streaming):

```json
{
  "transcription": "What tools do you have available?",
  "content": "I have access to several tools including...",
  "conversation_id": "conv_abc123",
  "conversation_history": [...],
  "model": "gpt-4o",
  "audio": "base64_encoded_audio_data",
  "audio_content_type": "audio/mpeg"
}
```
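When `return_audio=true`, the `audio` field is base64-encoded. A small decoding sketch (field names follow the example response above; the helper itself is hypothetical):

```typescript
// Illustrative decoder for the non-streaming /v1/audio/chat response.
// Shape mirrors the example above; conversation_history/model omitted for brevity.
interface AudioChatResponse {
  transcription: string;
  content: string;
  conversation_id: string;
  audio?: string;              // base64-encoded when return_audio=true
  audio_content_type?: string; // e.g. "audio/mpeg"
}

function decodeAudio(res: AudioChatResponse): Buffer | null {
  if (!res.audio) return null;
  // Node's Buffer decodes base64 directly; write the result straight to disk
  // or hand it to an audio player on the device.
  return Buffer.from(res.audio, "base64");
}

// e.g. fs.writeFileSync("reply.mp3", decodeAudio(res)!);
```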

Example (streaming with SSE):

```bash
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@question.mp3" \
  -F "stream=true" \
  -F "return_audio=true"
```

SSE Events:

```text
data: {"type": "transcription", "data": {"text": "What tools do you have?"}}

data: {"type": "progress", "data": {"step": "thinking", "message": "Processing..."}}

data: {"type": "progress", "data": {"step": "tool_call", "toolName": "list_tools"}}

data: {"type": "done", "data": {"transcription": "...", "content": "...", "conversation_id": "...", "audio": "base64...", "audio_content_type": "audio/mpeg"}}
```
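A client consuming this stream needs to split on `data:` lines and decode each JSON payload. A minimal parser sketch, assuming one complete `data:` line per event as in the examples above:

```typescript
// Illustrative SSE parser (not from this PR). Event type names are taken
// from the example events above; unknown types pass through as strings.
interface SseEvent {
  type: "transcription" | "progress" | "done" | string;
  data: Record<string, unknown>;
}

function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const line of chunk.split("\n")) {
    // SSE payload lines are prefixed "data: "; blank lines separate events.
    if (line.startsWith("data: ")) {
      events.push(JSON.parse(line.slice(6)) as SseEvent);
    }
  }
  return events;
}
```

A real client would buffer partial chunks until a terminating blank line arrives; this sketch assumes whole lines.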

Configuration

The endpoints use the existing SpeakMCP configuration:

  • STT Provider: sttProviderId (openai or groq)
  • TTS Provider: ttsProviderId (openai, groq, or gemini)
  • API Keys: Uses configured keys for each provider
  • Base URLs: Respects custom base URLs if configured

Limits

  • Max file size: 25MB (matches OpenAI)
  • Supported formats: mp3, mp4, m4a, wav, webm, ogg, flac
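Clients can mirror these limits before uploading to avoid a round trip that is bound to fail with a 400. A sketch with the constants copied from the limits above (the helper name is illustrative):

```typescript
// Client-side pre-check mirroring the server limits (not part of this PR).
const MAX_BYTES = 25 * 1024 * 1024; // 25MB, matching the server limit
const FORMATS = new Set(["mp3", "mp4", "m4a", "wav", "webm", "ogg", "flac"]);

// Returns null when the file is acceptable, or a human-readable reason.
function validateUpload(filename: string, sizeBytes: number): string | null {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  if (!FORMATS.has(ext)) return `unsupported format: ${ext}`;
  if (sizeBytes > MAX_BYTES) return "file exceeds 25MB limit";
  return null;
}
```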

Authentication

All endpoints require the same Bearer token authentication as other remote server endpoints:

```text
Authorization: Bearer YOUR_REMOTE_SERVER_API_KEY
```

Error Handling

All endpoints return JSON errors:

```json
{
  "error": "Error message here"
}
```

Common HTTP status codes:

  • 400 - Bad request (missing file, invalid input)
  • 401 - Unauthorized (invalid/missing API key)
  • 500 - Server error (transcription/TTS failed)
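A client can fold this error shape into a single check. Sketch (illustrative names; assumes the JSON error body shown above, with a fallback for non-JSON bodies):

```typescript
// Illustrative error check (not from this PR): returns the body on success,
// otherwise throws with the server's "error" message when one is present.
function checkResponse(status: number, body: string): string {
  if (status === 200) return body;
  let message = `HTTP ${status}`;
  try {
    const parsed = JSON.parse(body);
    if (parsed.error) message = `HTTP ${status}: ${parsed.error}`;
  } catch {
    // Non-JSON error body; keep the generic message.
  }
  throw new Error(message);
}
```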

Testing

```bash
# Test transcription
curl -X POST http://localhost:3210/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test.mp3"

# Test speech generation
curl -X POST http://localhost:3210/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Testing TTS"}' \
  --output test-output.mp3

# Test full audio chat
curl -X POST http://localhost:3210/v1/audio/chat \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test.mp3" \
  -F "return_audio=true"
```

Pull Request opened by Augment Code with guidance from the PR author

Add three new audio endpoints for frontend device integration:

- POST /v1/audio/transcriptions - Speech-to-text (OpenAI-compatible)
- POST /v1/audio/speech - Text-to-speech (OpenAI-compatible)
- POST /v1/audio/chat - Combined audio-in -> agent -> audio-out

Features:
- Uses configured STT provider (OpenAI Whisper or Groq)
- Uses configured TTS provider (OpenAI, Groq, or Gemini)
- Supports streaming mode with SSE for real-time updates
- Conversation continuity via conversation_id
- 25MB file size limit (matches OpenAI)
- Supports mp3, mp4, m4a, wav, webm, ogg, flac formats

augmentcode bot commented Jan 10, 2026

🤖 Augment PR Summary

Summary: Adds OpenAI-compatible audio endpoints to the desktop remote server so frontend devices can send audio for STT and receive synthesized audio for TTS.

Changes:

  • Adds @fastify/multipart and configures a 25MB upload limit for audio files.
  • Introduces a transcription helper that forwards audio to the configured STT provider (OpenAI or Groq Whisper), including optional model/language/prompt.
  • Introduces a speech-generation helper for OpenAI/Groq/Gemini TTS, reusing existing TTS preprocessing + validation.
  • Adds POST /v1/audio/transcriptions and POST /v1/audio/speech endpoints compatible with OpenAI’s API shapes.
  • Adds a combined POST /v1/audio/chat endpoint (STT → agent → optional TTS) with optional SSE streaming and optional base64 audio payload.

Technical Notes: Endpoints reuse the existing remote-server Bearer token auth and respect configured provider base URLs and per-request overrides (model/voice/etc.).


@augmentcode bot left a comment

Review completed. 2 suggestions posted.

```typescript
const model = options?.model ||
  (config.sttProviderId === "groq" ? "whisper-large-v3-turbo" : "whisper-1")
form.append("model", model)
form.append("response_format", options?.response_format || "json")
```

`transcribeAudio()` forwards `response_format` to the upstream provider, but the helper always parses the response via `transcriptResponse.json()`. If a client requests `response_format=text` (which this PR advertises), the upstream response won't be JSON and this will likely throw/500.
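One way to address this, sketched here as a standalone function rather than taken from the codebase, is to branch on the requested format before parsing:

```typescript
// Illustrative fix sketch (not the PR's code): parse the upstream STT
// response according to the response_format the client asked for.
function parseTranscript(body: string, format: string): string {
  if (format === "text") {
    // Plain-text responses typically end with a newline; trim it.
    return body.trim();
  }
  // "json" and "verbose_json" both carry the transcript in a "text" field.
  return JSON.parse(body).text;
}
```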


```typescript
{
  method: "POST",
  headers: {
    Authorization: `Bearer ${config.sttProviderId === "groq" ? config.groqApiKey : config.openaiApiKey}`,
```

If `groqApiKey`/`openaiApiKey` is unset, this sends `Authorization: Bearer undefined` to the upstream STT provider. That can produce confusing failures (and can break OpenAI-compatible servers that expect no auth header).
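A possible shape for the fix (illustrative, not the PR's code): build the headers conditionally so `Bearer undefined` is never sent, or fail fast with a clear error before calling upstream:

```typescript
// Illustrative sketch: only attach Authorization when a key is configured.
function sttHeaders(apiKey: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  // Alternative: fail fast with a descriptive error instead of a silent omit:
  // if (!apiKey) throw new Error("STT provider API key is not configured");
  return headers;
}
```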

