# FAQ — ovos-stt-http-server

## General

**Q: What is ovos-stt-http-server?**
A: A FastAPI-based HTTP server that wraps any OVOS STT plugin and exposes it via a REST API. It also provides five vendor-compatible routers for drop-in use with OpenAI Whisper, Deepgram, Google Cloud STT, AssemblyAI, and Speechmatics clients.

**Q: What is the default port?**
A: The CLI defaults to `8080`. Override with `--port <number>`.

**Q: How do I start the server?**
A: `ovos-stt-server --engine <plugin-name> --port 8080`. The `--engine` flag is required.

**Q: Do I need API keys or credentials?**
A: No. All vendor-compatible routers accept auth headers and API-key query parameters but silently ignore them. No credentials are validated.

**Q: What Python version is required?**
A: Python 3.9 or later (pyproject.toml `requires-python = ">=3.9"`).

**Q: What audio format does `/stt` expect?**
A: Raw PCM bytes: 16 kHz, mono, 16-bit signed integer (int16). Pass `sample_rate` and `sample_width` query params if your audio differs from the defaults.
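
A minimal stdlib sketch of such an upload, assuming the server is running on `localhost:8080`; the generated sine tone stands in for real microphone audio, and the query params shown merely restate the defaults:

```python
import math
import struct
import urllib.request

# One second of 16 kHz mono int16 PCM (a 440 Hz tone) as stand-in audio.
SAMPLE_RATE = 16000
samples = [
    int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]
pcm = struct.pack("<%dh" % len(samples), *samples)

req = urllib.request.Request(
    "http://localhost:8080/stt?lang=en&sample_rate=16000&sample_width=2",
    data=pcm,
    headers={"Content-Type": "application/octet-stream"},
)
# The send needs a live server, so it is left commented here:
# text = urllib.request.urlopen(req).read().decode("utf-8")
```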

**Q: How do I configure CORS?**
A: CORS is unconditionally set to `allow_origins=["*"]`. There is no env-var override. All origins are permitted. See `create_app` — `ovos_stt_http_server/__init__.py:109`.

**Q: How do I enable automatic language detection?**
A: Pass `lang=auto` as a query parameter to `/stt`, or use the `/lang_detect` endpoint directly. A `lang_plugin` must be provided at startup (`--lang-engine`).

**Q: What is `--multi` mode?**
A: `--multi` loads one `MultiModelContainer` (`__init__.py:57`) that instantiates a separate plugin instance per language code on first use. Useful for multilingual deployments with language-specific models.

**Q: How do I specify the STT plugin?**
A: Pass `--engine <plugin-name>` to the CLI. The plugin must be installed and discoverable via `ovos-plugin-manager`.

**Q: What plugins are supported?**
A: Any plugin registered under the `opm.plugin.stt` entry point group. Install the plugin package and reference it by its entry point name.

**Q: What does `/status` return?**
A: `{"status": "ok", "plugin": "<engine-name>", "lang_plugin": "<lang-engine-name-or-null>"}` — see the `stats` handler in `__init__.py:142`.

**Q: Is Gradio UI supported?**
A: No. Gradio support was removed. The server is a pure REST API.

---

## OpenAI Whisper Compatible Clients

**Q: Which OpenAI Whisper clients work with this server?**
A: Any client that POSTs to `/v1/audio/transcriptions` or `/v1/audio/translations` with multipart form data works. This includes the official `openai` Python SDK, `whisper-client`, and raw `curl` commands.

**Q: How do I use the OpenAI Python SDK against this server?**
A: Set `base_url="http://localhost:8080/openai"` when constructing the `OpenAI` client. The `api_key` parameter is accepted but ignored.

**Q: What `response_format` values are supported?**
A: `json` (default), `text`, `srt`, `vtt`, and `verbose_json`. See [docs/response-formats.md](docs/response-formats.md).

**Q: Does `verbose_json` return real word-level segments?**
A: No. The `segments` field is always an empty list. `task`, `language`, `duration`, and `text` are populated.

**Q: Does the translations endpoint really translate audio?**
A: No — it calls the same STT engine as transcriptions but forces `language=en`. Translation between languages is not performed; the engine transcribes with English as the target hint.

---

## Deepgram Compatible Clients

**Q: Which Deepgram clients work with this server?**
A: Any client that POSTs raw audio bytes to `/v1/listen`. The official `deepgram-sdk` Python package works when its base URL is overridden.

**Q: How is audio parsed for the Deepgram endpoint?**
A: The raw request body is wrapped in `AudioData(body, 16000, 2)` — no format detection. Send WAV or raw PCM at 16 kHz 16-bit mono for best results.
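
For example, using only the standard library (0.1 s of in-memory silence stands in for a real clip; the send is commented out because it needs a live server):

```python
import io
import struct
import urllib.request
import wave

# Wrap 0.1 s of silence as a 16 kHz mono 16-bit WAV, entirely in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<1600h", *([0] * 1600)))

req = urllib.request.Request(
    "http://localhost:8080/v1/listen?language=en",
    data=buf.getvalue(),
    headers={"Content-Type": "audio/wav"},
)
# body = urllib.request.urlopen(req).read()
```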

**Q: Does `punctuate=true` add punctuation?**
A: No. The `punctuate` query parameter is accepted and ignored. Punctuation depends on the underlying STT plugin.

**Q: What does the Deepgram `words` array contain?**
A: An empty list. Word-level timing is not implemented.

---

## Google Speech-to-Text Compatible Clients

**Q: Which Google STT clients work?**
A: Any client that POSTs to `/v1/speech:recognize` with a JSON body containing `config` and `audio.content` (base64-encoded audio).
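
A stdlib sketch of such a request; the placeholder bytes stand in for real WAV data, `en-US` is just an example language code, and the send is commented out because it needs a live server:

```python
import base64
import json
import urllib.request

# Placeholder bytes; in practice, read a real WAV file from disk.
audio_bytes = b"RIFF\x00\x00\x00\x00WAVE"

body = json.dumps({
    "config": {"languageCode": "en-US"},
    "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/v1/speech:recognize",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = json.loads(urllib.request.urlopen(req).read())
```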

**Q: Are GCS URIs (`gs://...`) supported?**
A: No. The server returns HTTP 501 if `audio.uri` is set. Use `audio.content` with base64-encoded audio.

**Q: Does the `encoding` field matter?**
A: No — the server attempts to parse uploaded bytes as WAV regardless of the `encoding` field value, then falls back to raw PCM.

---

## AssemblyAI Stub Behavior

**Q: Why does the AssemblyAI GET transcript endpoint always return `status: error`?**
A: This server is synchronous. Transcription completes in the POST response. No job store persists between requests, so GET by ID cannot retrieve prior results.

**Q: Do I need to poll for results like the real AssemblyAI API?**
A: No — the POST response already contains `status: completed` and the `text` field. Read the result directly from the POST response.
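
For example (the `/v2/transcript` path mirrors the real AssemblyAI API and is an assumption here; the placeholder bytes stand in for real WAV data, and the send is commented out):

```python
import base64
import json
import urllib.request

# Base64-encode the audio into the `audio` field; this server does not
# fetch `audio_url`.
payload = json.dumps({
    "audio": base64.b64encode(b"RIFF\x00\x00\x00\x00WAVE").decode("ascii"),
    "language_code": "en",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/v2/transcript",  # assumed mount path
    data=payload,
    headers={"Content-Type": "application/json"},
)
# result = json.loads(urllib.request.urlopen(req).read())
# result["status"] is "completed"; result["text"] holds the transcript.
```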

**Q: What happens if I send `audio_url` instead of `audio`?**
A: The server returns `status: error` with a message explaining that `audio_url` fetching is not supported. Encode your audio as base64 and put it in the `audio` field.

**Q: Is the `id` in the POST response reusable?**
A: No. The ID is a UUID generated per request. The GET endpoint ignores it and always returns an error stub.

---

## Speechmatics Behavior

**Q: How does the Speechmatics job model work on this server?**
A: Job creation (POST `/v1/jobs`) transcribes immediately and stores the result in an in-memory dict keyed by job ID. GET retrieves the result from that dict.

**Q: What happens if I GET a job that doesn't exist?**
A: HTTP 404 is returned: `{"detail": "Job '<id>' not found."}`.

**Q: Are job results preserved across server restarts?**
A: No. The `_jobs` dict (`speechmatics.py:13`) is in-memory only.

**Q: What `format` parameter does GET `/transcript` accept?**
A: The `format` query param is accepted and ignored. The response is always Speechmatics JSON v2.9 format.

---

## Audio Format

**Q: What audio formats are supported?**
A: WAV is supported natively via the stdlib. MP3, OGG, FLAC, M4A, and WebM require `pydub` (`pip install pydub`). See [docs/audio-formats.md](docs/audio-formats.md).

**Q: What happens if I upload a non-WAV file without pydub installed?**
A: HTTP 501 is returned with a message indicating that the format requires pydub.

**Q: What sample rate and bit depth should I use?**
A: 16 kHz, mono, 16-bit (int16). The server resamples non-WAV files via pydub to match these parameters.

---

## Language

**Q: How do I specify the transcription language?**
A: Each compat router has its own mechanism: the `language` form field (Whisper), the `?language=` query param (Deepgram), the `config.languageCode` JSON field (Google), the `language_code` JSON field (AssemblyAI), and `transcription_config.language` in the job config JSON (Speechmatics).

**Q: What happens if no language is specified?**
A: Defaults vary per router: Deepgram, AssemblyAI, and Speechmatics default to `en`; the Whisper router passes `None`, which the engine receives as `"auto"`.

**Q: Does language auto-detection work with compat routers?**
A: Not directly. Use the native `/lang_detect` endpoint, or start the server with `--lang-engine` to enable automatic language detection in the underlying engine.