@@ -91,6 +91,7 @@ Describe available services.
             * `installed` - true if currently installed (bool, required)
             * `description` - human-readable description (string, optional)
             * `version` - version of the model (string, optional)
+        * `supports_transcript_streaming` - true if program can stream transcript chunks
     * `tts` - list text to speech services (optional)
         * `models` - list of available models
             * `name` - unique name (required)
@@ -103,6 +104,7 @@ Describe available services.
             * `installed` - true if currently installed (bool, required)
             * `description` - human-readable description (string, optional)
             * `version` - version of the model (string, optional)
+        * `supports_synthesize_streaming` - true if program can stream text chunks
     * `wake` - list wake word detection services (optional)
         * `models` - list of available models (required)
             * `name` - unique name (required)
@@ -123,6 +125,7 @@ Describe available services.
             * `installed` - true if currently installed (bool, required)
             * `description` - human-readable description (string, optional)
             * `version` - version of the model (string, optional)
+        * `supports_handled_streaming` - true if program can stream response chunks
     * `intent` - list intent recognition services (optional)
         * `models` - list of available models (required)
             * `name` - unique name (required)
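A client could use the three new flags above to decide whether to request streaming. The following is a minimal sketch, assuming the `info` payload is a plain dict and that each flag sits on the individual service entries. The `asr` and `handle` keys, the `streaming_support` helper, and the sample payload are assumptions for illustration and do not appear in these hunks.

```python
from typing import Any

# Flag names are taken from the diff. The "asr" and "handle" keys are
# assumptions: only "tts", "wake", and "intent" are visible in these hunks.
STREAMING_FLAGS = {
    "asr": "supports_transcript_streaming",
    "tts": "supports_synthesize_streaming",
    "handle": "supports_handled_streaming",
}


def streaming_support(info: dict[str, Any]) -> dict[str, bool]:
    """Report, per service type, whether any listed program can stream."""
    support = {}
    for service, flag in STREAMING_FLAGS.items():
        programs = info.get(service) or []
        # A missing flag is treated as False, which is an assumption
        # about programs that predate this change.
        support[service] = any(bool(p.get(flag, False)) for p in programs)
    return support


# Hypothetical `info` payload, for illustration only.
example_info = {
    "asr": [{"name": "example-asr", "supports_transcript_streaming": True}],
    "tts": [{"name": "example-tts"}],
}
example_support = streaming_support(example_info)
print(example_support)
```

Defaulting to `False` keeps the non-streaming path as the fallback when a program does not advertise the flag.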
@@ -160,8 +163,19 @@ Transcribe audio into text.
     * `context` - context from previous interactions (object, optional)
 * `transcript` - response with transcription
     * `text` - text transcription of spoken audio (string, required)
+    * `language` - language of transcript (string, optional)
     * `context` - context for next interaction (object, optional)
 
+Streaming:
+
+1. `transcript-start` - starts stream
+    * `language` - language of transcript (string, optional)
+    * `context` - context from previous interactions (object, optional)
+2. `transcript-chunk`
+    * `text` - part of transcript (string, required)
+3. Original `transcript` event must be sent for backwards compatibility
+4. `transcript-stop` - end of stream
+
 ### Text to Speech
 
 Synthesize audio from text.
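For the transcript streaming events added in the hunk above, a receiver might look roughly like the sketch below. It assumes events arrive as `(type, data)` tuples in a finite iterable, which is a hypothetical shape chosen for illustration, and it assumes that concatenating the `text` of each `transcript-chunk` yields the full transcript, which the hunk does not state. `collect_transcript` is an illustrative name.

```python
from typing import Any, Iterable, Optional

# Hypothetical event shape for this sketch: (event type, data).
Event = tuple[str, dict[str, Any]]


def collect_transcript(events: Iterable[Event]) -> Optional[str]:
    """Return the final transcript from a streaming or legacy event list."""
    chunks: list[str] = []
    full_text: Optional[str] = None
    for event_type, data in events:
        if event_type == "transcript-start":
            chunks.clear()
        elif event_type == "transcript-chunk":
            chunks.append(data["text"])
        elif event_type == "transcript":
            # The backwards-compatible event carries the whole text.
            full_text = data["text"]
        elif event_type == "transcript-stop":
            break
    if full_text is not None:
        return full_text
    # Assumption: chunks concatenate to the full transcript.
    return "".join(chunks) if chunks else None


streamed = collect_transcript(
    [
        ("transcript-start", {"language": "en"}),
        ("transcript-chunk", {"text": "turn on "}),
        ("transcript-chunk", {"text": "the light"}),
        ("transcript", {"text": "turn on the light"}),
        ("transcript-stop", {}),
    ]
)
legacy = collect_transcript([("transcript", {"text": "hello"})])
print(streamed, "|", legacy)
```

Preferring the original `transcript` event when it is present means the same function handles servers that do not stream.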
@@ -172,6 +186,20 @@ Synthesize audio from text.
         * `name` - name of voice (string, optional)
         * `language` - language of voice (string, optional)
         * `speaker` - speaker of voice (string, optional)
+
+Streaming:
+
+1. `synthesize-start` - starts stream
+    * `context` - context from previous interactions (object, optional)
+    * `voice` - use a specific voice (optional)
+        * `name` - name of voice (string, optional)
+        * `language` - language of voice (string, optional)
+        * `speaker` - speaker of voice (string, optional)
+2. `synthesize-chunk`
+    * `text` - part of text to synthesize (string, required)
+3. Original `synthesize` message must be sent for backwards compatibility
+4. `synthesize-stop` - end of stream, final audio must be sent
+5. `synthesize-stopped` - sent back to server after final audio
 
 ### Wake Word
 
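The Text to Speech streaming messages in the hunk above could be produced on the sending side roughly as follows. This is a sketch under stated assumptions: events are modelled as `(type, data)` tuples, the `text` field on the original `synthesize` message is assumed (it is not visible in this hunk), and the full text is assumed to be the concatenation of the chunks. `synthesize_stream` is an illustrative name.

```python
from typing import Any, Iterator, Optional

# Hypothetical event shape for this sketch: (event type, data).
Event = tuple[str, dict[str, Any]]


def synthesize_stream(
    text_chunks: list[str], voice: Optional[dict[str, Any]] = None
) -> Iterator[Event]:
    """Yield the request-side events for one streamed synthesis."""
    shared: dict[str, Any] = {}
    if voice is not None:
        shared["voice"] = voice
    yield ("synthesize-start", dict(shared))
    for chunk in text_chunks:
        yield ("synthesize-chunk", {"text": chunk})
    # Backwards-compatible message. The "text" field and joining the
    # chunks into the full text are both assumptions.
    yield ("synthesize", {"text": "".join(text_chunks), **shared})
    yield ("synthesize-stop", {})


tts_events = list(synthesize_stream(["Hello ", "world."], voice={"name": "example"}))
tts_types = [event_type for event_type, _ in tts_events]
print(tts_types)
```

The receiving side would then send its final audio and reply with `synthesize-stopped`, which this sketch does not model.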
@@ -222,6 +250,15 @@ Handle structured intents or text directly.
     * `text` - response for user (string, optional)
     * `context` - context for next interactions (object, optional)
 
+Streaming:
+
+1. `handled-start` - starts stream
+    * `context` - context from previous interactions (object, optional)
+2. `handled-chunk`
+    * `text` - part of response (string, required)
+3. Original `handled` message must be sent for backwards compatibility
+4. `handled-stop` - end of stream
+
 ### Audio Output
 
 Play audio stream.
@@ -295,8 +332,23 @@ Pipelines are run on the server, but can be triggered remotely from the server a
 3. &rarr; `audio-chunk` (required)
     * Send audio chunks until silence is detected
 4. &rarr; `audio-stop` (required)
-5. &larr; `transcript`
+5. &larr; `transcript` (required)
     * Contains text transcription of spoken audio
+
+Streaming:
+
+1. &rarr; `transcribe` event (optional)
+2. &rarr; `audio-start` (required)
+3. &rarr; `audio-chunk` (required)
+    * Send audio chunks until silence is detected
+4. &larr; `transcript-start` (required)
+5. &larr; `transcript-chunk` (required)
+    * Send transcript chunks as they're produced
+6. &rarr; `audio-stop` (required)
+7. &larr; `transcript` (required)
+    * Sent for backwards compatibility
+8. &larr; `transcript-stop` (required)
+
 
 ### Text to Speech
 
@@ -306,6 +358,22 @@ Pipelines are run on the server, but can be triggered remotely from the server a
     * One or more audio chunks
 4. &larr; `audio-stop`
 
+Streaming:
+
+1. &rarr; `synthesize-start` event (required)
+2. &rarr; `synthesize-chunk` event (required)
+    * Text chunks are sent as they're produced
+3. &larr; `audio-start`, `audio-chunk` (one or more), `audio-stop`
+    * Audio chunks are sent as they're produced with start/stop
+4. &rarr; `synthesize` event
+    * Sent for backwards compatibility
+5. &rarr; `synthesize-stop` event
+    * End of text stream
+6. &larr; Final audio must be sent
+    * `audio-start`, `audio-chunk` (one or more), `audio-stop`
+7. &larr; `synthesize-stopped`
+    * Tells server that final audio has been sent
+
 ### Wake Word Detection
 
 1. &rarr; `detect` event with `names` of wake words to detect (optional)
@@ -348,6 +416,16 @@ For text only:
 2. &larr; `handled` if successful
 3. &larr; `not-handled` if not successful
 
+Streaming text only (successful):
+
+1. &rarr; `transcript` with `text` to handle (required)
+2. &larr; `handled-start` (required)
+3. &larr; `handled-chunk` (required)
+    * Chunk of response text
+4. &larr; `handled` (required)
+    * Sent for backwards compatibility
+5. &larr; `handled-stop` (required)
+
 ### Audio Output
 
 1. &rarr; `audio-start` (required)
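The three streaming flows added in this diff share one ordering: a `-start` event, chunk events, the original event for backwards compatibility, then a `-stop` event. A small checker for that ordering might look like the sketch below. It assumes at least one chunk is sent and that the caller has already filtered out interleaved audio events; `is_valid_stream` is an illustrative name, not part of the protocol.

```python
def is_valid_stream(prefix: str, event_types: list[str]) -> bool:
    """Check start, one or more chunks, original event, stop, in order."""
    # Shortest valid sequence: start, one chunk, original event, stop.
    if len(event_types) < 4:
        return False
    first, *chunk_types, compat, last = event_types
    return (
        first == f"{prefix}-start"
        and all(t == f"{prefix}-chunk" for t in chunk_types)
        and compat == prefix
        and last == f"{prefix}-stop"
    )


ok_handled = is_valid_stream(
    "handled",
    ["handled-start", "handled-chunk", "handled-chunk", "handled", "handled-stop"],
)
missing_compat = is_valid_stream(
    "transcript",
    ["transcript-start", "transcript-chunk", "transcript-chunk", "transcript-stop"],
)
print(ok_handled, missing_compat)
```

For `synthesize` this covers only the request side; the `synthesize-stopped` reply and the audio events travel in the other direction and are outside the check.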