Commit cfb5c7e

Merge pull request #107 from second-state/docs-config
Add config section
2 parents 5322b26 + e1a68da

21 files changed: +752 -481 lines

doc/docs/config/_category_.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@

```json
{
  "label": "Config guide",
  "position": 3,
  "link": {
    "type": "generated-index",
    "description": "In this chapter, you'll learn how to configure the EchoKit server."
  }
}
```

doc/docs/config/asr.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@

---
sidebar_position: 2
---

# Voice to text services (ASR)

The EchoKit server supports popular ASR providers.

| Platform | URL example | Notes |
| ------------- | ------------- | ---- |
| `openai` | `https://api.openai.com/v1/audio/transcriptions` | Supports endpoint URLs from any OpenAI-compatible service, such as Groq and Open Router. |
| `paraformer_v2` | `wss://dashscope.aliyuncs.com/api-ws/v1/inference` | A WebSocket streaming ASR service endpoint provided by Ali Cloud |

## OpenAI and compatible services

The OpenAI `/v1/audio/transcriptions` API is supported by OpenAI, Open Router, Groq, Azure, AWS, and many other providers.
This is a non-streaming service endpoint, meaning that the EchoKit server must determine when the user is done
talking (via a VAD service), and then submit the entire audio to get a transcription.
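The non-streaming flow can be sketched in Python. This is only an illustrative sketch, not EchoKit's actual code: the helper name, the `utterance.wav` file name, and the dummy audio bytes are assumptions. It assembles the pieces of an OpenAI-style multipart request; the commented `requests.post` line shows how they would be sent.

```python
def build_transcription_request(url: str, api_key: str, model: str,
                                lang: str, audio: bytes) -> dict:
    """Assemble an OpenAI-style /v1/audio/transcriptions request."""
    return {
        "url": url,
        "headers": {"Authorization": f"Bearer {api_key}"},
        # Multipart form fields understood by the transcriptions endpoint.
        "data": {"model": model, "language": lang},
        # The whole utterance is uploaded at once -- no streaming.
        "files": {"file": ("utterance.wav", audio, "audio/wav")},
    }

req = build_transcription_request(
    "https://api.groq.com/openai/v1/audio/transcriptions",
    "gsk_ABCD", "whisper-large-v3", "en", b"\x00" * 16)

# With the `requests` package installed, this would perform the upload:
#   requests.post(req["url"], headers=req["headers"],
#                 data=req["data"], files=req["files"]).json()
```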
OpenAI example:

```toml
[asr]
platform = "openai"
url = "https://api.openai.com/v1/audio/transcriptions"
api_key = "sk_ABCD"
model = "gpt-4o-mini-transcribe"
lang = "en"
vad_url = "http://localhost:9093/v1/audio/vad"
```

Groq example:

```toml
[asr]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/transcriptions"
api_key = "gsk_ABCD"
model = "whisper-large-v3"
lang = "en"
prompt = "Hello\n你好\n(noise)\n(bgm)\n(silence)\n"
vad_url = "http://localhost:9093/v1/audio/vad"
```

Notice that in both examples, we are using a locally hosted VAD service to detect when the user is finished speaking. The VAD service is optional, and you can [learn about it here](../server/vad.md).
## Ali Cloud streaming ASR

The [Bailian service](https://bailian.console.aliyun.com/) from Ali Cloud provides excellent ASR models for Chinese language recognition.
It is also a streaming ASR service -- it takes an audio stream as input and
sends back text and voice activity events as they happen. There is no need for a separate VAD service in this case.

```toml
[asr]
platform = "paraformer_v2"
url = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
paraformer_token = "sk-API-KEY"
```
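To make the contrast with the non-streaming OpenAI endpoint concrete, here is a toy Python loop over stand-in streaming events. The event dictionaries and field names are invented for illustration and do not reflect DashScope's actual wire protocol; the point is only that text and voice-activity events arrive interleaved, so the service's own end-of-speech signal replaces a separate VAD pass.

```python
def handle_stream(events):
    """Collect committed text; stop when the service signals end of speech."""
    text = []
    for ev in events:
        if ev["type"] == "partial":
            continue                      # intermediate hypothesis, ignored
        elif ev["type"] == "final":
            text.append(ev["text"])       # committed transcript segment
        elif ev["type"] == "speech_end":  # service-side VAD fired
            break
    return " ".join(text)

# Stand-in events such as a WebSocket client might yield.
events = [
    {"type": "partial", "text": "你"},
    {"type": "final", "text": "你好"},
    {"type": "speech_end"},
]
print(handle_stream(events))  # → 你好
```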
## ElevenLabs streaming ASR

Coming soon ...
Lines changed: 8 additions & 18 deletions
@@ -4,9 +4,9 @@ sidebar_position: 5

# Configure an End-to-End Pipeline for EchoKit

-In addition to the classic [ASR-LLM-TTS pipeline](./configure-echokit-server.md), EchoKit supports real-time models that can reduce latency. However, this approach has several limitations:
+EchoKit supports real-time models that can reduce latency. However, this approach has several limitations:

-* **High API costs** – OpenAI's real-time API can cost up to $25 per 100 tokens
+* **High API costs** – Real-time APIs can cost up to $25 per million tokens
* **No voice customization** – You cannot modify the generated voice
* **Limited knowledge integration** – External knowledge bases cannot be added to the model
* **No MCP support** – Model Control Protocol is not supported in most cases

@@ -15,9 +15,9 @@ In addition to the classic [ASR-LLM-TTS pipeline](./configure-echokit-server.md)

Before setting up your end-to-end pipeline, ensure you have:

-* **EchoKit server source code** – Follow the [guide](./echokit-server.md) if you haven't already
+* **EchoKit server source code** – Follow the [guide](../get-started/echokit-server.md) if you haven't already
* **Gemini API key** – Obtain from [Google AI Studio](https://aistudio.google.com/)
-* **TTS service running** (optional) – If using custom voice synthesis
+* **TTS service** (optional)

## Gemini API Setup

@@ -36,7 +36,7 @@ Google's Gemini is one of the most advanced models supporting voice-to-voice int

Here's the complete configuration file for Gemini:

```toml
-addr = "0.0.0.0:9090"
+addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

[gemini]
@@ -49,16 +49,6 @@ You are a helpful assistant. Please answer user questions as concisely as possib

"""
```

-### Starting the Server
-
-After editing the configuration file, restart the EchoKit server to apply the changes.
-
-Since you're using a different `config.toml` file in a custom path, your restart command should look like this:
-
-```bash
-./target/release/echokit_server ./examples/gemini/chat/config.toml
-```
-
## Gemini + TTS (Custom Voice)

While real-time models typically don't allow voice customization, EchoKit enables you to customize the voice even when using Gemini!

@@ -68,14 +58,14 @@ While real-time models typically don't allow voice customization, EchoKit enable

Simply add TTS-related parameters to your `config.toml` file:

```toml
-addr = "0.0.0.0:9090"
+addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

[gemini]
api_key = "your_api_key_here"

[tts]
-platform = "StreamGSV"
+platform = "stream_gsv"
url = "http://localhost:9094/v1/audio/stream_speech"
speaker = "cooper"

@@ -86,4 +76,4 @@

"""
```

-With these TTS settings configured, you can now use your preferred custom voice.
+With these TTS settings configured, you can now use your preferred custom voice.

doc/docs/config/intro.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@

---
sidebar_position: 1
---

# EchoKit server config options

The EchoKit server orchestrates multiple AI services to turn user voice input into voice responses.
It generally takes one of two approaches.

* The pipeline approach. It divides the task into multiple steps and uses a different AI service for each step.
  * The [ASR service](asr.md) turns the user's voice audio into text.
  * The [LLM service](llm.md) generates a text response to the user input. The LLM can be aided by [built-in tools, such as web searches](llm-tools.md) and [custom tools in MCP servers](mcp.md).
  * The [TTS service](tts.md) converts the response text to voice.
* The end-to-end real-time model approach. It uses multimodal models that can directly ingest voice input and generate voice output, such as [Google Gemini Live](gemini-live.md).
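The pipeline approach is, at its core, a simple composition of three stages. The Python sketch below illustrates the data flow; the three stand-in functions are hypothetical placeholders returning canned values, not EchoKit's APIs, and only the composition order mirrors the steps above.

```python
def asr(audio: bytes) -> str:
    """Voice -> text (stand-in transcription)."""
    return "what's the weather?"

def llm(prompt: str) -> str:
    """Text -> text (stand-in response)."""
    return f"You asked: {prompt}"

def tts(text: str) -> bytes:
    """Text -> voice (stand-in audio: just the encoded text)."""
    return text.encode("utf-8")

def pipeline(audio: bytes) -> bytes:
    # Each stage can be backed by a different provider.
    return tts(llm(asr(audio)))

print(pipeline(b"\x00\x01"))  # → b"You asked: what's the weather?"
```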
The pipeline approach offers greater flexibility and customization - you can choose any voice, control costs by mixing different providers, integrate external knowledge, and run components locally for privacy. While end-to-end models can reduce latency, the classic pipeline gives you full control over each component.

You can configure how those AI services work together through the EchoKit server's `config.toml` file.

## Prerequisites

* Start an EchoKit server. Follow [the quick start guide](../get-started/echokit-server.md) if needed
* Obtain **API keys** for your favorite AI API providers (OpenAI, Groq, xAI, Open Router, ElevenLabs, Gemini, etc.)

## Configure server address and welcome audio

```toml
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"
```

* `addr`: The server's listening address and port
  * Use `0.0.0.0` to accept connections from any network interface
  * Make sure that your firewall allows incoming connections to the port (`8080` in this example)
* `hello_wav`: Optional welcome audio file played when a device connects
  * Supports 16kHz WAV format
  * Make sure that the file is in the same folder as `config.toml`
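A quick way to verify the welcome audio is a 16kHz WAV is Python's standard `wave` module. This is an illustrative sketch, not part of EchoKit: the `check_hello_wav` helper and the generated test tone are assumptions, included only so the check can be demonstrated offline.

```python
import math
import struct
import wave

def check_hello_wav(path: str) -> bool:
    """True if the file is a mono 16 kHz PCM WAV."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 16000 and w.getnchannels() == 1

# Generate a 0.1 s, 440 Hz test tone so the check can run without a real file.
frames = b"".join(
    struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / 16000)))
    for i in range(1600)
)
with wave.open("hello.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # required sample rate
    w.writeframes(frames)

print(check_hello_wav("hello.wav"))  # → True
```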
## Configure AI services

The rest of `config.toml` specifies how to use the different AI services. Each service is covered in its own chapter.

* The `[asr]` section configures the [voice-to-text](asr.md) services.
* The `[llm]` section configures the [large language model](llm.md) services, including [tools](llm-tools.md) and [MCP actions](mcp.md).
* The `[tts]` section configures the [text-to-voice](tts.md) services.

Note that each of these sections has the following fields.

* A `platform` field that designates the service protocol. A common example is `openai` for OpenAI-compatible API endpoints.
* A `url` field for the service endpoint. It is typically an `https://` or `wss://` URL. The latter is a WebSocket address for streaming services.
* Optional fields that are specific to the `platform`, such as `api_key` and `model`.
## Complete Configuration Example

You will need a free [API key from Groq](https://console.groq.com/keys).

```toml
# Server settings
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

# Speech recognition using the OpenAI transcriptions API, but hosted by Groq (instead of OpenAI)
[asr]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/transcriptions"
lang = "en"
api_key = "gsk_your_api_key_here"
model = "whisper-large-v3-turbo"

# Language model using the OpenAI chat completions API, but hosted by Groq (instead of OpenAI)
[llm]
platform = "openai_chat"
url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_your_api_key_here"
model = "gpt-oss-20b"
history = 10

# Text-to-speech using the OpenAI speech API, but hosted by Groq (instead of OpenAI)
[tts]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/speech"
api_key = "gsk_your_api_key_here"
model = "playai-tts"
voice = "Cooper-PlayAI"

# System personality
[[llm.sys_prompts]]
role = "system"
content = """
Your name is EchoKit, a helpful AI assistant. Provide clear, concise responses and maintain a friendly, professional tone. Keep answers brief but informative.
"""
```
