Commit 799a204

committed
* update doc of deploy speech recognition
1 parent 71ff031 commit 799a204

File tree

6 files changed

+456
-1
lines changed

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
---
title: MaixCAM MaixPy Deploy online speech recognition
update:
  - date: 2024-12-23
    author: lxowalle
    version: 1.0.0
    content: Initial document
---

## Introduction

Deploying online speech recognition locally is a solution for real-time processing of speech input. By running a speech recognition model on a local server that interacts with `MaixCAM`, audio data is processed and results are returned immediately, without relying on external cloud services. This approach not only improves response speed but also better protects user privacy, making it ideal for applications with high requirements on data security and real-time performance, such as smart hardware, industrial control, and real-time subtitle generation.

This document uses the open-source framework [`sherpa-onnx`](https://github.com/k2-fsa/sherpa-onnx) for deployment. `sherpa-onnx` is a subproject of `sherpa`, supporting tasks such as streaming and non-streaming speech recognition, text-to-speech, speaker classification, speaker recognition, speaker verification, and spoken language recognition. Below, we mainly introduce how to achieve streaming speech recognition using `MaixCAM` and `sherpa-onnx`.

> Note: Streaming speech recognition offers low latency and can recognize speech as it is spoken; it is commonly used in real-time translation and voice assistants. Non-streaming recognition processes a complete sentence at a time and offers higher accuracy.

## Deploying the Speech Recognition Server

`sherpa-onnx` supports deployment in multiple languages, including `C/C++`, `Python`, `Java`, and more. For simplicity, we will use `Python` for deployment. If you encounter any issues during the process, you can refer to the `sherpa` [documentation](https://k2-fsa.github.io/sherpa/intro.html). Let's get started!

#### Download the `sherpa-onnx` Repository

```shell
git clone https://github.com/k2-fsa/sherpa-onnx.git
```

#### Install Dependencies

```shell
pip install numpy
pip install websockets
```

#### Install the `sherpa-onnx` Package

```shell
pip install sherpa-onnx
```

If GPU support is required, install the CUDA-enabled package:

```shell
pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

# For users in China
# pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda-cn.html
```

If the package is unavailable or installation fails, build and install from the source:

```shell
cd sherpa-onnx
export SHERPA_ONNX_CMAKE_ARGS="-DSHERPA_ONNX_ENABLE_GPU=ON"
python3 setup.py install
```

If a GPU is available but `CUDA` is not installed, refer to the installation guide [`here`](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html).

#### Verify the Installation of `sherpa-onnx`

```shell
python3 -c "import sherpa_onnx; print(sherpa_onnx.__version__)"

# Expected output:
# sherpa-onnx or 1.10.16+cuda
```

#### Download the Model

[`Zipformer Bilingual Model for Mandarin and English: sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/.tar.bz2)

[`Paraformer Trilingual Model for Mandarin, Cantonese, and English: sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2)

> Note:
> For Chinese recognition, it is recommended to use the `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile` model
>
> For English recognition, it is recommended to use the `sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en` model

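The models are distributed as `.tar.bz2` archives. Below is a minimal sketch for downloading and unpacking the paraformer model into the `sherpa-onnx` directory (assuming `wget` and `tar` are available); the zipformer archive can be handled the same way:

```shell
cd sherpa-onnx
# Fetch and unpack the paraformer model archive linked above
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
tar xjf sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
# The extracted directory should contain encoder.onnx, decoder.onnx, and tokens.txt used below
ls sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en
```
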
#### Run the Server

`sherpa-onnx` provides a server example, so there's no need to write additional code. Follow these steps to start the server.

##### Run the `zipformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

##### Run the `paraformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en"
python3 ./python-api-examples/streaming_server.py \
  --paraformer-encoder ./${MODEL_PATH}/encoder.onnx \
  --paraformer-decoder ./${MODEL_PATH}/decoder.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

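For either model, the `--provider` option selects the inference backend. If the server machine has no GPU, a CPU-only run should also work; here is a sketch for the zipformer command, assuming `streaming_server.py` accepts `--provider "cpu"` (its usual default):

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cpu"
```
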
##### Example Log Output

```shell
2024-12-23 09:25:17,557 INFO [streaming_server.py:667] No certificate provided
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on [::]:6006
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on 0.0.0.0:6006
2024-12-23 09:25:17,561 INFO [streaming_server.py:693] Please visit one of the following addresses:

  http://localhost:6006

Since you are not providing a certificate, you cannot use your microphone from within the browser using public IP addresses. Only localhost can be used.You also cannot use 0.0.0.0 or 127.0.0.1
```

At this point, the ASR model server is up and running.

#### Communication Between `MaixCAM` and the Server

For brevity, example client code is provided via the following links. Note that in most cases the audio data must have a sampling rate of 16000Hz and a single channel:

[`MaixCAM` Streaming Recognition](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_streaming_websockt_client)

[`MaixCAM` Non-Streaming Recognition](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_non_streaming_websockt_client)

```python
# Update the server address
SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006
```

After updating the server address and port, use MaixVision to run the client. If you are using the streaming recognition script, try talking to `MaixCAM`.

> Note: This document does not elaborate on the communication protocol because it is straightforward: essentially raw data exchange over WebSocket. It is recommended to first experience the setup and then delve into the code for further details.

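As a reference, here is a minimal sketch of that exchange, modeled on the example clients linked above. It assumes a 16000Hz, mono, 16-bit `test.wav` next to the script and that the server above is reachable at `SERVER_ADDR:SERVER_PORT`:

```python
import asyncio
import json
import wave

import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

async def main():
    # Load the wav file and convert int16 PCM to normalized float32,
    # which is what streaming_server.py expects on the wire.
    with wave.open("test.wav") as f:
        assert f.getframerate() == 16000 and f.getnchannels() == 1
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = (pcm.astype(np.float32) / 32768).tobytes()

    async with websockets.connect(f"ws://{SERVER_ADDR}:{SERVER_PORT}") as ws:
        chunk = 8000 * 4  # 0.5 s of float32 samples per message
        for i in range(0, len(samples), chunk):
            await ws.send(samples[i:i + chunk])
        await ws.send("Done")      # tell the server the stream has ended
        async for message in ws:   # JSON results arrive until the server replies "Done!"
            if message == "Done!":
                break
            print(json.loads(message))

asyncio.run(main())
```
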
The deployment process is now complete.

docs/doc/en/sidebar.yaml

Lines changed: 2 additions & 1 deletion
@@ -107,7 +107,8 @@ items:
      label: Speech synthesis
    - file: pro/customize_model.md
      label: Customize new AI model
+   - file: audio/deploy_online_recognition.md
+     label: Deploy online speech recognition
  - label: Video
    items:
    - file: video/record.md
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
---
title: MaixCAM MaixPy Deploy an Online Speech Recognition Environment
update:
  - date: 2024-12-23
    author: lxowalle
    version: 1.0.0
    content: Initial document
---

## Introduction

Deploying online speech recognition locally is a solution for real-time processing of speech input. By running a speech recognition model on a local server that interacts with `MaixCAM`, audio data is processed and results are returned immediately, without relying on external cloud services. This approach not only improves response speed but also better protects user privacy, making it especially suitable for applications with high requirements on data security and real-time performance, such as smart hardware, industrial control, and real-time subtitle generation.

This document uses the open-source [`sherpa-onnx`](https://github.com/k2-fsa/sherpa-onnx) framework for deployment. `sherpa-onnx` is a subproject of `sherpa` and supports streaming speech recognition, non-streaming speech recognition, text-to-speech, speaker classification, speaker recognition, speaker verification, spoken language recognition, and more. The following mainly describes how to implement streaming speech recognition with `MaixCAM` and `sherpa-onnx`.

> Note: Streaming speech recognition offers low latency and can recognize speech as it is spoken; it is commonly used in real-time translation, voice assistants, and similar scenarios. Non-streaming recognition must process a complete sentence per inference and offers higher accuracy.

## Deploying the Speech Recognition Server

`sherpa-onnx` can be deployed in many languages, including `C/C++`, `Python`, `Java`, and more. For convenience, we deploy it with `Python`. If you have any questions during the following steps, you can first read through the `sherpa` [documentation](https://k2-fsa.github.io/sherpa/intro.html). Let's start deploying!

#### Download the `sherpa-onnx` Repository

```shell
git clone https://github.com/k2-fsa/sherpa-onnx.git
```

#### Install Dependencies

```shell
pip install numpy
pip install websockets
```

#### Install the `sherpa-onnx` Package

```shell
pip install sherpa-onnx
```


If you need to use a `GPU`, install the package built with `cuda`:

```shell
pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

# For users in China
# pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda-cn.html
```

If the package cannot be found or installation fails, you can build and install from source:

```shell
cd sherpa-onnx
export SHERPA_ONNX_CMAKE_ARGS="-DSHERPA_ONNX_ENABLE_GPU=ON"
python3 setup.py install
```

If you have a `GPU` but no `cuda` environment, follow the guide [`here`](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html) to install the matching `cuda` version.

#### Check that the `sherpa-onnx` Package Installed Successfully

```shell
python3 -c "import sherpa_onnx; print(sherpa_onnx.__version__)"

# Expected output:
# sherpa-onnx or 1.10.16+cuda
```

#### Download the Model

[`Bilingual Chinese-English zipformer model: sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/.tar.bz2)

[`Trilingual Chinese-Cantonese-English paraformer model: sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2)

> Note: For Chinese recognition, the `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile` model is recommended
>
> For English recognition, the `sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en` model is recommended

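The models are distributed as `.tar.bz2` archives. Below is a minimal sketch for downloading and unpacking the paraformer model into the `sherpa-onnx` directory (assuming `wget` and `tar` are available); the zipformer archive can be handled the same way:

```shell
cd sherpa-onnx
# Fetch and unpack the paraformer model archive linked above
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
tar xjf sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
# The extracted directory should contain encoder.onnx, decoder.onnx, and tokens.txt used below
ls sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en
```
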
#### Run the Server

`sherpa-onnx` provides server example code, so we don't need to reinvent the wheel and write our own code to try online speech recognition. See the examples below for how to start it.

##### Run the `zipformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

This example runs `streaming_server.py` as the server code. `--encoder`, `--decoder`, and `--joiner` specify the model files, `--tokens` is the token list used to map the model output, and `--provider` selects whether to use the `GPU`; by default the `CPU` is used.

##### Run the `paraformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en"
python3 ./python-api-examples/streaming_server.py \
  --paraformer-encoder ./${MODEL_PATH}/encoder.onnx \
  --paraformer-decoder ./${MODEL_PATH}/decoder.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

This example runs `streaming_server.py` as the server code. `--paraformer-encoder` and `--paraformer-decoder` specify the model files, `--tokens` is the token list used to map the model output, and `--provider` selects whether to use the `GPU`; by default the `CPU` is used.

##### Log Output After a Successful Start

```shell
2024-12-23 09:25:17,557 INFO [streaming_server.py:667] No certificate provided
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on [::]:6006
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on 0.0.0.0:6006
2024-12-23 09:25:17,561 INFO [streaming_server.py:693] Please visit one of the following addresses:

  http://localhost:6006

Since you are not providing a certificate, you cannot use your microphone from within the browser using public IP addresses. Only localhost can be used.You also cannot use 0.0.0.0 or 127.0.0.1
```

At this point, the ASR model server is up and running. Next, let's communicate with it.

#### Communicating with the Server from `MaixCAM`

To keep this document short, only links to the example client code are provided here; copy the code yourself. Note that in most cases the audio data must have a sampling rate of `16000Hz` and `1` channel:

Click [here](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_streaming_websockt_client) for the `MaixCAM` streaming recognition code

Click [here](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_non_streaming_websockt_client) for the `MaixCAM` non-streaming recognition code

```python
# Update the server address
SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006
```

After updating the server address and port, run the code with `MaixVision`. If you are running the streaming recognition code, try talking to `MaixCAM`!

> Note: One reason this document does not go into the client-server communication protocol is that it is very simple: essentially raw data exchange over a `websocket` connection. It is recommended to try it out first and then read the code directly to find the details you actually want to know.

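As a reference, here is a minimal sketch of that exchange, modeled on the example clients linked above. It assumes a 16000Hz, mono, 16-bit `test.wav` next to the script and that the server above is reachable at `SERVER_ADDR:SERVER_PORT`:

```python
import asyncio
import json
import wave

import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

async def main():
    # Load the wav file and convert int16 PCM to normalized float32,
    # which is what streaming_server.py expects on the wire.
    with wave.open("test.wav") as f:
        assert f.getframerate() == 16000 and f.getnchannels() == 1
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = (pcm.astype(np.float32) / 32768).tobytes()

    async with websockets.connect(f"ws://{SERVER_ADDR}:{SERVER_PORT}") as ws:
        chunk = 8000 * 4  # 0.5 s of float32 samples per message
        for i in range(0, len(samples), chunk):
            await ws.send(samples[i:i + chunk])
        await ws.send("Done")      # tell the server the stream has ended
        async for message in ws:   # JSON results arrive until the server replies "Done!"
            if message == "Done!":
                break
            print(json.loads(message))

asyncio.run(main())
```
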
The deployment is now complete.

docs/doc/zh/sidebar.yaml

Lines changed: 2 additions & 0 deletions
@@ -108,6 +108,8 @@ items:
      label: AI 声音分类
    - file: pro/customize_model.md
      label: 移植新模型
+   - file: audio/deploy_online_recognition.md
+     label: 部署在线语音识别环境

  - label: 视频
    items:
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
#!/usr/bin/env python3

from maix import audio
import asyncio
import json
import wave
import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

def read_wave(wave_filename: str) -> np.ndarray:
    # Read a 16 kHz, mono, 16-bit wav file and return normalized float32 samples
    with wave.open(wave_filename) as f:
        assert f.getframerate() == 16000, f.getframerate()
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)

        samples_float32 = samples_float32 / 32768
        return samples_float32

async def receive_results(socket):
    # Print recognition results from the server until it replies "Done!"
    last_message = ""
    async for message in socket:
        if message != "Done!":
            last_message = message
            print(json.loads(message))
        else:
            break
    return last_message


async def run(
    server_addr: str,
    server_port: int,
    wave_filename: str,
    samples_per_message: int,
    seconds_per_message: float,
):
    data = read_wave(wave_filename)

    async with websockets.connect(
        f"ws://{server_addr}:{server_port}"
    ) as websocket:  # noqa
        receive_task = asyncio.create_task(receive_results(websocket))

        # Send the float32 samples in small chunks to mimic a real-time stream
        start = 0
        while start < data.shape[0]:
            end = start + samples_per_message
            end = min(end, data.shape[0])
            d = data.data[start:end].tobytes()

            await websocket.send(d)
            await asyncio.sleep(seconds_per_message)
            start += samples_per_message

        # Tell the server the stream has ended, then wait for the final results
        await websocket.send("Done")
        await receive_task

async def main():
    # Record 3 seconds of 16 kHz mono audio on the MaixCAM, then upload it for recognition
    wav_path = "/tmp/test.wav"
    recorder = audio.Recorder(wav_path, sample_rate=16000, channel=1)
    recorder.volume(100)
    print('Please Speak..')
    recorder.record(3 * 1000)
    recorder.finish()
    print('Recording complete, upload wav file to server..')

    await run(
        server_addr=SERVER_ADDR,
        server_port=SERVER_PORT,
        wave_filename=wav_path,
        samples_per_message=8000,
        seconds_per_message=0.1,
    )

if __name__ == "__main__":
    asyncio.run(main())
