Commit 799a204

committed
* update doc of deploy speech recognition
1 parent 71ff031 commit 799a204

File tree

6 files changed

+456
-1
lines changed

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
---
title: MaixCAM MaixPy Deploy online speech recognition
update:
  - date: 2024-12-23
    author: lxowalle
    version: 1.0.0
    content: Initial document
---

## Introduction

Deploying online speech recognition locally is a solution for real-time processing of speech input. By running a speech recognition model on a local server that interacts with `MaixCAM`, audio data is processed and results are returned immediately, without relying on external cloud services. This approach not only improves response speed but also better protects user privacy, making it ideal for applications with high requirements on data security and real-time performance, such as smart hardware, industrial control, and real-time subtitle generation.

This document uses the open-source framework [`sherpa-onnx`](https://github.com/k2-fsa/sherpa-onnx) for deployment. `sherpa-onnx` is a subproject of `sherpa`, supporting tasks such as streaming and non-streaming speech recognition, text-to-speech, speaker classification, speaker recognition, speaker verification, and spoken language recognition. Below, we mainly introduce how to achieve streaming speech recognition using `MaixCAM` and `sherpa-onnx`.

> Note: Streaming speech recognition offers low latency and can recognize speech as it is spoken; it is commonly used in real-time translation and voice assistants. Non-streaming recognition processes a complete sentence at a time and offers higher accuracy.

## Deploying the Speech Recognition Server

`sherpa-onnx` supports deployment in multiple languages, including `C/C++`, `Python`, `Java`, and more. For simplicity, we will use `Python` for deployment. If you encounter any issues during the process, you can refer to the `sherpa` [documentation](https://k2-fsa.github.io/sherpa/intro.html). Let's get started!

#### Download the `sherpa-onnx` Repository

```shell
git clone https://github.com/k2-fsa/sherpa-onnx.git
```

#### Install Dependencies

```shell
pip install numpy
pip install websockets
```

#### Install the `sherpa-onnx` Package

```shell
pip install sherpa-onnx
```

If GPU support is required, install the CUDA-enabled package:

```shell
pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

# For users in China
# pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda-cn.html
```

If the package is unavailable or installation fails, build and install from the source:

```shell
cd sherpa-onnx
export SHERPA_ONNX_CMAKE_ARGS="-DSHERPA_ONNX_ENABLE_GPU=ON"
python3 setup.py install
```

If a GPU is available but `CUDA` is not installed, refer to the installation guide [`here`](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html).

#### Verify the Installation of `sherpa-onnx`

```shell
python3 -c "import sherpa_onnx; print(sherpa_onnx.__version__)"

# Expected output:
# sherpa-onnx or 1.10.16+cuda
```

#### Download the Model

[`Zipformer Bilingual Model for Mandarin and English: sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/.tar.bz2)

[`Paraformer Trilingual Model for Mandarin, Cantonese, and English: sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2)

> Note:
> For Chinese recognition, it is recommended to use the `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile` model
>
> For English recognition, it is recommended to use the `sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en` model

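The models are distributed as `.tar.bz2` archives. Below is a minimal sketch for downloading and unpacking the paraformer model into the `sherpa-onnx` directory (assuming `wget` and `tar` are available); the zipformer archive can be handled the same way:

```shell
cd sherpa-onnx
# Fetch and unpack the paraformer model archive linked above
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
tar xjf sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
# The extracted directory should contain encoder.onnx, decoder.onnx, and tokens.txt used below
ls sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en
```
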
#### Run the Server

`sherpa-onnx` provides a server example, so there's no need to write additional code. Follow these steps to start the server.

##### Run the `zipformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

##### Run the `paraformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en"
python3 ./python-api-examples/streaming_server.py \
  --paraformer-encoder ./${MODEL_PATH}/encoder.onnx \
  --paraformer-decoder ./${MODEL_PATH}/decoder.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

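For either model, the `--provider` option selects the inference backend. If the server machine has no GPU, a CPU-only run should also work; here is a sketch for the zipformer command, assuming `streaming_server.py` accepts `--provider "cpu"` (its usual default):

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cpu"
```
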
##### Example Log Output

```shell
2024-12-23 09:25:17,557 INFO [streaming_server.py:667] No certificate provided
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on [::]:6006
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on 0.0.0.0:6006
2024-12-23 09:25:17,561 INFO [streaming_server.py:693] Please visit one of the following addresses:

  http://localhost:6006

Since you are not providing a certificate, you cannot use your microphone from within the browser using public IP addresses. Only localhost can be used.You also cannot use 0.0.0.0 or 127.0.0.1
```

At this point, the ASR model server is up and running.

#### Communication Between `MaixCAM` and the Server

For brevity, example client code is provided via the following links. Note that in most cases the audio data must have a sampling rate of 16000Hz and a single channel:

[`MaixCAM` Streaming Recognition](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_streaming_websockt_client)

[`MaixCAM` Non-Streaming Recognition](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_non_streaming_websockt_client)

```python
# Update the server address
SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006
```

After updating the server address and port, use MaixVision to run the client. If you are using the streaming recognition script, try talking to `MaixCAM`.

> Note: This document does not elaborate on the communication protocol because it is straightforward: essentially raw data exchange over WebSocket. It is recommended to first experience the setup and then delve into the code for further details.

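As a reference, here is a minimal sketch of that exchange, modeled on the example clients linked above. It assumes a 16000Hz, mono, 16-bit `test.wav` next to the script and that the server above is reachable at `SERVER_ADDR:SERVER_PORT`:

```python
import asyncio
import json
import wave

import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

async def main():
    # Load the wav file and convert int16 PCM to normalized float32,
    # which is what streaming_server.py expects on the wire.
    with wave.open("test.wav") as f:
        assert f.getframerate() == 16000 and f.getnchannels() == 1
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = (pcm.astype(np.float32) / 32768).tobytes()

    async with websockets.connect(f"ws://{SERVER_ADDR}:{SERVER_PORT}") as ws:
        chunk = 8000 * 4  # 0.5 s of float32 samples per message
        for i in range(0, len(samples), chunk):
            await ws.send(samples[i:i + chunk])
        await ws.send("Done")      # tell the server the stream has ended
        async for message in ws:   # JSON results arrive until the server replies "Done!"
            if message == "Done!":
                break
            print(json.loads(message))

asyncio.run(main())
```
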
The deployment process is now complete.

docs/doc/en/sidebar.yaml

Lines changed: 2 additions & 1 deletion
@@ -107,7 +107,8 @@ items:
      label: Speech synthesis
    - file: pro/customize_model.md
      label: Customize new AI model
+   - file: audio/deploy_online_recognition.md
+     label: Deploy online speech recognition
  - label: Video
    items:
    - file: video/record.md
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
---
title: MaixCAM MaixPy Deploy an Online Speech Recognition Environment
update:
  - date: 2024-12-23
    author: lxowalle
    version: 1.0.0
    content: Initial document
---

## Introduction

Deploying online speech recognition locally is a solution for real-time processing of speech input. By running a speech recognition model on a local server that interacts with `MaixCAM`, audio data is processed and results are returned immediately, without relying on external cloud services. This approach not only improves response speed but also better protects user privacy, making it especially suitable for applications with high requirements on data security and real-time performance, such as smart hardware, industrial control, and real-time subtitle generation.

This document uses the open-source [`sherpa-onnx`](https://github.com/k2-fsa/sherpa-onnx) framework for deployment. `sherpa-onnx` is a subproject of `sherpa` and supports streaming speech recognition, non-streaming speech recognition, text-to-speech, speaker classification, speaker recognition, speaker verification, spoken language recognition, and more. The following mainly describes how to implement streaming speech recognition with `MaixCAM` and `sherpa-onnx`.

> Note: Streaming speech recognition offers low latency and can recognize speech as it is spoken; it is commonly used in real-time translation, voice assistants, and similar scenarios. Non-streaming recognition must process a complete sentence per inference and offers higher accuracy.

## Deploying the Speech Recognition Server

`sherpa-onnx` can be deployed in many languages, including `C/C++`, `Python`, `Java`, and more. For convenience, we deploy it with `Python`. If you have any questions during the following steps, you can first read through the `sherpa` [documentation](https://k2-fsa.github.io/sherpa/intro.html). Let's start deploying!

#### Download the `sherpa-onnx` Repository

```shell
git clone https://github.com/k2-fsa/sherpa-onnx.git
```

#### Install Dependencies

```shell
pip install numpy
pip install websockets
```

#### Install the `sherpa-onnx` Package

```shell
pip install sherpa-onnx
```


If you need to use a `GPU`, install the package built with `cuda`:

```shell
pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

# For users in China
# pip install sherpa-onnx==1.10.16+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda-cn.html
```

If the package cannot be found or installation fails, you can build and install from source:

```shell
cd sherpa-onnx
export SHERPA_ONNX_CMAKE_ARGS="-DSHERPA_ONNX_ENABLE_GPU=ON"
python3 setup.py install
```

If you have a `GPU` but no `cuda` environment, follow the guide [`here`](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html) to install the matching `cuda` version.

#### Check that the `sherpa-onnx` Package Installed Successfully

```shell
python3 -c "import sherpa_onnx; print(sherpa_onnx.__version__)"

# Expected output:
# sherpa-onnx or 1.10.16+cuda
```

#### Download the Model

[`Bilingual Chinese-English zipformer model: sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/.tar.bz2)

[`Trilingual Chinese-Cantonese-English paraformer model: sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en`](https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2)

> Note: For Chinese recognition, the `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-mobile` model is recommended
>
> For English recognition, the `sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en` model is recommended

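The models are distributed as `.tar.bz2` archives. Below is a minimal sketch for downloading and unpacking the paraformer model into the `sherpa-onnx` directory (assuming `wget` and `tar` are available); the zipformer archive can be handled the same way:

```shell
cd sherpa-onnx
# Fetch and unpack the paraformer model archive linked above
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
tar xjf sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en.tar.bz2
# The extracted directory should contain encoder.onnx, decoder.onnx, and tokens.txt used below
ls sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en
```
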
#### Run the Server

`sherpa-onnx` provides server example code, so we don't need to reinvent the wheel and write our own code to try online speech recognition. See the examples below for how to start it.

##### Run the `zipformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20"
python3 ./python-api-examples/streaming_server.py \
  --encoder ./${MODEL_PATH}/encoder-epoch-99-avg-1.onnx \
  --decoder ./${MODEL_PATH}/decoder-epoch-99-avg-1.onnx \
  --joiner ./${MODEL_PATH}/joiner-epoch-99-avg-1.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

This example runs `streaming_server.py` as the server code. `--encoder`, `--decoder`, and `--joiner` specify the model files, `--tokens` is the token list used to map the model output, and `--provider` selects whether to use the `GPU`; by default the `CPU` is used.

##### Run the `paraformer` Model

```shell
cd sherpa-onnx
export MODEL_PATH="sherpa-onnx-streaming-paraformer-trilingual-zh-cantonese-en"
python3 ./python-api-examples/streaming_server.py \
  --paraformer-encoder ./${MODEL_PATH}/encoder.onnx \
  --paraformer-decoder ./${MODEL_PATH}/decoder.onnx \
  --tokens ./${MODEL_PATH}/tokens.txt \
  --provider "cuda"
```

This example runs `streaming_server.py` as the server code. `--paraformer-encoder` and `--paraformer-decoder` specify the model files, `--tokens` is the token list used to map the model output, and `--provider` selects whether to use the `GPU`; by default the `CPU` is used.

##### Log Output After a Successful Start

```shell
2024-12-23 09:25:17,557 INFO [streaming_server.py:667] No certificate provided
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on [::]:6006
2024-12-23 09:25:17,561 INFO [server.py:715] server listening on 0.0.0.0:6006
2024-12-23 09:25:17,561 INFO [streaming_server.py:693] Please visit one of the following addresses:

  http://localhost:6006

Since you are not providing a certificate, you cannot use your microphone from within the browser using public IP addresses. Only localhost can be used.You also cannot use 0.0.0.0 or 127.0.0.1
```

At this point, the ASR model server is up and running. Next, let's communicate with it.

#### Communicating with the Server from `MaixCAM`

To keep this document short, only links to the example client code are provided here; copy the code yourself. Note that in most cases the audio data must have a sampling rate of `16000Hz` and `1` channel:

Click [here](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_streaming_websockt_client) for the `MaixCAM` streaming recognition code

Click [here](https://github.com/sipeed/MaixPy/blob/main/examples/audio/asr/asr_non_streaming_websockt_client) for the `MaixCAM` non-streaming recognition code

```python
# Update the server address
SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006
```

After updating the server address and port, run the code with `MaixVision`. If you are running the streaming recognition code, try talking to `MaixCAM`!

> Note: One reason this document does not go into the client-server communication protocol is that it is very simple: essentially raw data exchange over a `websocket` connection. It is recommended to try it out first and then read the code directly to find the details you actually want to know.

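As a reference, here is a minimal sketch of that exchange, modeled on the example clients linked above. It assumes a 16000Hz, mono, 16-bit `test.wav` next to the script and that the server above is reachable at `SERVER_ADDR:SERVER_PORT`:

```python
import asyncio
import json
import wave

import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

async def main():
    # Load the wav file and convert int16 PCM to normalized float32,
    # which is what streaming_server.py expects on the wire.
    with wave.open("test.wav") as f:
        assert f.getframerate() == 16000 and f.getnchannels() == 1
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = (pcm.astype(np.float32) / 32768).tobytes()

    async with websockets.connect(f"ws://{SERVER_ADDR}:{SERVER_PORT}") as ws:
        chunk = 8000 * 4  # 0.5 s of float32 samples per message
        for i in range(0, len(samples), chunk):
            await ws.send(samples[i:i + chunk])
        await ws.send("Done")      # tell the server the stream has ended
        async for message in ws:   # JSON results arrive until the server replies "Done!"
            if message == "Done!":
                break
            print(json.loads(message))

asyncio.run(main())
```
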
The deployment is now complete.

docs/doc/zh/sidebar.yaml

Lines changed: 2 additions & 0 deletions
@@ -108,6 +108,8 @@ items:
      label: AI 声音分类
    - file: pro/customize_model.md
      label: 移植新模型
+   - file: audio/deploy_online_recognition.md
+     label: 部署在线语音识别环境

  - label: 视频
    items:
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
#!/usr/bin/env python3

from maix import audio
import asyncio
import json
import wave
import numpy as np
import websockets

SERVER_ADDR = "127.0.0.1"
SERVER_PORT = 6006

def read_wave(wave_filename: str) -> np.ndarray:
    # Read a 16 kHz, mono, 16-bit wav file and return normalized float32 samples
    with wave.open(wave_filename) as f:
        assert f.getframerate() == 16000, f.getframerate()
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)

        samples_float32 = samples_float32 / 32768
        return samples_float32

async def receive_results(socket):
    # Print recognition results from the server until it replies "Done!"
    last_message = ""
    async for message in socket:
        if message != "Done!":
            last_message = message
            print(json.loads(message))
        else:
            break
    return last_message


async def run(
    server_addr: str,
    server_port: int,
    wave_filename: str,
    samples_per_message: int,
    seconds_per_message: float,
):
    data = read_wave(wave_filename)

    async with websockets.connect(
        f"ws://{server_addr}:{server_port}"
    ) as websocket:  # noqa
        receive_task = asyncio.create_task(receive_results(websocket))

        # Send the float32 samples in small chunks to mimic a real-time stream
        start = 0
        while start < data.shape[0]:
            end = start + samples_per_message
            end = min(end, data.shape[0])
            d = data.data[start:end].tobytes()

            await websocket.send(d)
            await asyncio.sleep(seconds_per_message)
            start += samples_per_message

        # Tell the server the stream has ended, then wait for the final results
        await websocket.send("Done")
        await receive_task

async def main():
    # Record 3 seconds of 16 kHz mono audio on the MaixCAM, then upload it for recognition
    wav_path = "/tmp/test.wav"
    recorder = audio.Recorder(wav_path, sample_rate=16000, channel=1)
    recorder.volume(100)
    print('Please Speak..')
    recorder.record(3 * 1000)
    recorder.finish()
    print('Recording complete, upload wav file to server..')

    await run(
        server_addr=SERVER_ADDR,
        server_port=SERVER_PORT,
        wave_filename=wav_path,
        samples_per_message=8000,
        seconds_per_message=0.1,
    )

if __name__ == "__main__":
    asyncio.run(main())
