Merged
28 changes: 17 additions & 11 deletions docs/doc/en/audio/digit.md

## Maix-Speech

[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library designed specifically for embedded environments. Its recognition algorithms are deeply optimized to significantly reduce memory usage while maintaining excellent recognition accuracy. For details, see the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).

## Continuous Chinese digit recognition

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

def callback(data: str, len: int):
print(data)

speech.digit(640, callback)

while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

3. Choose the corresponding audio device

```python
speech.init(nn.SpeechDevice.DEVICE_MIC)
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
```

- This uses the onboard microphone; `WAV` and `PCM` audio files can also be used as input.

```python
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
```

- `PCM/WAV` input must be 16 kHz sample rate, 16-bit signed, mono; a suitable file can be recorded with:

```shell
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
```

- When recognizing `PCM/WAV` input, if you want to reset the data source (for example, to recognize the next WAV file), use the `speech.device` method; it automatically clears the internal cache:

```python
speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
```
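Before feeding a recorded file to the recognizer, it can help to verify that it matches the format implied by the `arecord` flags above (16 kHz, 16-bit signed, mono). The sketch below uses only the Python standard library; the `wav_matches_speech_format` helper and the demo file name are illustrative, not part of the maix API:

```python
import struct
import wave

def wav_matches_speech_format(path: str) -> bool:
    """Check that a WAV file is 16 kHz, 16-bit, mono, as the recognizer expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2
                and w.getnchannels() == 1)

# Write a tiny 16 kHz mono 16-bit file just to demonstrate the check.
with wave.open("check_demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<16h", *range(16)))  # 16 near-silent samples

print(wav_matches_speech_format("check_demo.wav"))  # True
```

Files that fail this check should be re-recorded or converted before being passed to `speech.init` or `speech.device`.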

4. Set up the decoder
```python
def callback(data: str, len: int):
    print(data)

speech.digit(640, callback)
```
- Users can configure multiple decoders simultaneously. Here a `digit` decoder is registered; it outputs the Chinese digit recognition results from the last 4 seconds. Results are returned as strings supporting `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(ten thousand)`.

- When setting up the `digit` decoder, you must specify a `blank` value in milliseconds; if silence lasts longer than this value, a `_` is inserted into the output to mark idle silence.

- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.

```python
speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
```
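The digit strings described above can be converted to integers with ordinary string processing. The sketch below is a hypothetical helper, not part of the maix API; it assumes numbers are read out with the unit markers (e.g. `2B3S5` for 235), handles `S/B/Q/W`, ignores `_` silence markers, and leaves out the `.` decimal marker for brevity:

```python
def digit_result_to_int(s: str) -> int:
    """Convert a digit result string like '2B3S5' (two hundred thirty-five) to an int."""
    units = {'S': 10, 'B': 100, 'Q': 1000}  # ten, hundred, thousand
    total = 0    # completed ten-thousand (W) sections
    current = 0  # value accumulated within the current section
    num = 0      # most recent bare digit
    for ch in s:
        if ch == '_':            # silence marker, ignore
            continue
        if ch.isdigit():
            num = int(ch)
        elif ch in units:
            current += (num or 1) * units[ch]  # 'S5' alone means 15
            num = 0
        elif ch == 'W':          # W (wan) closes a ten-thousand section
            total = (total + current + num) * 10000
            current = num = 0
    return total + current + num

print(digit_result_to_int("2B3S5"))  # 235
```

This kind of post-processing can run inside the `digit` callback to turn spoken numbers into values for further logic.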

5. Recognition

```python
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

- Use the `speech.run` method to run speech recognition. The parameter specifies how many frames to process per call, and the method returns the number of frames actually processed. You can run one frame at a time and interleave other processing, or run continuously in a dedicated thread and stop it from another thread.

- To clear the cache of recognized results, you can use the `speech.clear` method.

- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.

### Recognition Results

If the above program runs successfully, speaking into the onboard microphone will yield continuous Chinese digit recognition results, such as:
28 changes: 17 additions & 11 deletions docs/doc/en/audio/keyword.md

## Maix-Speech

[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library designed specifically for embedded environments. Its recognition algorithms are deeply optimized to significantly reduce memory usage while maintaining excellent recognition accuracy. For details, see the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).

## Keyword recognition

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

kw_tbl = ['xiao3 ai4 tong2 xue2',
'ni3 hao3',
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

3. Choose the corresponding audio device

```python
speech.init(nn.SpeechDevice.DEVICE_MIC)
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
```

- This uses the onboard microphone; `WAV` and `PCM` audio files can also be used as input.

```python
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
```

- `PCM/WAV` input must be 16 kHz sample rate, 16-bit signed, mono; a suitable file can be recorded with:

```shell
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
```

- When recognizing `PCM/WAV` input, if you want to reset the data source (for example, to recognize the next WAV file), use the `speech.device` method; it automatically clears the internal cache:

```python
speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
```

4. Set up the decoder
```python
def callback(data: list[float], len: int):
    print(data)

speech.kws(kw_tbl, kw_gate, callback, True)
```
- Users can configure multiple decoders simultaneously. Here a `kws` decoder is registered; it outputs a list of probabilities for all registered keywords from the latest frame. You can inspect these probabilities and set your own activation thresholds.

- When setting up the `kws` decoder, you need to provide a `keyword list` (keywords in Pinyin, with syllables separated by spaces), a `keyword probability threshold list` in the same order, and a flag enabling `automatic near-sound processing`; if set to `True`, different tones of the same Pinyin are treated as similar sounds and their probabilities are accumulated. Finally, set a callback function to handle the decoded data.

```python
similar_char = ['zhen3', 'zheng3']
speech.similar('zen3', similar_char)
```

- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.

```python
speech.dec_deinit(nn.SpeechDecoder.DECODER_KWS)
```
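Since the `kws` callback delivers one probability per registered keyword, activation can be a simple threshold comparison against `kw_gate`. A hardware-free sketch; the `activated_keywords` helper and the sample probabilities are illustrative, not part of the maix API:

```python
def activated_keywords(probs, kw_tbl, kw_gate):
    """Return the keywords whose probability meets or exceeds its threshold."""
    return [kw for kw, p, gate in zip(kw_tbl, probs, kw_gate) if p >= gate]

kw_tbl = ['xiao3 ai4 tong2 xue2', 'ni3 hao3']
kw_gate = [0.3, 0.3]  # per-keyword activation thresholds (illustrative values)

# Simulated probabilities as the kws callback might receive them for one frame.
print(activated_keywords([0.05, 0.82], kw_tbl, kw_gate))  # ['ni3 hao3']
```

In practice this comparison would live inside the `kws` callback, with thresholds tuned per keyword by observing real probability values.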

5. Recognition

```python
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

- Use the `speech.run` method to run speech recognition. The parameter specifies how many frames to process per call, and the method returns the number of frames actually processed. You can run one frame at a time and interleave other processing, or run continuously in a dedicated thread and stop it from another thread.

- To clear the cache of recognized results, you can use the `speech.clear` method.

- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.

### Recognition Results

If the above program runs successfully, speaking into the onboard microphone will yield keyword recognition results, such as:
28 changes: 17 additions & 11 deletions docs/doc/en/audio/recognize.md

## Maix-Speech

[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library designed specifically for embedded environments. Its recognition algorithms are deeply optimized to significantly reduce memory usage while maintaining excellent recognition accuracy. For details, see the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).

## Continuous Large Vocabulary Speech Recognition

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

def callback(data: tuple[str, str], len: int):
print(data)
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

3. Choose the corresponding audio device

```python
speech.init(nn.SpeechDevice.DEVICE_MIC)
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
```

- This uses the onboard microphone; `WAV` and `PCM` audio files can also be used as input.

```python
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
```

- `PCM/WAV` input must be 16 kHz sample rate, 16-bit signed, mono; a suitable file can be recorded with:

```shell
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
```

- When recognizing `PCM/WAV` input, if you want to reset the data source (for example, to recognize the next WAV file), use the `speech.device` method; it automatically clears the internal cache:

```python
speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
```

4. Set up the decoder
```python
speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym", \
             lmS_path + "phones.bin", lmS_path + "words_utf.bin", \
             callback)
```
- Users can configure multiple decoders simultaneously. Here an `lvcsr` decoder is registered; it outputs continuous speech recognition results (for texts of fewer than 1024 Chinese characters).

- When setting up the `lvcsr` decoder, you need to specify the paths of the `sfst` file, the `sym` file (output symbol table), `phones.bin` (phone table), and `words.bin` (dictionary). Finally, set a callback function to handle the decoded data.

- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.

```python
speech.dec_deinit(nn.SpeechDecoder.DECODER_LVCSR)
```
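The `lvcsr` callback fires repeatedly as decoding progresses, so a common pattern is to accumulate the partial results into a transcript. A hardware-free sketch; it assumes, based on the callback signature above, that `data` holds `(pinyin, text)` — check this against the actual API before relying on it:

```python
results: list = []

def callback(data: tuple, len: int) -> None:
    # Collect each partial result; data is assumed to be (pinyin, text).
    results.append(data)

# Simulate two decoder callbacks as speech.run() might trigger them.
callback(("ni3 hao3", "你好"), 2)
callback(("shi4 jie4", "世界"), 2)

transcript = "".join(text for _pny, text in results)
print(transcript)  # 你好世界
```

The accumulated list can then be cleared between utterances, for example whenever `speech.clear` is called to reset the recognizer's own cache.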

5. Recognition

```python
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

- Use the `speech.run` method to run speech recognition. The parameter specifies how many frames to process per call, and the method returns the number of frames actually processed. You can run one frame at a time and interleave other processing, or run continuously in a dedicated thread and stop it from another thread.

- To clear the cache of recognized results, you can use the `speech.clear` method.

- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.

### Recognition Results

If the above program runs successfully, speaking into the onboard microphone will yield real-time speech recognition results, such as:
28 changes: 17 additions & 11 deletions docs/doc/zh/audio/digit.md

## Maix-Speech

[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library designed specifically for embedded environments. Its recognition algorithms are deeply optimized to significantly reduce memory usage while maintaining excellent recognition accuracy. For details, see the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).

## Continuous Chinese digit recognition

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

def callback(data: str, len: int):
print(data)

speech.digit(640, callback)

while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

Expand All @@ -55,10 +54,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
3. Choose the corresponding audio device

```python
speech.init(nn.SpeechDevice.DEVICE_MIC)
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
```

- This uses the onboard microphone; `WAV` and `PCM` audio files can also be used as input.

```python
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
```

- `PCM/WAV` input must be 16 kHz sample rate, 16-bit signed, mono; a suitable file can be recorded with:

```shell
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
```

- When recognizing `PCM/WAV` input, if you want to reset the data source (for example, to recognize the next WAV file), use the `speech.device` method; it automatically clears the internal cache:

```python
speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
```

4. Set up the decoder
```python
def callback(data: str, len: int):
    print(data)

speech.digit(640, callback)
```
- Users can configure multiple decoders simultaneously. Here a `digit` decoder is registered; it outputs the Chinese digit recognition results from the last 4 seconds. Results are returned as strings supporting `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(ten thousand)`.

- When setting up the `digit` decoder, you must specify a `blank` value in milliseconds; if silence lasts longer than this value, a `_` is inserted into the output to mark idle silence.

- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.

```python
speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
```

5. Recognition

```python
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
```

- Use the `speech.run` method to run speech recognition. The parameter specifies how many frames to process per call, and the method returns the number of frames actually processed. You can run one frame at a time and interleave other processing, or run continuously in a dedicated thread and stop it from another thread.

- To clear the cache of recognized results, use the `speech.clear` method.

- When switching decoders during recognition, the first frame after the switch may be recognized incorrectly. Use `speech.skip_frames(1)` to skip the first frame and ensure subsequent results are accurate.

### Recognition Results

If the above program runs correctly, speaking into the onboard microphone will produce continuous Chinese digit recognition results, such as: