diff --git a/docs/doc/en/audio/digit.md b/docs/doc/en/audio/digit.md
index a2dd36d6..61b16d55 100644
--- a/docs/doc/en/audio/digit.md
+++ b/docs/doc/en/audio/digit.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Continuous Chinese digit recognition
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: str, len: int):
     print(data)
@@ -32,7 +32,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -55,10 +54,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # Use the default microphone
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # Or specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone; `WAV` and `PCM` audio can also be used as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -74,11 +74,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` audio, if you want to reset the data source, for example to recognize the next WAV file, you can use the `speech.device` method, which automatically clears the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
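+
+For example, to recognize several WAV files one after another, reset the source before each file (a sketch using the documented `speech.device` and `speech.run` calls; the file paths are illustrative):
+
+```python
+for path in ["first.wav", "next.wav"]:               # illustrative paths
+    speech.device(nn.SpeechDevice.DEVICE_WAV, path)  # switch the source; the cache is cleared
+    while speech.run(1) >= 1:                        # run until the file is exhausted
+        pass
+```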
 
 4. Set up the decoder
 
@@ -89,11 +88,15 @@ def callback(data: str, len: int):
     print(data)
 
 speech.digit(640, callback)
 ```
 
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `digit` decoder is registered to output the Chinese digit recognition results from the last 4 seconds. The returned recognition results are in string format and support `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(thousand)`. For other decoder usages, please refer to the sections on Real-time voice recognition and keyword recognition.
+- The user can configure multiple decoders simultaneously. The `digit` decoder registered here outputs the Chinese digit recognition results from the last 4 seconds. The returned recognition results are in string format and support `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(ten thousand)`.
 
 - When setting the `digit` decoder, you need to specify a `blank` value; exceeding this value (in ms) will insert a `_` in the output results to indicate idle silence.
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
+```
 
 5. Recognition
 
@@ -102,12 +105,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results, as sketched below.
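+
+A minimal sketch of both calls (assuming recognition is set up as above):
+
+```python
+speech.clear()         # drop any cached recognition results
+speech.skip_frames(1)  # skip the first frame after a decoder switch
+```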
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield continuous Chinese digit recognition results, such as:
diff --git a/docs/doc/en/audio/keyword.md b/docs/doc/en/audio/keyword.md
index 8f6d30b1..530c9f7f 100644
--- a/docs/doc/en/audio/keyword.md
+++ b/docs/doc/en/audio/keyword.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Keyword recognition
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 kw_tbl = ['xiao3 ai4 tong2 xue2',
           'ni3 hao3',
@@ -39,7 +39,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -62,10 +61,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # Use the default microphone
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # Or specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone; `WAV` and `PCM` audio can also be used as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -81,11 +81,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` audio, if you want to reset the data source, for example to recognize the next WAV file, you can use the `speech.device` method, which automatically clears the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
 
@@ -103,7 +102,7 @@ def callback(data:list[float], len: int):
 speech.kws(kw_tbl, kw_gate, callback, True)
 ```
 
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `kws` decoder is registered to output a list of probabilities for all registered keywords from the last frame. Users can observe the probability values and set their own thresholds for activation. For other decoder usages, please refer to the sections on Real-time voice recognition and continuous Chinese numeral recognition.
+- The user can configure multiple decoders simultaneously. The `kws` decoder registered here outputs a list of probabilities for all registered keywords from the last frame. Users can observe the probability values and set their own thresholds for activation (see the sketch at the end of this step).
 
 - When setting up the `kws` decoder, you need to provide a `keyword list` separated by spaces in Pinyin, a `keyword probability threshold list` arranged in order, and specify whether to enable `automatic near-sound processing`. If set to `True`, different tones of the same Pinyin will be treated as similar words to accumulate probabilities. Finally, you need to set a callback function to handle the decoded data.
 
@@ -114,7 +113,11 @@
 similar_char = ['zhen3', 'zheng3']
 speech.similar('zen3', similar_char)
 ```
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_KWS)
+```
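+
+As an illustration of using the probability list, the callback can compare each value against a threshold of your choice (a sketch; the `0.1` threshold and the printout are illustrative):
+
+```python
+def callback(data: list[float], len: int):
+    for idx, prob in enumerate(data):
+        if prob > 0.1:  # illustrative activation threshold
+            print("keyword:", kw_tbl[idx], "prob:", prob)
+```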
 
 5. Recognition
 
@@ -123,12 +126,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield keyword recognition results, such as:
diff --git a/docs/doc/en/audio/recognize.md b/docs/doc/en/audio/recognize.md
index e300341d..9d7040a8 100644
--- a/docs/doc/en/audio/recognize.md
+++ b/docs/doc/en/audio/recognize.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Continuous Large Vocabulary Speech Recognition
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: tuple[str, str], len: int):
     print(data)
@@ -36,7 +36,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -59,10 +58,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # Use the default microphone
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # Or specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone; `WAV` and `PCM` audio can also be used as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -78,11 +78,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` audio, if you want to reset the data source, for example to recognize the next WAV file, you can use the `speech.device` method, which automatically clears the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
 
@@ -97,11 +96,15 @@
 speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym", \
              lmS_path + "phones.bin", lmS_path + "words_utf.bin", \
              callback)
 ```
 
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `lvcsr` decoder is registered to output continuous speech recognition results (for fewer than 1024 Chinese characters). For other decoder usages, please refer to the sections on continuous Chinese numeral recognition and keyword recognition.
+- The user can configure multiple decoders simultaneously. The `lvcsr` decoder registered here outputs continuous speech recognition results (for fewer than 1024 Chinese characters).
 
 - When setting up the `lvcsr` decoder, you need to specify the paths for the `sfst` file, the `sym` file (output symbol table), the path for `phones.bin` (phonetic table), and the path for `words.bin` (dictionary). Lastly, a callback function must be set to handle the decoded data.
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_LVCSR)
+```
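+
+The callback receives a tuple of two strings. A sketch of unpacking it (the element order here is an assumption; verify it against the printed output):
+
+```python
+def callback(data: tuple[str, str], len: int):
+    text, pinyin = data  # assumed order: recognized text, then its pinyin
+    print(text)
+```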
 
 5. Recognition
 
@@ -110,12 +113,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield real-time speech recognition results, such as:
diff --git a/docs/doc/zh/audio/digit.md b/docs/doc/zh/audio/digit.md
index ea0ba202..5a954ae4 100644
--- a/docs/doc/zh/audio/digit.md
+++ b/docs/doc/zh/audio/digit.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是专为嵌入式环境设计的离线语音库,其针对语音识别算法进行了深度优化,在内存占用上达到了数量级上的领先,并且保持了优良的WER。如果想了解原理可查看该开源项目。
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是一款专为嵌入式环境设计的离线语音识别库,针对语音识别算法进行了深度优化,显著降低内存占用,同时在识别准确率方面表现优异。详细说明请参考 [Maix-Speech 使用文档](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md)。
 
 ## 连续中文数字识别
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: str, len: int):
     print(data)
@@ -32,7 +32,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -55,10 +54,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. 选择对应的音频设备
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # 使用默认麦克风
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # 或指定音频输入设备
 ```
 
-- 这里使用的是板载的麦克风,也选择 `WAV` 和 `PCM` 音频作为输入设备
+- 这里使用的是板载麦克风,也可以选择 `WAV` 和 `PCM` 音频作为输入
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # 使用 WAV 音频输入
@@ -74,11 +74,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # 使用 PCM 音频
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个WAV文件的识别可以使用 `speech.devive` 方法,内部会自动进行缓存清除操作:
-
+- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个 WAV 文件的识别,可以使用 `speech.device` 方法,内部会自动进行缓存清除操作:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. 设置解码器
 
@@ -89,11 +88,15 @@ def callback(data: str, len: int):
     print(data)
 
 speech.digit(640, callback)
 ```
 
-- 用户可以注册若干个解码器(也可以不注册),解码器的作用是解码声学模型的结果,并执行对应的用户回调。这里注册了一个 `digit` 解码器用于输出最近4s内的中文数字识别结果。返回的识别结果为字符串形式,支持 `0123456789 .(点) S(十) B(百) Q(千) W(万)`。对于其他解码器的使用可以查看语音实时识别和关键词识别部分
+- 用户可以同时设置多个解码器,`digit` 解码器的作用是输出最近4s内的中文数字识别结果。返回的识别结果为字符串形式,支持 `0123456789 .(点) S(十) B(百) Q(千) W(万)`。
 
 - 设置 `digit` 解码器时需要设置 `blank` 值,超过该值(ms)则在输出结果里插入一个 `_` 表示空闲静音
 
-- 在注册完解码器后需要使用 `speech.deinit()` 方法清除初始化
+- 如果不再需要使用某个解码器,可以通过调用 `speech.dec_deinit` 方法进行解除初始化。
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
+```
 
 5. 识别
 
@@ -102,12 +105,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - 使用 `speech.run` 方法运行语音识别,传入的参数为每次运行的帧数,返回实际运行的帧数。用户可以选择每次运行1帧后进行其他处理,或在一个线程中持续运行,使用外部线程进行停止。
 
+- 若需清除已识别结果的缓存,可以使用 `speech.clear` 方法。
+
+- 在识别过程中切换解码器,切换后的第一帧可能会出现识别错误。可以使用 `speech.skip_frames(1)` 跳过第一帧,确保后续结果准确。
+
 ### 识别结果
 
 如果上述程序运行正常,对板载麦克风说话,会得到连续中文数字识别结果,如:
diff --git a/docs/doc/zh/audio/keyword.md b/docs/doc/zh/audio/keyword.md
index b5d172f7..d0d8a93f 100644
--- a/docs/doc/zh/audio/keyword.md
+++ b/docs/doc/zh/audio/keyword.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是专为嵌入式环境设计的离线语音库,其针对语音识别算法进行了深度优化,在内存占用上达到了数量级上的领先,并且保持了优良的WER。如果想了解原理可查看该开源项目。
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是一款专为嵌入式环境设计的离线语音识别库,针对语音识别算法进行了深度优化,显著降低内存占用,同时在识别准确率方面表现优异。详细说明请参考 [Maix-Speech 使用文档](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md)。
 
 ## 关键词识别
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 kw_tbl = ['xiao3 ai4 tong2 xue2',
           'ni3 hao3',
@@ -39,7 +39,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -62,10 +61,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. 选择对应的音频设备
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # 使用默认麦克风
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # 或指定音频输入设备
 ```
 
-- 这里使用的是板载的麦克风,也选择 `WAV` 和 `PCM` 音频作为输入设备
+- 这里使用的是板载麦克风,也可以选择 `WAV` 和 `PCM` 音频作为输入
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # 使用 WAV 音频输入
@@ -81,11 +81,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # 使用 PCM 音频
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个WAV文件的识别可以使用 `speech.devive` 方法,内部会自动进行缓存清除操作:
-
+- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个 WAV 文件的识别,可以使用 `speech.device` 方法,内部会自动进行缓存清除操作:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. 设置解码器
 
@@ -103,7 +102,7 @@ def callback(data:list[float], len: int):
 speech.kws(kw_tbl, kw_gate, callback, True)
 ```
 
-- 用户可以注册若干个解码器(也可以不注册),解码器的作用是解码声学模型的结果,并执行对应的用户回调。这里注册了一个 `kws` 解码器用于输出最近一帧所有注册的关键词的概率列表,用户可以观察概率值,自行设定阈值进行唤醒。对于其他解码器的使用可以查看语音实时识别和连续中文数字识别部分
+- 用户可以同时设置多个解码器,`kws` 解码器用于输出最近一帧所有注册的关键词的概率列表,用户可以观察概率值,自行设定阈值进行唤醒。
 
 - 设置 `kws` 解码器时需要设置 `关键词列表`,以拼音间隔空格填写,`关键词概率门限表`,按顺序排列输入即可,是否进行 `自动近音处理`,设置为 `True` 则会自动将不同声调的拼音作为近音词来合计概率。最后还要设置一个回调函数用于处理解码出的数据。
 
@@ -114,7 +113,11 @@
 similar_char = ['zhen3', 'zheng3']
 speech.similar('zen3', similar_char)
 ```
 
-- 在注册完解码器后需要使用 `speech.deinit()` 方法清除初始化
+- 如果不再需要使用某个解码器,可以通过调用 `speech.dec_deinit` 方法进行解除初始化。
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_KWS)
+```
 
 5. 识别
 
@@ -123,12 +126,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - 使用 `speech.run` 方法运行语音识别,传入的参数为每次运行的帧数,返回实际运行的帧数。用户可以选择每次运行1帧后进行其他处理,或在一个线程中持续运行,使用外部线程进行停止。
 
+- 若需清除已识别结果的缓存,可以使用 `speech.clear` 方法。
+
+- 在识别过程中切换解码器,切换后的第一帧可能会出现识别错误。可以使用 `speech.skip_frames(1)` 跳过第一帧,确保后续结果准确。
+
 ### 识别结果
 
 如果上述程序运行正常,对板载麦克风说话,会得到关键词识别结果,如:
diff --git a/docs/doc/zh/audio/recognize.md b/docs/doc/zh/audio/recognize.md
index d955cf41..e352748b 100644
--- a/docs/doc/zh/audio/recognize.md
+++ b/docs/doc/zh/audio/recognize.md
@@ -13,7 +13,7 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是专为嵌入式环境设计的离线语音库,其针对语音识别算法进行了深度优化,在内存占用上达到了数量级上的领先,并且保持了优良的WER。如果想了解原理可查看该开源项目。
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) 是一款专为嵌入式环境设计的离线语音识别库,针对语音识别算法进行了深度优化,显著降低内存占用,同时在识别准确率方面表现优异。详细说明请参考 [Maix-Speech 使用文档](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md)。
 
 ## 连续大词汇量语音识别
 
@@ -21,7 +21,7 @@ update:
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: tuple[str, str], len: int):
     print(data)
@@ -36,7 +36,6 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -59,10 +58,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. 选择对应的音频设备
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)            # 使用默认麦克风
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")  # 或指定音频输入设备
 ```
 
-- 这里使用的是板载的麦克风,也选择 `WAV` 和 `PCM` 音频作为输入设备
+- 这里使用的是板载麦克风,也可以选择 `WAV` 和 `PCM` 音频作为输入
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # 使用 WAV 音频输入
@@ -78,11 +78,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # 使用 PCM 音频
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个WAV文件的识别可以使用 `speech.devive` 方法,内部会自动进行缓存清除操作:
-
+- 在 `PCM/WAV` 识别时,如果想要重新设置数据源,例如进行下一个 WAV 文件的识别,可以使用 `speech.device` 方法,内部会自动进行缓存清除操作:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. 设置解码器
 
@@ -97,11 +96,15 @@
 speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym", \
              lmS_path + "phones.bin", lmS_path + "words_utf.bin", \
              callback)
 ```
 
-- 用户可以注册若干个解码器(也可以不注册),解码器的作用是解码声学模型的结果,并执行对应的用户回调。这里注册了一个 `lvcsr` 解码器用于输出连续语音识别结果(小于1024个汉字结果)。对于其他解码器的使用可以查看连续中文数字识别和关键词识别部分
+- 用户可以同时设置多个解码器,`lvcsr` 解码器用于输出连续语音识别结果(小于1024个汉字结果)。
 
 - 设置 `lvcsr` 解码器时需要设置 `sfst` 文件路径,`sym` 文件路径(输出符号表),`phones.bin` 的路径(拼音表),和 `words.bin` 的路径(词典表)。最后还要设置一个回调函数用于处理解码出的数据。
 
-- 在注册完解码器后需要使用 `speech.deinit()` 方法清除初始化
+- 如果不再需要使用某个解码器,可以通过调用 `speech.dec_deinit` 方法进行解除初始化。
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_LVCSR)
+```
 
 5. 识别
 
@@ -110,12 +113,15 @@
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - 使用 `speech.run` 方法运行语音识别,传入的参数为每次运行的帧数,返回实际运行的帧数。用户可以选择每次运行1帧后进行其他处理,或在一个线程中持续运行,使用外部线程进行停止。
 
+- 若需清除已识别结果的缓存,可以使用 `speech.clear` 方法。
+
+- 在识别过程中切换解码器,切换后的第一帧可能会出现识别错误。可以使用 `speech.skip_frames(1)` 跳过第一帧,确保后续结果准确。
+
 ### 识别结果
 
 如果上述程序运行正常,对板载麦克风说话,会得到实时语音识别结果,如:
diff --git a/examples/audio/asr/asr_digit.py b/examples/audio/asr/asr_digit.py
index d4435f2b..2c573c3a 100644
--- a/examples/audio/asr/asr_digit.py
+++ b/examples/audio/asr/asr_digit.py
@@ -1,16 +1,15 @@
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: str, len: int):
     print(data)
 
-speech.digit(640, digit_callback)
+speech.digit(640, callback)
 
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
diff --git a/examples/audio/asr/asr_kws.py b/examples/audio/asr/asr_kws.py
index 1fa1e875..0f66a534 100644
--- a/examples/audio/asr/asr_kws.py
+++ b/examples/audio/asr/asr_kws.py
@@ -1,7 +1,7 @@
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 kw_tbl = ['xiao3 ai4 tong2 xue2',
           'ni3 hao3',
@@ -19,5 +19,4 @@ def callback(data:list[float], len: int):
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
diff --git a/examples/audio/asr/asr_lvcsr.py b/examples/audio/asr/asr_lvcsr.py
index 557b1a7b..dd0514ac 100644
--- a/examples/audio/asr/asr_lvcsr.py
+++ b/examples/audio/asr/asr_lvcsr.py
@@ -1,7 +1,7 @@
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: tuple[str, str], len: int):
     print(data)
@@ -10,11 +10,10 @@ def callback(data: tuple[str, str], len: int):
 
 speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym", \
              lmS_path + "phones.bin", lmS_path + "words_utf.bin", \
-             my_lvcsrcb)
+             callback)
 
 while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break