
Commit fa058e8

* optimize sdv1-5 method
* optimize large model docs
1 parent 8ad058d commit fa058e8

40 files changed: +1987 −686 lines

docs/doc/en/audio/recognize.md

Lines changed: 3 additions & 59 deletions
@@ -20,74 +20,18 @@ Speech recognition model support list:

|            | MaixCAM | MaixCAM Pro | MaixCAM2 |
| ---------- | ------- | ----------- | -------- |
| Whisper    | ❌      | ❌          | ✅       |
| SenseVoice | ❌      | ❌          | ✅       |
| Speech     |         |             |          |

In addition, we have ported OpenAI's Whisper speech recognition model to the `MaixCAM2`, enabling powerful speech-to-text functionality even on resource-constrained devices.

## Using Whisper for Speech-to-Text

> Note: MaixCAM and MaixCAM Pro do not support the Whisper model.

Currently, only the base version of the Whisper model is supported. It accepts single-channel WAV audio with a 16kHz sample rate and can recognize both Chinese and English.

```python
from maix import nn

whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud")

wav_path = "/maixapp/share/audio/demo.wav"

res = whisper.transcribe(wav_path)

print('whisper:', res)
```

Notes:
1. First, import the nn module to create the Whisper model object:
   ```python
   from maix import nn
   ```
2. Load the model. Currently, only the `base` version is supported:
   ```python
   whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud")
   ```
3. Prepare a WAV audio file with 1 channel and a 16kHz sample rate, and run inference. The result will be returned directly:
   ```python
   wav_path = "/maixapp/share/audio/demo.wav"
   res = whisper.forward(wav_path)
   print('whisper:', res)
   ```
4. Sample output:
   ```shell
   whisper: Have fun exploring!
   ```
5. Give it a try yourself!

By default, it recognizes Chinese. To recognize English, pass the `language` parameter when initializing the object:

```python
whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud", language="en")
```

The usage of Whisper can be found in [Whisper Speech Recognition Model](../mllm/asr_whisper.md)

## Using SenseVoice for Speech-to-Text

Currently, only `MaixCAM2` supports `SenseVoice`, and all SenseVoice-related code is implemented in `Python`, so only Python-side examples are provided.
By default, the system does not include the `SenseVoice` model. Please download it from [here](https://huggingface.co/sipeed/sensevoice-maixcam2) and place it in the `/root/models/` directory.

Before using it, you need to start the `sensevoice.service` service. The command is as follows:

> Note that sensevoice.service starts from the `/root/models/sensevoice-maixcam2` directory by default, so make sure the model is placed under `/root/models/`.

```shell
systemctl start sensevoice.service
```

You can also start it manually:

```shell
cd /root/models/sensevoice-maixcam2
python server.py
```

After the service is started, you can perform speech recognition via HTTP interaction.
For usage, please refer to the example: [asr_sensevoice.py](https://github.com/sipeed/MaixPy/tree/main/examples/audio/asr/sensevoice/asr_sensevoice.py)

The usage of SenseVoice can be found in [SenseVoice Speech Recognition Model](../mllm/asr_sensevoice.md)

## Maix-Speech

docs/doc/en/audio/synthesis.md

Lines changed: 2 additions & 52 deletions
@@ -22,56 +22,6 @@ TTS Support List:

TTS (Text-to-Speech) converts text into speech. You can write a piece of text and feed it to a TTS-supported model. After running the model, it will output audio data containing the spoken version of the text.
In practice, TTS is commonly used for video dubbing, navigation guidance, public announcements, and more. Simply put, TTS is “technology that reads text aloud.”

## MeloTTS

MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. Currently, the melotts-zh model is supported, which can synthesize both Chinese and English speech. However, English synthesis is not yet optimal.

The default output audio is PCM data with a sample rate of 44100 Hz, single channel, and 16-bit depth.

> Sample rate: The number of times sound is sampled per second.
>
> Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.
>
> Bit depth: The data range captured per sample. A 16-bit depth usually represents each sample as a 16-bit signed integer. Higher bit depth captures finer audio details.

```python
from maix import nn, audio

# Only MaixCAM2 supports this model.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)

melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed=0.8, language='zh')

pcm = melotts.infer('hello', output_pcm=True)
p.play(pcm)
```

Notes:
1. Import the nn module first to create a MeloTTS model object:
   ```python
   from maix import nn
   ```
2. Choose the model to load. Currently, the melotts-zh model is supported:
   - `speed` sets the playback speed
   - `language` sets the language type
   ```python
   melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed=0.8, language='zh')
   ```
3. Start inference:
   - The text to infer here is 'hello'
   - Set `output_pcm=True` to return PCM data
   ```python
   pcm = melotts.infer('hello', output_pcm=True)
   ```
4. Use the audio playback module to play the generated audio:
   - Make sure the sample rate matches the model's output
   - Use `p.volume(80)` to control the output volume (range: 0–100)
   - Play the PCM generated by MeloTTS with `p.play(pcm)`
   ```python
   p = audio.Player(sample_rate=sample_rate)
   p.volume(80)
   p.play(pcm)
   ```

The usage of MeloTTS can be found in [MeloTTS Text to Speech Model](../mllm/tts_melotts.md).
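The blockquote above pins down the output format (44100 Hz, mono, 16-bit), and those three numbers fully determine how large a PCM buffer is. A quick sanity check in plain Python (the `pcm_duration` helper is ours for illustration, not part of the maix API):

```python
# Size of raw PCM audio: sample_rate * channels * (bit_depth // 8) bytes per second.
sample_rate = 44100   # Hz, MeloTTS default output
channels = 1          # mono
bit_depth = 16        # bits per sample

bytes_per_second = sample_rate * channels * bit_depth // 8
print(bytes_per_second)   # 88200 bytes for one second of audio

def pcm_duration(num_bytes, rate=44100, ch=1, bits=16):
    """Duration in seconds of a raw PCM buffer with the given format."""
    return num_bytes / (rate * ch * bits // 8)

print(pcm_duration(88200))  # 1.0
```

This is handy for estimating how long a synthesized clip is before playing it, or for sizing playback buffers.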

docs/doc/en/mllm/asr_sensevoice.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
---
title: Running the SenseVoice Model on MaixPy MaixCAM
update:
  - date: 2026-01-05
    author: lxowalle
    version: 1.0.0
    content: Added SenseVoice documentation
---

## SenseVoice Model Overview

SenseVoice is a multilingual audio recognition model that supports Chinese, English, Cantonese, Japanese, and Korean. It provides features including speech recognition, automatic language detection, emotion recognition, automatic punctuation, and streaming recognition.

## Downloading the Model

Supported models:

| Model | Platform | Memory Requirement | Description |
| ----- | -------- | ------------------ | ----------- |
| [sensevoice-maixcam2](https://huggingface.co/sipeed/sensevoice-maixcam2) | MaixCAM2 | 1 GB | |

Refer to the [Large Model User Guide](./basic.md) to download the model.

## Running the Model with MaixPy

> Note: MaixPy version `4.12.3` or later is required

### Non-Streaming Recognition

```python
from maix import sensevoice

model_path = "/root/models/sensevoice-maixcam2"
client = sensevoice.Sensevoice(model=model_path + "/model.mud", stream=False)
client.start()
if client.is_ready(block=True) is False:
    print("Failed to start service or model.")
    exit()

audio_file = "/maixapp/share/audio/demo.wav"
text = client.refer(path=audio_file)
print(text)

# Commenting out this line saves time on the next startup,
# but then the background service keeps occupying CMM memory.
client.stop()
```
Output:

```shell
开始愉快的探索吧。
```

Explanation:
- When creating the `sensevoice.Sensevoice` object, setting `stream=False` enables non-streaming recognition. The interface waits until recognition is complete and then returns the result all at once.
- When the `refer` function is called with the `path` parameter, it recognizes an audio file. Currently, only the `wav` format is supported. Audio format requirements: `16,000` Hz sample rate, mono channel, 16-bit width.
- When the `refer` function is called with the `audio_data` parameter, it recognizes PCM data of type `bytes`. The audio format requirements are the same: `16,000` Hz sample rate, mono channel, 16-bit width.
- The `start` function starts the `SenseVoice` background service, and the `stop` function stops it. Running `SenseVoice` as a background service allows multi-process operation and prevents the foreground application from being blocked during model execution.
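As the explanation notes, `refer` also accepts raw PCM via `audio_data`. Extracting those bytes from a WAV file is just a matter of stripping the header, which Python's standard `wave` module handles. A sketch using a throwaway in-memory WAV in place of a real recording (`wav_to_pcm` is our helper name, not part of the maix API):

```python
import io
import wave

def wav_to_pcm(f):
    """Read a WAV file or stream and return its raw 16-bit mono PCM bytes."""
    with wave.open(f, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1, \
            "SenseVoice expects 16 kHz mono audio"
        return w.readframes(w.getnframes())

# Demo: build a 16 kHz mono 16-bit WAV containing 100 zero samples.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 100)
buf.seek(0)

pcm = wav_to_pcm(buf)
print(len(pcm))  # 200 bytes: 100 samples * 2 bytes each
```

The resulting `pcm` bytes are what you would pass as `client.refer(audio_data=pcm)` for an audio file in the required format.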
### Streaming Recognition

```python
from maix import sensevoice

model_path = "/root/models/sensevoice-maixcam2"
client = sensevoice.Sensevoice(model=model_path + "/model.mud", stream=True)
client.start()
if client.is_ready(block=True) is False:
    print("Failed to start service or model.")
    exit()

audio_file = "/maixapp/share/audio/demo.wav"
print('start refer stream')
for text in client.refer_stream(path=audio_file):
    print(text)

# Commenting out this line saves time on the next startup,
# but then the background service keeps occupying CMM memory.
client.stop()
```

Output:

```shell
开始愉快
开始愉快的探索
开始愉快的探索吧
```

Explanation:
- When creating the `sensevoice.Sensevoice` object, setting `stream=True` enables streaming recognition. Partial recognition results are returned immediately as they become available, until the entire audio is processed.
- Other behaviors are the same as described above.
### Real-Time Speech Recognition via Microphone

In practical development, you may need to capture audio data from a microphone and pass it to the model for speech-to-text processing. Please refer to the example: [asr_sensevoice.py](https://github.com/sipeed/MaixPy/tree/main/examples/audio/asr/sensevoice/asr_sensevoice.py)
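When feeding microphone audio to the recognizer, the captured PCM is usually handed over in fixed-size pieces rather than as one large buffer. A generic chunking helper of the kind such a loop might use (a sketch; `pcm_chunks` is ours, not part of the maix API):

```python
def pcm_chunks(pcm: bytes, chunk_ms: int = 100, rate: int = 16000, sample_width: int = 2):
    """Yield successive chunks of raw mono PCM, each chunk_ms milliseconds long."""
    chunk_bytes = rate * sample_width * chunk_ms // 1000
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

# Demo: 0.5 s of 16 kHz / 16-bit silence split into 100 ms chunks.
silence = b"\x00\x00" * 8000        # 8000 samples = 0.5 s
chunks = list(pcm_chunks(silence))
print(len(chunks))                   # 5 chunks of 3200 bytes each
```

In a real capture loop, each chunk would come from the microphone recorder and be passed on to the recognizer as it arrives.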

docs/doc/en/mllm/asr_whisper.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Running the Whisper Model on MaixPy MaixCAM
update:
  - date: 2026-01-05
    author: lxowalle
    version: 1.0.0
    content: Added Whisper documentation
---

## Whisper Model Overview

Whisper is a general-purpose speech recognition model open-sourced by OpenAI, designed for tasks such as multilingual speech recognition and speech translation.
Currently, the Whisper model ported to MaixCAM2 is the `base` version. It supports input WAV audio files with mono channel and 16 kHz sample rate, and can recognize Chinese and English.

## Downloading the Model

Supported models:

| Model | Platform | Memory Requirement | Description |
| ----- | -------- | ------------------ | ----------- |
| [whisper-base-maixcam2](https://huggingface.co/sipeed/whisper-base-maixcam2) | MaixCAM2 | 1 GB | base |

Refer to the [Large Model User Guide](./basic.md) to download the model.

## Running the Model with MaixPy

Currently, only the base-size Whisper model is supported. It accepts mono, 16 kHz WAV audio files and supports Chinese and English recognition.
Below is a simple example demonstrating how to use Whisper for speech recognition:

```python
from maix import nn

whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")

wav_path = "/maixapp/share/audio/demo.wav"

res = whisper.transcribe(wav_path)

print('whisper:', res)
```

Notes:
1. First, import the nn module to create a Whisper model object:
   ```python
   from maix import nn
   ```
2. Select the model to load. Currently, only the base-size Whisper model is supported:
   ```python
   whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")
   ```
3. Prepare a mono, 16 kHz WAV audio file and run inference. The recognition result will be returned directly:
   ```python
   wav_path = "/maixapp/share/audio/demo.wav"
   res = whisper.forward(wav_path)
   print('whisper:', res)
   ```
4. Output result:
   ```shell
   whisper: 开始愉快的探索吧
   ```

By default, the model recognizes Chinese.
To recognize English, specify the `language` parameter when initializing the object:

```python
whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud", language="en")
```
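Since the model only accepts mono, 16 kHz, 16-bit WAV input, it can be worth verifying a file's format before handing it to `nn.Whisper`. A small standard-library sketch (the `check_wav_format` helper is ours, not part of the maix API; the demo uses an in-memory WAV instead of a real recording):

```python
import io
import wave

def check_wav_format(f, rate=16000, channels=1, sample_width=2):
    """True if the WAV file/stream matches the expected 16 kHz mono 16-bit PCM format."""
    with wave.open(f, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sample_width)

# Demo: build a conforming WAV in memory (0.1 s of silence) and check it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)    # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)
buf.seek(0)
print(check_wav_format(buf))  # True
```

On the device you would call it with the file path, e.g. `check_wav_format("/maixapp/share/audio/demo.wav")`, before constructing the model.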
