
Commit fa058e8

* optimize sdv1-5 method
* optimize large model docs
1 parent 8ad058d commit fa058e8

40 files changed: +1987 −686 lines

docs/doc/en/audio/recognize.md

Lines changed: 3 additions & 59 deletions
@@ -20,74 +20,18 @@ Speech recognition model support list:

|            | MaixCAM | MaixCAM Pro | MaixCAM2 |
| ---------- | ------- | ----------- | -------- |
| Whisper    | ❌      | ❌          | ✅       |
| SenseVoice | ❌      | ❌          | ✅       |
| Speech     |         |             |          |

In addition, we have ported OpenAI's Whisper speech recognition model to the `MaixCAM2`, enabling powerful speech-to-text functionality even on resource-constrained devices.

## Using Whisper for Speech-to-Text

> Note: MaixCAM and MaixCAM Pro do not support the Whisper model.

Currently, only the base version of the Whisper model is supported. It accepts single-channel WAV audio with a 16kHz sample rate and can recognize both Chinese and English.

```python
from maix import nn

whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud")

wav_path = "/maixapp/share/audio/demo.wav"

res = whisper.transcribe(wav_path)

print('whisper:', res)
```

Notes:
1. First, import the nn module to create the Whisper model object:
   ```python
   from maix import nn
   ```
2. Load the model. Currently, only the `base` version is supported:
   ```python
   whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud")
   ```
3. Prepare a WAV audio file with 1 channel and a 16kHz sample rate, and run inference. The result will be returned directly:
   ```python
   wav_path = "/maixapp/share/audio/demo.wav"
   res = whisper.forward(wav_path)
   print('whisper:', res)
   ```
4. Sample output:
   ```shell
   whisper: Have fun exploring!
   ```
5. Give it a try yourself!

By default, it recognizes Chinese. To recognize English, pass the `language` parameter when initializing the object:

```python
whisper = nn.Whisper(model="/root/models/whisper-base/whisper-base.mud", language="en")
```

The usage of Whisper can be found in [Whisper Speech Recognition Model](../mllm/asr_whisper.md)

## Using SenseVoice for Speech-to-Text

Currently, only `MaixCAM2` supports `SenseVoice`, and all SenseVoice-related code is implemented in `Python`, so only Python-side examples are provided.
By default, the system does not include the `SenseVoice` model. Please download it from [here](https://huggingface.co/sipeed/sensevoice-maixcam2) and place it in the `/root/models/` directory.

Before using it, you need to start the `sensevoice.service` service. The command is as follows:

> Note that sensevoice.service starts from the `/root/models/sensevoice-maixcam2` directory by default, so make sure the model is placed under `/root/models/`.

```shell
systemctl start sensevoice.service
```

You can also start it manually:

```shell
cd /root/models/sensevoice-maixcam2
python server.py
```

After the service is started, you can perform speech recognition via HTTP interaction.
For usage, please refer to the example: [asr_sensevoice.py](https://github.com/sipeed/MaixPy/tree/main/examples/audio/asr/sensevoice/asr_sensevoice.py)

The usage of SenseVoice can be found in [SenseVoice Speech Recognition Model](../mllm/asr_sensevoice.md)

## Maix-Speech

docs/doc/en/audio/synthesis.md

Lines changed: 2 additions & 52 deletions
@@ -22,56 +22,6 @@ TTS Support List:

TTS (Text-to-Speech) converts text into speech. You can write a piece of text and feed it to a TTS-supported model. After running the model, it will output audio data containing the spoken version of the text.
In practice, TTS is commonly used for video dubbing, navigation guidance, public announcements, and more. Simply put, TTS is “technology that reads text aloud.”

## MeloTTS

MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. Currently, the melotts-zh model is supported, which can synthesize both Chinese and English speech. However, English synthesis is not yet optimal.

The default output audio is PCM data with a sample rate of 44100 Hz, single channel, and 16-bit depth.

> Sample rate: The number of times sound is sampled per second.
>
> Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.
>
> Bit depth: The data range captured per sample. A 16-bit depth usually represents each sample as a 16-bit signed integer. Higher bit depth captures finer audio details.

```python
from maix import nn, audio

# Only MaixCAM2 supports this model.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)

melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed=0.8, language='zh')

pcm = melotts.infer('hello', output_pcm=True)
p.play(pcm)
```

Notes:
1. Import the nn module first to create a MeloTTS model object:
   ```python
   from maix import nn
   ```
2. Choose the model to load. Currently, the melotts-zh model is supported:
   - `speed` sets the playback speed
   - `language` sets the language type
   ```python
   melotts = nn.MeloTTS(model="/root/models/melotts/melotts-zh.mud", speed=0.8, language='zh')
   ```
3. Start inference:
   - The text to infer here is 'hello'
   - Set `output_pcm=True` to return PCM data
   ```python
   pcm = melotts.infer('hello', output_pcm=True)
   ```
4. Use the audio playback module to play the generated audio:
   - Make sure the sample rate matches the model's output
   - Use `p.volume(80)` to control the output volume (range: 0–100)
   - Play the PCM generated by MeloTTS with `p.play(pcm)`
   ```python
   p = audio.Player(sample_rate=sample_rate)
   p.volume(80)
   p.play(pcm)
   ```

The usage of MeloTTS can be found in [MeloTTS Text to Speech Model](../mllm/tts_melotts.md).
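The blockquote above pins down the output format (44100 Hz, mono, 16-bit), and those three numbers fully determine how large a PCM buffer is. A quick sanity check in plain Python (the `pcm_duration` helper is ours for illustration, not part of the maix API):

```python
# Size of raw PCM audio: sample_rate * channels * (bit_depth // 8) bytes per second.
sample_rate = 44100   # Hz, MeloTTS default output
channels = 1          # mono
bit_depth = 16        # bits per sample

bytes_per_second = sample_rate * channels * bit_depth // 8
print(bytes_per_second)   # 88200 bytes for one second of audio

def pcm_duration(num_bytes, rate=44100, ch=1, bits=16):
    """Duration in seconds of a raw PCM buffer with the given format."""
    return num_bytes / (rate * ch * bits // 8)

print(pcm_duration(88200))  # 1.0
```

This is handy for estimating how long a synthesized clip is before playing it, or for sizing playback buffers.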

docs/doc/en/mllm/asr_sensevoice.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
---
title: Running the SenseVoice Model on MaixPy MaixCAM
update:
  - date: 2026-01-05
    author: lxowalle
    version: 1.0.0
    content: Added SenseVoice documentation
---

## SenseVoice Model Overview

SenseVoice is a multilingual audio recognition model that supports Chinese, English, Cantonese, Japanese, and Korean. It provides features including speech recognition, automatic language detection, emotion recognition, automatic punctuation, and streaming recognition.

## Downloading the Model

Supported models:

| Model | Platform | Memory Requirement | Description |
| ----- | -------- | ------------------ | ----------- |
| [sensevoice-maixcam2](https://huggingface.co/sipeed/sensevoice-maixcam2) | MaixCAM2 | 1 GB | |

Refer to the [Large Model User Guide](./basic.md) to download the model.

## Running the Model with MaixPy

> Note: MaixPy version `4.12.3` or later is required

### Non-Streaming Recognition

```python
from maix import sensevoice

model_path = "/root/models/sensevoice-maixcam2"
client = sensevoice.Sensevoice(model=model_path + "/model.mud", stream=False)
client.start()
if client.is_ready(block=True) is False:
    print("Failed to start service or model.")
    exit()

audio_file = "/maixapp/share/audio/demo.wav"
text = client.refer(path=audio_file)
print(text)

# Commenting out this line saves time on the next startup,
# but then the background service keeps occupying CMM memory.
client.stop()
```
Output:

```shell
开始愉快的探索吧。
```

Explanation:
- When creating the `sensevoice.Sensevoice` object, setting `stream=False` enables non-streaming recognition. The interface waits until recognition is complete and then returns the result all at once.
- When the `refer` function is called with the `path` parameter, it recognizes an audio file. Currently, only the `wav` format is supported. Audio format requirements: `16,000` Hz sample rate, mono channel, 16-bit width.
- When the `refer` function is called with the `audio_data` parameter, it recognizes PCM data of type `bytes`. The audio format requirements are the same: `16,000` Hz sample rate, mono channel, 16-bit width.
- The `start` function starts the `SenseVoice` background service, and the `stop` function stops it. Running `SenseVoice` as a background service allows multi-process operation and prevents the foreground application from being blocked during model execution.
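As the explanation notes, `refer` also accepts raw PCM via `audio_data`. Extracting those bytes from a WAV file is just a matter of stripping the header, which Python's standard `wave` module handles. A sketch using a throwaway in-memory WAV in place of a real recording (`wav_to_pcm` is our helper name, not part of the maix API):

```python
import io
import wave

def wav_to_pcm(f):
    """Read a WAV file or stream and return its raw 16-bit mono PCM bytes."""
    with wave.open(f, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1, \
            "SenseVoice expects 16 kHz mono audio"
        return w.readframes(w.getnframes())

# Demo: build a 16 kHz mono 16-bit WAV containing 100 zero samples.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 100)
buf.seek(0)

pcm = wav_to_pcm(buf)
print(len(pcm))  # 200 bytes: 100 samples * 2 bytes each
```

The resulting `pcm` bytes are what you would pass as `client.refer(audio_data=pcm)` for an audio file in the required format.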
### Streaming Recognition

```python
from maix import sensevoice

model_path = "/root/models/sensevoice-maixcam2"
client = sensevoice.Sensevoice(model=model_path + "/model.mud", stream=True)
client.start()
if client.is_ready(block=True) is False:
    print("Failed to start service or model.")
    exit()

audio_file = "/maixapp/share/audio/demo.wav"
print('start refer stream')
for text in client.refer_stream(path=audio_file):
    print(text)

# Commenting out this line saves time on the next startup,
# but then the background service keeps occupying CMM memory.
client.stop()
```

Output:

```shell
开始愉快
开始愉快的探索
开始愉快的探索吧
```

Explanation:
- When creating the `sensevoice.Sensevoice` object, setting `stream=True` enables streaming recognition. Partial recognition results are returned immediately as they become available, until the entire audio is processed.
- Other behaviors are the same as described above.
### Real-Time Speech Recognition via Microphone

In practical development, you may need to capture audio data from a microphone and pass it to the model for speech-to-text processing. Please refer to the example: [asr_sensevoice.py](https://github.com/sipeed/MaixPy/tree/main/examples/audio/asr/sensevoice/asr_sensevoice.py)
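When feeding microphone audio to the recognizer, the captured PCM is usually handed over in fixed-size pieces rather than as one large buffer. A generic chunking helper of the kind such a loop might use (a sketch; `pcm_chunks` is ours, not part of the maix API):

```python
def pcm_chunks(pcm: bytes, chunk_ms: int = 100, rate: int = 16000, sample_width: int = 2):
    """Yield successive chunks of raw mono PCM, each chunk_ms milliseconds long."""
    chunk_bytes = rate * sample_width * chunk_ms // 1000
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

# Demo: 0.5 s of 16 kHz / 16-bit silence split into 100 ms chunks.
silence = b"\x00\x00" * 8000        # 8000 samples = 0.5 s
chunks = list(pcm_chunks(silence))
print(len(chunks))                   # 5 chunks of 3200 bytes each
```

In a real capture loop, each chunk would come from the microphone recorder and be passed on to the recognizer as it arrives.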

docs/doc/en/mllm/asr_whisper.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Running the Whisper Model on MaixPy MaixCAM
update:
  - date: 2026-01-05
    author: lxowalle
    version: 1.0.0
    content: Added Whisper documentation
---

## Whisper Model Overview

Whisper is a general-purpose speech recognition model open-sourced by OpenAI, designed for tasks such as multilingual speech recognition and speech translation.
Currently, the Whisper model ported to MaixCAM2 is the `base` version. It supports input WAV audio files with mono channel and 16 kHz sample rate, and can recognize Chinese and English.

## Downloading the Model

Supported models:

| Model | Platform | Memory Requirement | Description |
| ----- | -------- | ------------------ | ----------- |
| [whisper-base-maixcam2](https://huggingface.co/sipeed/whisper-base-maixcam2) | MaixCAM2 | 1 GB | base |

Refer to the [Large Model User Guide](./basic.md) to download the model.

## Running the Model with MaixPy

Currently, only the base-size Whisper model is supported. It accepts mono, 16 kHz WAV audio files and supports Chinese and English recognition.
Below is a simple example demonstrating how to use Whisper for speech recognition:

```python
from maix import nn

whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")

wav_path = "/maixapp/share/audio/demo.wav"

res = whisper.transcribe(wav_path)

print('whisper:', res)
```

Notes:
1. First, import the nn module to create a Whisper model object:
   ```python
   from maix import nn
   ```
2. Select the model to load. Currently, only the base-size Whisper model is supported:
   ```python
   whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud")
   ```
3. Prepare a mono, 16 kHz WAV audio file and run inference. The recognition result will be returned directly:
   ```python
   wav_path = "/maixapp/share/audio/demo.wav"
   res = whisper.forward(wav_path)
   print('whisper:', res)
   ```
4. Output result:
   ```shell
   whisper: 开始愉快的探索吧
   ```

By default, the model recognizes Chinese.
To recognize English, specify the `language` parameter when initializing the object:

```python
whisper = nn.Whisper(model="/root/models/whisper-base-maixcam2/whisper-base.mud", language="en")
```
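Since the model only accepts mono, 16 kHz, 16-bit WAV input, it can be worth verifying a file's format before handing it to `nn.Whisper`. A small standard-library sketch (the `check_wav_format` helper is ours, not part of the maix API; the demo uses an in-memory WAV instead of a real recording):

```python
import io
import wave

def check_wav_format(f, rate=16000, channels=1, sample_width=2):
    """True if the WAV file/stream matches the expected 16 kHz mono 16-bit PCM format."""
    with wave.open(f, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sample_width)

# Demo: build a conforming WAV in memory (0.1 s of silence) and check it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)    # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)
buf.seek(0)
print(check_wav_format(buf))  # True
```

On the device you would call it with the file path, e.g. `check_wav_format("/maixapp/share/audio/demo.wav")`, before constructing the model.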
