
Commit 3d251a4

WEIFENG2333 committed
docs: rewrite README with complete model docs and real output examples
- Update all code examples to use new Model handle API
- Add per-model documentation with real inference output
- Add pipeline compatibility table (what can/cannot combine)
- Document SenseVoiceSmall + ct-punc tag corruption issue
- Document Fun-ASR-Nano VAD incompatibility (batch_size limitation)
- Add all 19 models in registry with params and descriptions
- Add input methods section (file path, bytes, text)
- Remove outdated name-string API references

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
1 parent 9376555 commit 3d251a4

File tree

1 file changed

+305 −51 lines changed

README.md

Lines changed: 305 additions & 51 deletions
@@ -8,7 +8,7 @@ No need to pre-install Python, PyTorch, or any dependencies — `funasr-server`
- **Zero-config setup** — automatically installs Python, PyTorch (CPU/CUDA/MPS), and FunASR
- **Persistent server** — models stay loaded in memory, no repeated loading
- **All model types** — ASR, VAD, punctuation, speaker embedding, emotion recognition
- **Cross-platform** — Linux, macOS, Windows
- **China-friendly** — auto-detects network and uses Chinese mirrors when needed

@@ -25,51 +25,257 @@ asr = FunASR()
asr.ensure_installed()  # one-time setup (~2 min)
asr.start()

# Load model — returns a Model handle
model = asr.load_model("SenseVoiceSmall")

# Run inference
result = model.infer(audio="audio.wav")
print(result)
# [{"key": "audio", "text": "<|zh|><|NEUTRAL|><|Speech|><|woitn|>你好世界"}]

# Or use shorthand
result = model("audio.wav")

model.unload()
asr.stop()
```

### Context Manager

```python
with FunASR() as asr:
    model = asr.load_model("SenseVoiceSmall")
    result = model("audio.wav")
```

## Supported Models

### ASR (Speech Recognition)

#### SenseVoiceSmall

Multi-task ASR with language/emotion/event detection. 234M params, supports zh/en/ja/ko/yue.

```python
model = asr.load_model("SenseVoiceSmall")
result = model(audio="audio.wav")
```

Output:

```python
[{"key": "audio", "text": "<|zh|><|NEUTRAL|><|Speech|><|woitn|>欢迎大家来体验达摩院推出的语音识别模型"}]
```

The `text` field contains special tags: `<|language|><|emotion|><|event|><|itn|>text`.

**Inference parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `language` | `str` | Language hint: `"zh"`, `"en"`, `"ja"`, `"ko"`, `"yue"` |
| `use_itn` | `bool` | Enable inverse text normalization (adds punctuation; the tag changes to `<\|withitn\|>`) |
| `batch_size` | `int` | Batch size for processing multiple files |

```python
# With ITN enabled — adds punctuation
model = asr.load_model("SenseVoiceSmall")
result = model(audio="audio.wav", use_itn=True)
# [{"key": "audio", "text": "<|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。"}]
```

> **Note:** SenseVoiceSmall can be combined with `vad_model="fsmn-vad"` to process long audio. Do NOT combine with `punc_model="ct-punc"` — the punctuation model will corrupt the special tags in the output.
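If only the plain transcript is needed, the tag prefix can be stripped with a small helper (a sketch, not part of the library; it assumes only the `<|...|>` tag format shown above):

```python
import re

def split_sensevoice_text(text):
    """Separate the <|...|> tags from the plain transcript."""
    tags = re.findall(r"<\|([^|]+)\|>", text)
    plain = re.sub(r"<\|[^|]+\|>", "", text)
    return tags, plain

tags, plain = split_sensevoice_text("<|zh|><|NEUTRAL|><|Speech|><|woitn|>你好世界")
print(tags)   # ['zh', 'NEUTRAL', 'Speech', 'woitn']
print(plain)  # 你好世界
```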

#### Fun-ASR-Nano

End-to-end ASR with built-in punctuation and timestamps. 800M params, supports zh (7 dialects, 26 accents) + en + ja.

```python
nano = asr.load_model("Fun-ASR-Nano")
result = nano(audio="audio.wav")
```

Output:

```python
[{
    "key": "audio",
    "text": "欢迎大家来体验达摩院推出的语音识别模型。",    # with punctuation
    "text_tn": "欢迎大家来体验达摩院推出的语音识别模型",   # without punctuation
    "timestamps": [
        {"token": "", "start_time": 0.0, "end_time": 3.06},
        {"token": "", "start_time": 3.06, "end_time": 3.12},
        ...
    ]
}]
```
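Since each timestamp entry carries start and end seconds, span-level information falls out directly. A sketch against the output shape above (the result dict is abbreviated to the fields used):

```python
def audio_span(result):
    """Return the (start, end) seconds covered by the timestamped tokens."""
    ts = result[0]["timestamps"]
    return ts[0]["start_time"], ts[-1]["end_time"]

result = [{"key": "audio", "timestamps": [
    {"token": "", "start_time": 0.0, "end_time": 3.06},
    {"token": "", "start_time": 3.06, "end_time": 3.12},
]}]
print(audio_span(result))  # (0.0, 3.12)
```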

> **Note:** Fun-ASR-Nano is a standalone model. Do NOT combine it with `vad_model` or `punc_model`. Fun-ASR-Nano uses autoregressive decoding (token-by-token generation, like GPT), which only supports `batch_size=1`. FunASR's VAD pipeline (`inference_with_vad`), however, automatically sets a large batch size (by default 300 s worth of audio per batch) to process multiple VAD segments in parallel, which triggers Fun-ASR-Nano's `batch decoding is not implemented` error. This is a FunASR framework limitation, not a fundamental model constraint. Fun-ASR-Nano handles long audio end-to-end internally and does not need external VAD.
#### paraformer / paraformer-zh

Classic Paraformer ASR. 220M params. `paraformer` is for short audio (max 20s); `paraformer-zh` supports arbitrary length (with SeACo).

```python
model = asr.load_model("paraformer")
result = model(audio="audio.wav")
# [{"key": "audio", "text": "欢迎大家来体验达摩院推出的语音识别模型"}]
```

`paraformer-zh` is designed for the full pipeline:

```python
model = asr.load_model("paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
result = model(audio="long_audio.wav")
# [{"key": "audio", "text": "欢迎大家来体验达摩院推出的语音识别模型。"}]
```

### VAD (Voice Activity Detection)

#### fsmn-vad

Detects speech segments in audio. 0.4M params, 16kHz.

```python
vad = asr.load_model("fsmn-vad")
result = vad(audio="audio.wav")
```

Output:

```python
[{"key": "audio", "value": [[610, 5530]]}]
```

`value` contains a list of `[start_ms, end_ms]` pairs indicating speech segments.
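Total speech time, for instance, follows directly from those pairs:

```python
def total_speech_seconds(segments):
    """Sum [start_ms, end_ms] segment durations, in seconds."""
    return sum(end - start for start, end in segments) / 1000.0

result = [{"key": "audio", "value": [[610, 5530]]}]
print(total_speech_seconds(result[0]["value"]))  # 4.92
```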

### Punctuation

#### ct-punc

Adds punctuation to raw text. 1.1G params, supports zh + en.

```python
punc = asr.load_model("ct-punc")
result = punc(text="你好世界今天天气真好我们一起出去玩吧")
```

Output:

```python
[{"key": "...", "text": "你好,世界今天天气真好,我们一起出去玩吧。", "punc_array": [1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3]}]
```

`punc_array` values: `1` = none, `2` = comma, `3` = period.
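Given that mapping, the punctuated string can be rebuilt from the raw text and `punc_array` (a sketch; the code-to-mark mapping documented above is the only assumption):

```python
PUNC_MARKS = {1: "", 2: ",", 3: "。"}  # codes as documented: none, comma, period

def apply_punc(raw_text, punc_array):
    """Insert the punctuation mark coded at each position after its character."""
    return "".join(ch + PUNC_MARKS[code] for ch, code in zip(raw_text, punc_array))

text = "你好世界今天天气真好我们一起出去玩吧"
codes = [1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3]
print(apply_punc(text, codes))  # 你好,世界今天天气真好,我们一起出去玩吧。
```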

### Speaker Embedding

#### cam++

Extracts speaker embedding vectors. 7.2M params, outputs a 192-dim vector.

```python
spk = asr.load_model("cam++")
result = spk(audio="audio.wav")
```

Output:

```python
[{"spk_embedding": [[-0.769, 0.930, -0.338, ..., 1.158, 0.615]]}]  # 192-dim
```

Can be used for speaker verification by comparing cosine similarity between embeddings.
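A dependency-free sketch of that comparison (the 0.7 accept threshold is illustrative, not an official value; the toy 3-dim vectors stand in for real 192-dim embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# emb_a = spk(audio="a.wav")[0]["spk_embedding"][0]  # real usage: 192-dim vectors
# emb_b = spk(audio="b.wav")[0]["spk_embedding"][0]
score = cosine_similarity([1.0, 0.0, 2.0], [0.5, 0.0, 1.0])
print(score >= 0.7)  # True here: the toy vectors are parallel, score == 1.0
```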

### Emotion Recognition

#### emotion2vec_plus_base / emotion2vec_plus_large

Speech emotion recognition. Classifies into 9 emotion categories.

```python
emo = asr.load_model("emotion2vec_plus_base")
result = emo(audio="audio.wav")
```

Output:

```python
[{
    "key": "audio",
    "labels": ["生气/angry", "厌恶/disgusted", "恐惧/fearful", "开心/happy",
               "中立/neutral", "其他/other", "难过/sad", "吃惊/surprised", "<unk>"],
    "scores": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "feats": [...]  # 768-dim embedding
}]
```
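Because `labels` and `scores` are parallel lists, the dominant emotion is a `max` over their zip (the scores below are made up for the sketch):

```python
def top_emotion(result):
    """Return (label, score) with the highest score from an emotion2vec result."""
    entry = result[0]
    return max(zip(entry["labels"], entry["scores"]), key=lambda pair: pair[1])

result = [{"key": "audio",
           "labels": ["生气/angry", "开心/happy", "中立/neutral"],
           "scores": [0.1, 0.2, 0.7]}]
print(top_emotion(result))  # ('中立/neutral', 0.7)
```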

## Pipeline Combinations

Some models can be combined into a pipeline via `load_model()` parameters:

| Main Model | + vad_model | + punc_model | + spk_model | Notes |
|-----------|-------------|--------------|-------------|-------|
| `SenseVoiceSmall` | `fsmn-vad` | -- | -- | VAD for long audio. Do NOT use ct-punc (corrupts tags). |
| `paraformer-zh` | `fsmn-vad` | `ct-punc` | `cam++` | Full pipeline, official FunASR recommendation. |
| `paraformer-en-spk` | `fsmn-vad` | `ct-punc` | -- | English ASR with built-in speaker diarization. |
| `Fun-ASR-Nano` | -- | -- | -- | Standalone only. Errors if combined with VAD/punc. |
| `emotion2vec_*` | -- | -- | -- | Standalone only. |
| `cam++` | -- | -- | -- | Standalone only. |
| `ct-punc` | -- | -- | -- | Standalone only. Takes text input. |
| `fsmn-vad` | -- | -- | -- | Standalone only. |

### Pipeline example

```python
# Long Chinese audio: paraformer-zh + VAD + punctuation
model = asr.load_model("paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
result = model(audio="meeting.wav")

# Long audio: SenseVoiceSmall + VAD (no punc)
model = asr.load_model("SenseVoiceSmall", vad_model="fsmn-vad")
result = model(audio="long_audio.wav")
```

## Input Methods

All audio models accept three input types:

```python
from pathlib import Path

model = asr.load_model("SenseVoiceSmall")

# 1. File path
result = model(audio="audio.wav")

# 2. Raw bytes
audio_bytes = Path("audio.wav").read_bytes()
result = model(audio_bytes=audio_bytes)

# 3. Text (for punctuation models only)
punc = asr.load_model("ct-punc")
result = punc(text="你好世界今天天气真好")
```

## All Available Models

| Name | Type | Params | Description |
|------|------|--------|-------------|
| `SenseVoiceSmall` | asr | 234M | Multi-task ASR, zh/en/ja/ko/yue, emotion + event tags |
| `Fun-ASR-Nano` | asr | 800M | End-to-end ASR, built-in punctuation + timestamps |
| `Fun-ASR-MLT-Nano` | asr | 800M | Multilingual ASR, 31 languages |
| `paraformer` | asr | 220M | Offline, zh + en, max 20s |
| `paraformer-zh` | asr | 220M | Offline, zh + en, arbitrary length (with SeACo) |
| `paraformer-en` | asr | 220M | Offline, English |
| `paraformer-en-spk` | asr | 220M | English + built-in speaker diarization |
| `paraformer-zh-streaming` | asr | 220M | Streaming, zh + en |
| `Whisper-large-v2` | asr | 1550M | OpenAI Whisper large-v2, multilingual |
| `Whisper-large-v3` | asr | 1550M | OpenAI Whisper large-v3, multilingual |
| `Whisper-large-v3-turbo` | asr | 809M | OpenAI Whisper large-v3 turbo |
| `fsmn-vad` | vad | 0.4M | Voice activity detection, 16kHz |
| `ct-punc` | punc | 1.1G | Punctuation restoration, zh + en |
| `ct-punc-c` | punc | 291M | Punctuation restoration (compact), zh + en |
| `cam++` | spk | 7.2M | Speaker embedding, 192-dim |
| `fa-zh` | fa | 37.8M | Forced alignment / timestamp prediction, zh |
| `emotion2vec_plus_large` | emotion | 300M | Emotion recognition, 9 classes |
| `emotion2vec_plus_base` | emotion | - | Emotion recognition (base) |
| `emotion2vec_plus_seed` | emotion | - | Emotion recognition (seed) |

Model names are automatically resolved to the correct hub (ModelScope in China, HuggingFace internationally).

## API Reference

### `FunASR(runtime_dir, port, host)`
@@ -80,34 +286,82 @@ asr.stop()
| `port` | `0` (auto) | Server port |
| `host` | `"127.0.0.1"` | Bind host |

### FunASR Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `ensure_installed()` | `bool` | Install runtime (one-time). Returns `True` if already installed. |
| `start(timeout=60)` | `int` | Start the server; returns the port number. |
| `stop()` | - | Stop the server. |
| `load_model(model, ...)` | `Model` | Load a model; returns a `Model` handle. |
| `health()` | `dict` | Check server status. |
| `list_models()` | `dict` | List loaded models. |
| `execute(code)` | `dict` | Execute Python code on the server. |

### `load_model()` Parameters

```python
model = asr.load_model(
    model,                # Required: model name ("SenseVoiceSmall", "fsmn-vad", etc.)
    vad_model=None,       # VAD model for pipeline
    punc_model=None,      # Punctuation model for pipeline
    spk_model=None,       # Speaker model for pipeline
    device=None,          # "cuda" / "cpu" / None (auto)
    hub=None,             # "ms" / "hf" / None (auto)
    quantize=None,        # Enable quantization
    fp16=None,            # Enable half-precision
    batch_size=None,      # Batch size
    disable_update=None,  # Skip model update checks
)
```

### Model Methods

```python
model = asr.load_model("SenseVoiceSmall")

# Inference
result = model.infer(audio="file.wav")
result = model.infer(audio_bytes=raw_bytes)
result = model.infer(text="input text")

# Shorthand
result = model(audio="file.wav")

# Alias for ASR
result = model.transcribe(audio="file.wav")

# Unload from memory
model.unload()
```

**Inference parameters** (passed to `infer()` or `__call__()`):

| Parameter | Type | Description |
|-----------|------|-------------|
| `audio` | `str` | Path to audio file |
| `audio_bytes` | `bytes` | Raw audio bytes |
| `text` | `str` | Text input (for punctuation models) |
| `language` | `str` | Language hint (`"zh"`, `"en"`, `"ja"`, etc.) |
| `use_itn` | `bool` | Enable inverse text normalization |
| `batch_size` | `int` | Inference batch size |
| `hotword` | `str` | Hotword string for biased recognition |
| `merge_vad` | `bool` | Merge short VAD segments |
| `merge_length_s` | `float` | Max merge length in seconds (default: 15) |

## Architecture

```
Your Application
        |
        | HTTP (localhost)
        | JSON-RPC 2.0
        v
FunASR Server (background process)
        |
        |-- Models loaded in memory
        |-- Isolated Python environment (uv)
        +-- Auto GPU/CPU detection
```

The server runs in a completely isolated Python environment managed by `uv`. Your application communicates with it over HTTP using JSON-RPC 2.0 protocol.
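Under the hood each call is an ordinary JSON-RPC 2.0 POST. A minimal sketch of the envelope (the method name `health` and empty params are illustrative, not the server's documented wire contract; the response is a canned example):

```python
import json

# Shape of a JSON-RPC 2.0 request; field names come from the JSON-RPC 2.0
# spec, not from funasr-server's internal method registry.
request = {"jsonrpc": "2.0", "id": 1, "method": "health", "params": {}}
body = json.dumps(request)

# A conforming response echoes the id and carries either "result" or "error".
response = json.loads('{"jsonrpc": "2.0", "id": 1, "result": {"status": "ok"}}')
assert response["id"] == json.loads(body)["id"]  # responses are matched by id
print(response["result"])  # {'status': 'ok'}
```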
