Commit 4094235

Merge pull request #1587 from pareenaverma/content_review
Tech review of funASR LP
2 parents 668537d + 71be53f

File tree: 4 files changed, +105 −59 lines


content/learning-paths/servers-and-cloud-computing/funASR/2_modelscope.md

Lines changed: 24 additions & 10 deletions
@@ -1,5 +1,5 @@
 ---
-title: ModelScope - Open-Source AI Pre-trained AI models hub
+title: ModelScope - Open-Source Pre-trained AI models hub
 weight: 3
 
 ### FIXED, DO NOT MODIFY
@@ -8,7 +8,7 @@ layout: learningpathall
 
 ## Before you begin
 
-To follow the instructions for this Learning Path, you will need an Arm server running Ubuntu 22.04 LTS or later version with at least 8 cores, 16GB of RAM, and 30GB of disk storage.
+To follow the instructions for this Learning Path, you will need an Arm-based server running Ubuntu 22.04 LTS or a later version with at least 8 cores, 16GB of RAM, and 30GB of disk storage.
 
 ## Introduce ModelScope
 [ModelScope](https://github.com/modelscope/modelscope/) is an open-source platform that makes it easy to use AI models in your applications.
@@ -34,13 +34,13 @@ Arm provides optimized software and tools, such as Kleidi, to accelerate AI infe
 You can learn more about [Faster PyTorch Inference using Kleidi on Arm Neoverse](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/faster-pytorch-inference-kleidi-arm-neoverse) from the Arm community website.
 
 
-## Installing ModelScope
+## Install ModelScope and PyTorch
 
 First, ensure your system is up-to-date and install the required tools and libraries:
 
 ```bash
 sudo apt-get update -y
-sudo apt-get install -y curl git wget python3 python3-pip python3-venv python-is-python3
+sudo apt-get install -y curl git wget python3 python3-pip python3-venv python-is-python3 ffmpeg
 ```
 
 Create and activate a virtual environment:
@@ -49,19 +49,25 @@ python -m venv venv
 source venv/bin/activate
 ```
 
-Install related packages:
+In your active virtual environment, install modelscope:
+
+```bash
+pip3 install modelscope
+```
+
+Install PyTorch and related python dependencies:
 ```bash
 pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
 pip3 install numpy packaging addict datasets simplejson sortedcontainers transformers ffmpeg
 
 ```
 {{% notice Note %}}
-This learning path will execute models on Arm Neoverse, so we only need to install the PyTorch CPU package.
+In this learning path you will execute models on the Arm Neoverse CPU, so you will only need to install the PyTorch CPU package.
 {{% /notice %}}
 
 ## Create a sample example
 
-After completing the installation, we will use an example related to Chinese semantic understanding to illustrate how to use ModelScope.
+You can now run an example to understand how to use ModelScope for understanding Chinese semantics.
 
 There is a fundamental difference between Chinese and English writing.
 The relationship between Chinese characters and their meanings is somewhat analogous to the difference between words and phrases in English.
@@ -70,9 +76,11 @@ Some Chinese characters, like English words, have clear meanings on their own, s
 However, more often, Chinese characters need to be combined with other characters to express more complete meanings, just like phrases in English.
 For example, “祝福” (blessing) can be broken down into “祝” (wish) and “福” (good fortune); “分享” (share) can be broken down into “分” (divide) and “享” (enjoy); “生成” (generate) is composed of “生” (produce) and “成” (become).
 
-For computers to understand Chinese sentences, we need to understand the rules of Chinese characters, vocabulary, and grammar to accurately understand and express meaning.
+For computers to understand Chinese sentences, you will need to understand the rules of Chinese characters, vocabulary, and grammar to accurately understand and express meaning.
+
+In this simple example, you will use a general-domain Chinese [word segmentation model](https://www.modelscope.cn/models/iic/nlp_structbert_word-segmentation_chinese-base) to break down Chinese sentences into individual words, facilitating analysis and understanding by computers.
 
-Here ia a simple example using a general-domain Chinese [word segmentation model](https://www.modelscope.cn/models/iic/nlp_structbert_word-segmentation_chinese-base), which can break down Chinese sentences into individual words, facilitating analysis and understanding by computers.
+Using a file editor of your choice, copy the code shown below into a file named `segmentation.py`:
 
 ```python
 from modelscope.pipelines import pipeline
@@ -84,7 +92,13 @@ result = word_segmentation(text)
 print(result)
 ```
 
-The output will be like this:
+Run the model inference on the sample text:
+
+```bash
+python3 segmentation.py
+```
+
+The output should look like this:
 ```output
 2025-01-28 00:30:29,692 - modelscope - WARNING - Model revision not specified, use revision: v1.0.3
 Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/damo/nlp_structbert_word-segmentation_chinese-base
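The idea behind word segmentation can be illustrated without ModelScope at all. Below is a toy forward maximum-matching segmenter; it is only a sketch of the concept, with an assumed mini-dictionary and a sample sentence taken from later in this commit. The StructBERT model above uses a learned approach, not dictionary matching:

```python
# Toy forward maximum-matching segmenter. Illustrates the segmentation idea
# only; the dictionary below is an assumption for this sketch.
DICT = {"欢迎", "大家", "来到", "达摩", "社区", "进行", "体验"}
MAX_LEN = 2  # longest word in the mini-dictionary

def segment(text: str) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary match first, fall back to one character.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in DICT or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words

print(segment("欢迎大家来到达摩社区进行体验"))
```

Characters that form a known word are kept together; anything else falls back to a single character, which is why real models need vocabulary and grammar knowledge.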

content/learning-paths/servers-and-cloud-computing/funASR/3_funasr.md

Lines changed: 74 additions & 46 deletions
@@ -13,16 +13,16 @@ layout: learningpathall
 ## Installing FunASR
 Install FunASR using pip:
 ```bash
-pip3 install funasr
+pip3 install funasr==1.2.3
 ```
 {{% notice Note %}}
-The following content is based on tests conducted using FunASR version 1.2.3. Variations may exist in different versions.
+The learning path examples use FunASR version 1.2.3. You may notice minor differences in results with other versions.
 {{% /notice %}}
 
-## Performing Speech Recognition
-FunASR offers a simple interface for performing speech recognition tasks. You can easily transcribe audio files or implement real-time speech recognition using FunASR's functionalities. In this learning path, you will learn how to leverage FunASR to implement speech recognition application.
+## Speech Recognition
+FunASR offers a simple interface for performing speech recognition tasks. You can easily transcribe audio files or implement real-time speech recognition using FunASR's functionalities. In this learning path, you will learn how to leverage FunASR to implement a speech recognition application.
 
-Let's use a sample English speech voice as an example.
+Let's use an English speech voice sample as an example to run audio transcription on. Copy the code shown below into a file named `funasr_test1.py`:
 
 ```python
 from funasr import AutoModel
@@ -36,37 +36,41 @@ res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/Ma
 print(f"\nResult: \n{res[0]['text']}")
 ```
 
-Quick explain of the above Python code:
+Before you run this script, let's look at what the Python code is doing:
 
-AutoModel(): This is a class that provides an interface to load different AI models.
+The imported `AutoModel()` class provides an interface to load different AI models.
 
 * **model="paraformer":**
 
-Specifies the which model you'd like to load.
-In this example you will load Paraformer model, which is an end-to-end automatic speech recognition (ASR) model designed for real-time transcription.
+Specifies the model you would like to load.
+In this example you will load the Paraformer model, which is an end-to-end automatic speech recognition (ASR) model designed for real-time transcription.
 
 * **device="cpu":**
 
-Specify the model runs on the CPU (instead of a GPU).
+Specifies that the model runs on the CPU. It does not require a GPU.
 
 * **hub="ms":**
 
 Indicates that the model is sourced from the "ms" (ModelScope) hub.
 
-model.generate(): This function processes an audio file and generates a transcribed text output.
+The `model.generate()` function processes an audio file and generates a transcribed text output.
 * **input="...":**
 
-The input is an audio file URL, which is a .wav file containing spoken content.
+The input is an audio file URL, which is a .wav file containing an English audio sample.
 
-
-Since the result contains a lot of information, to make it sample, we will only list the content of res[0]['text'].
+The result contains a lot of information. To keep the example simple, you will only list the transcribed text contained in res[0]['text'].
 
 In this initial test, a two-second English audio clip from the internet is used for the Paraformer model to run inference on.
 
-Copy the Python code and execute in your Arm Neoverse, the result will looks like:
+Run this Python script on your Arm-based server:
 
-```output
+```bash
 python funasr_test1.py
+```
+
+The output will look like:
+
+```output
 funasr version: 1.2.3.
 Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
 You are using the latest version of funasr-1.2.3
@@ -91,9 +95,9 @@ Result:
 he tried to think how it could be
 ```
 
-The output shows "he tried to think how it could be" as expected.
+The transcribed text shows "he tried to think how it could be". This is the expected result for the audio sample.
 
-After understanding the basic usage, let's use a Chinese model.
+Now let's try an example that uses a Chinese speech recognition model. Copy the code shown below into a file named `funasr_test2.py`:
 
 ```python
 import os
@@ -111,16 +115,22 @@ res = model.generate(input=wav_file)
 text_content = res[0]['text'].replace(" ","")
 print(f"Result: \n{text_content}")
 
-pring(res)
+print(res)
 ```
 
-You can see that the executed model has been replaced with a model that has a 'zh' suffix.
+You can see that the loaded model has been replaced with a Chinese speech recognition model that has a `-zh` suffix.
 
-FunASR will recognise each sound in the speech with appropriate character recognition.
+FunASR will process each sound in the audio with appropriate character recognition.
 
-We'd like to slightly modify the output format. In addition to recognizing Chinese characters, we'll also add timestamps indicating the start and end times of each character. This will facilitate applications such as subtitle generation and sentiment analysis.
+You have also modified the output format from the previous example. In addition to recognizing the Chinese characters, you will add timestamps indicating the start and end times of each character. This is used for applications like subtitle generation and sentiment analysis.
 
-The output should be looks like:
+Run the Python script:
+
+```bash
+python3 funasr_test2.py
+```
+
+The output should look like:
 
 ```output
 Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
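The per-character timestamps mentioned above feed naturally into subtitle generation. A sketch of that post-processing step follows; the characters and `[start_ms, end_ms]` pairs below are made-up illustrative values, not real model output:

```python
# Sketch: convert per-character timestamps into SRT subtitle entries.
# The sample characters and millisecond values are assumptions for illustration.

def ms_to_srt(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{milli:03}"

def to_srt(chars, timestamps):
    """Emit one numbered SRT entry per recognized character."""
    entries = []
    for i, (ch, (start, end)) in enumerate(zip(chars, timestamps), 1):
        entries.append(f"{i}\n{ms_to_srt(start)} --> {ms_to_srt(end)}\n{ch}\n")
    return "\n".join(entries)

print(to_srt(["欢", "迎"], [[380, 540], [540, 770]]))
```

In practice you would group several characters per subtitle line; one entry per character keeps the sketch short.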
@@ -137,13 +147,14 @@ Result:
 
 The output shows "欢迎大家来到达摩社区进行体验" as expected.
 
-You can also observe that the spacing between the third and sixth characters is very short. This is because they are combined with other characters, as discussed in the previous session.
+You can also observe that the spacing between the third and sixth characters is very short. This is because they are combined with other characters, as discussed in the previous section.
 
-The speech processing pipeline can be conceptualized as follows: the output of the speech recognition module serves as the input for the semantic segmentation model, enabling us to validate the accuracy of the recognized results.
+You can now build a speech processing pipeline. The output of the speech recognition module serves as the input for the semantic segmentation model, enabling you to validate the accuracy of the recognized results. Copy the code shown below into a file named `funasr_test3.py`:
 
 ```python
 from funasr import AutoModel
 from modelscope.pipelines import pipeline
+import os
 
 model = AutoModel(
 model="paraformer-zh",
@@ -164,8 +175,13 @@ seg_result = word_segmentation(text_content)
 
 print(f"Result: \n{seg_result}")
 ```
+Run this Python script:
 
-The output will be looks like:
+```bash
+python3 funasr_test3.py
+```
+
+The output should look like:
 
 ```output
 Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
@@ -196,17 +212,17 @@ Result:
 {'output': ['欢迎', '大家', '来到', '达摩', '社区', '进行', '体验']}
 ```
 
-Good, the result exactly what we were looking for.
+Good, the result is exactly what you are looking for.
 
 ## Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
 
-Now, I'd like to introduce more advance speech recognition model, [Paraformer](https://aclanthology.org/2020.wnut-1.18/).
+Let's now look at a more advanced speech recognition model, [Paraformer](https://aclanthology.org/2020.wnut-1.18/).
 
 Paraformer is a novel architecture for automatic speech recognition (ASR) that offers both enhanced speed and accuracy compared to traditional models. Its key innovation lies in its parallel transformer design, enabling simultaneous processing of multiple parts of the input speech. This parallel processing capability leads to significantly faster inference, making Paraformer well-suited for real-time ASR applications where responsiveness is crucial.
 
 Furthermore, Paraformer has demonstrated state-of-the-art accuracy on several benchmark datasets, showcasing its effectiveness in accurately transcribing speech. This combination of speed and accuracy makes Paraformer a promising advancement in the field of ASR, opening up new possibilities for high-performance speech recognition systems.
 
-Paraformer has been fully integrated into FunASR. Here is a sample program.
+Paraformer has been fully integrated into FunASR. Copy the sample program shown below into a file named `paraformer.py`.
 
 This example uses the PyTorch-optimized Paraformer model from ModelScope. The program will first check if the test audio file has been downloaded.
@@ -240,8 +256,13 @@ rec_result = inference_pipeline(input=filename)
 
 print(f"\nResult: \n{rec_result[0]['text']}")
 ```
+Run this Python script:
+
+```bash
+python3 paraformer.py
+```
 
-When you execute the code, the output will looks like:
+The output should look like:
 
 ```output
 2025-01-28 00:03:24,373 - modelscope - INFO - Use user-specified model revision: v2.0.4
@@ -273,17 +294,15 @@ The output shows "飞机穿过云层眼下一片云海有时透过稀薄的云
 
 ## Punctuation Restoration
 
-In the previous example, the speech of each word was correctly recognized, but the lack of punctuation hindered understanding the speaker's intended expression.
+In the previous example, the speech of each word was correctly recognized, but it lacked punctuation. The lack of punctuation hinders our understanding of the speaker's intended expression.
 
-Therefore, we can add a [Punctuation Restoration model](https://aclanthology.org/2020.wnut-1.18/) responsible for punctuation as the next stage in the audio workload.
+You can add a [Punctuation Restoration model](https://aclanthology.org/2020.wnut-1.18/) responsible for punctuation as the next step in processing your audio workload.
 
-In the example above, we continue to use the Paraformer model from the previous example and add two ModelScope's model:
+In addition to using the Paraformer model, you will add two more ModelScope models:
 - VAD ([Voice Activity Detection](https://modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary)) and
 - PUNC ([Punctuation Restoration](https://modelscope.cn/models/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files))
 
-by add `vad_model` and `punc_model` parameters in the later stages.
-
-This way, we can obtain punctuation that matches the semantics of the speech recognition.
+This way, you can obtain punctuation that matches the semantics of the speech recognition. Copy the updated code shown below into a file named `paraformer-2.py`:
 
 ```python
 import os
@@ -320,11 +339,17 @@ print(f"\nResult: \n{rec_result[0]['text']}")
 ```
 
 {{% notice Note %}}
-vad_model_revision & punc_model_revision are not a required parameter. In most cases, it can work smoothly without specifying the version.
+vad_model_revision & punc_model_revision are optional parameters. In most cases, your models should work without specifying the version.
 {{% /notice %}}
 
+Run the updated Python script:
 
-The entire speech is correctly segmented into four parts based on semantics.
+```bash
+python3 paraformer-2.py
+```
+
+The entire speech sample is correctly segmented into four parts based on semantics.
 
 ```output
 rtf_avg: 0.047: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.45it/s]
@@ -335,7 +360,7 @@ Result:
 飞机穿过云层，眼下一片云海，有时透过稀薄的云雾，依稀可见南国葱绿的群山大地。
 ```
 
-Simply translate this recognized result, and you can easily see that the four sentences represent different meanings.
+Let's translate this recognized result, and you can easily see that the four sentences represent different meanings.
 
 "飞机穿过云层" means: The airplane passed through the clouds.
 
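The claim that the punctuated transcription splits into four semantic parts can be checked mechanically. A sketch using the result string from the output above:

```python
import re

# Split the punctuated transcription on the marks restored by the PUNC model.
text = "飞机穿过云层，眼下一片云海，有时透过稀薄的云雾，依稀可见南国葱绿的群山大地。"
parts = [p for p in re.split("[，。]", text) if p]
print(len(parts))  # 4
print(parts[0])    # 飞机穿过云层
```

Each part corresponds to one of the four translated sentences discussed here.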
@@ -348,11 +373,11 @@ Simply translate this recognized result, and you can easily see that the four se
 
 
 ## Sentiment Analysis
-FunASR also supports sentiment analysis of speech, allowing you to determine the emotional tone of spoken language.
+FunASR also supports sentiment analysis of speech, allowing you to determine the emotional tone of the spoken language.
 
 This can be valuable for applications like customer service and social media monitoring.
 
-We use a mature speech emotion recognition model [emotion2vec+](https://modelscope.cn/models/iic/emotion2vec_plus_large) on ModelScope as an example.
+You can use a mature speech emotion recognition model [emotion2vec+](https://modelscope.cn/models/iic/emotion2vec_plus_large) from ModelScope as an example.
 
 The model will identify which of the following emotions is the closest match for the emotion expressed in the speech:
 - Neutral
@@ -361,6 +386,7 @@ The model will identify which of the following emotions is the closest match for
 - Angry
 - Unknow
 
+Copy the code shown below into a file named `sentiment.py`:
 
 ```python
 from modelscope.pipelines import pipeline
@@ -410,10 +436,16 @@ process_audio_file(
 'https://utoronto.scholaris.ca/bitstreams/5ce257a3-be71-41a8-8d88-d097ca15af4e/download'
 )
 
+```
+Run this script:
+
+```bash
+python3 sentiment.py
 ```
 
-Without a model that understands semantics, emotion2vec+ can still correctly recognize the speaker's emotions through changes in intonation.
+Without a model that understands semantics, `emotion2vec+` can still correctly recognize the speaker's emotions through changes in intonation.
 
+The output should look like:
 
 ```output
 Neutral Chinese Speech
@@ -427,9 +459,5 @@ rtf_avg: 1.444: 100%|███████████████████
 Result: ['生气/angry (1.00)', '中立/neutral (0.00)', '开心/happy (0.00)', '难过/sad (0.00)', '<unk> (0.00)']
 ```
 
-## Best Price-Performance for ASR on Arm Neoverse N2
-Arm CPUs, with their high performance and low power consumption, provide an ideal platform for running ModelScope's AI models, especially in edge computing scenarios. Arm's comprehensive software ecosystem supports the development and deployment of ModelScope models, enabling developers to create innovative and efficient applications.
-You can learn more about [Kleidi Technology Delivers Best Price-Performance for ASR on Arm Neoverse N2](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/neoverse-n2-delivers-leading-price-performance-on-asr) from the Arm community blog.
-
 ## Conclusion
 ModelScope and FunASR empower developers to build robust Chinese ASR applications. By leveraging the strengths of Arm CPUs and the optimized software ecosystem, developers can create innovative and efficient solutions for various use cases. Explore the capabilities of ModelScope and FunASR, and unlock the potential of Arm technology for your next Chinese ASR project.