AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md
The [CommonVoice](https://commonvoice.mozilla.org/) dataset is used to train an Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN). This is implemented in the [Hugging Face SpeechBrain](https://huggingface.co/SpeechBrain) library. Additionally, a small Convolutional Recurrent Deep Neural Network (CRDNN) pretrained on the LibriParty dataset is used to process audio samples and output the segments where speech activity is detected.
After you have downloaded the CommonVoice dataset, the data must be preprocessed by converting the MP3 files into WAV format and separating them into training, validation, and testing sets.
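The train/validation/test split can be sketched as follows. This is an illustrative stand-in, not the sample's actual preprocessing script, and the split ratios are assumptions; the MP3-to-WAV conversion itself would typically be done with a tool such as ffmpeg or torchaudio (not shown here).

```python
import random

def split_dataset(wav_files, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle the converted WAV files and split them into
    training, validation, and testing lists (illustrative)."""
    files = list(wav_files)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_frac)
    n_valid = int(len(files) * valid_frac)
    train = files[:n_train]
    valid = files[n_train:n_train + n_valid]
    test = files[n_train + n_valid:]
    return train, valid, test

clips = [f"clip_{i:04d}.wav" for i in range(100)]
train, valid, test = split_dataset(clips)
print(len(train), len(valid), len(test))  # 80 10 10
```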
The model is then trained from scratch using the Hugging Face SpeechBrain library. This model is then used for inference on the testing dataset or a user-specified dataset. There is an option to utilize SpeechBrain's Voice Activity Detection (VAD), where only the speech segments from the audio files are extracted and combined before samples are randomly selected as input into the model. To improve performance, you can quantize the trained model to INT8 using Intel® Neural Compressor (INC), which decreases inference latency.
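The VAD-based selection step can be illustrated with a small sketch: given the speech segments a VAD reports, the combined speech region is treated as one timeline and fixed-length windows are drawn from it at random. The helper and the segment times are hypothetical; the real pipeline uses SpeechBrain's VAD rather than this stand-in.

```python
import random

def sample_speech_windows(segments, window, n_samples, seed=0):
    """Given VAD speech segments as (start, end) times in seconds,
    treat the concatenated speech as one timeline of length `total`
    and draw random fixed-length windows from it (illustrative)."""
    total = sum(end - start for start, end in segments)
    rng = random.Random(seed)
    windows = []
    for _ in range(n_samples):
        start = rng.uniform(0, max(0.0, total - window))
        windows.append((start, start + window))
    return windows

segs = [(0.5, 4.0), (6.2, 9.7)]  # speech regions found by a VAD (example values)
wins = sample_speech_windows(segs, window=3.0, n_samples=2)
```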
The sample contains three discrete phases:
For both training and inference, you can run the sample and scripts in Jupyter Notebook.
## Prepare the Environment
### Create and Set Up Environment
1. Create your conda environment by following the instructions on the Intel [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). You can follow these settings:
```
cd oneAPI-samples/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification
source initialize.sh
```
### Download the CommonVoice Dataset
>**Note**: You can skip downloading the dataset if you already have a pretrained model and only want to run inference on custom data samples that you provide.
First, change to the `Dataset` directory.
```
cd ./Dataset
```
The `get_dataset.py` script downloads the Common Voice dataset by doing the following:
- Gets the train set of the [Common Voice dataset from Hugging Face](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) for Japanese and Swedish
- Downloads each MP3 file and moves it to the `output_dir` folder
1. If you want to add additional languages, then modify the `language_to_code` dictionary in the file to reflect the languages to be included in the model.
| Argument | Description |
| --- | --- |
| `--output_dir` | Base output directory for saving the files. Default is `/data/commonVoice`. |
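The download-and-organize behavior described above can be sketched as follows. The helper is hypothetical (the real `get_dataset.py` may structure things differently); only the Japanese and Swedish entries of `language_to_code` are confirmed by this sample, and the per-language folder layout is an assumption.

```python
import shutil
from pathlib import Path

# Language names mapped to Common Voice locale codes, in the spirit of the
# `language_to_code` dictionary in get_dataset.py; extend this to add languages.
language_to_code = {"japanese": "ja", "swedish": "sv-SE"}

def organize_clips(downloaded, output_dir):
    """Move downloaded clips into one subfolder per language under output_dir.
    `downloaded` maps a language name to a list of clip paths (illustrative)."""
    output_dir = Path(output_dir)
    for language, clips in downloaded.items():
        dest = output_dir / language
        dest.mkdir(parents=True, exist_ok=True)
        for clip in clips:
            shutil.move(str(clip), str(dest / Path(clip).name))
    return output_dir
```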
Once the dataset is downloaded, navigate back to the parent directory:
```
cd ..
```
## Train the Model with Languages
This section explains how to train a model for language identification using the CommonVoice dataset. It covers how to preprocess the data, train the model, and prepare the output files for inference.