Commit 96024c4

change dataset download to after env setup and reference script usage
1 parent 6511f34 commit 96024c4

File tree

1 file changed (+29 -42 lines)
  • AI-and-Analytics/End-to-end-Workloads/LanguageIdentification


AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md

Lines changed: 29 additions & 42 deletions
@@ -25,8 +25,6 @@ Spoken audio comes in different languages and this sample uses a model to identi
 
 The [CommonVoice](https://commonvoice.mozilla.org/) dataset is used to train an Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN). This is implemented in the [Hugging Face SpeechBrain](https://huggingface.co/SpeechBrain) library. Additionally, a small Convolutional Recurrent Deep Neural Network (CRDNN) pretrained on the LibriParty dataset is used to process audio samples and output the segments where speech activity is detected.
 
-After you have downloaded the CommonVoice dataset, the data must be preprocessed by converting the MP3 files into WAV format and separated into training, validation, and testing sets.
-
 The model is then trained from scratch using the Hugging Face SpeechBrain library. This model is then used for inference on the testing dataset or a user-specified dataset. There is an option to utilize SpeechBrain's Voice Activity Detection (VAD), where only the speech segments from the audio files are extracted and combined before samples are randomly selected as input into the model. To improve performance, the user may quantize the trained model to INT8 using Intel® Neural Compressor (INC) to decrease latency.
 
 The sample contains three discrete phases:
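
The overview above notes that the trained model can be quantized to INT8 with Intel® Neural Compressor (INC) to decrease latency. For orientation only, the sketch below shows what post-training quantization with INC 2.x typically looks like; the toy model, the dynamic quantization approach, and the save path are assumptions, not the sample's actual quantization code.

```python
# Illustrative sketch only (not the sample's own code): post-training dynamic
# quantization with Intel Neural Compressor 2.x. The toy model and save path
# are placeholders standing in for the trained ECAPA-TDNN classifier.
import torch
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Placeholder model; in the sample, this would be the trained language-ID model.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),  # e.g. two target languages: Japanese, Swedish
)

# "dynamic" post-training quantization needs no calibration data; a "static"
# approach would also pass calib_dataloader=<DataLoader of sample inputs>.
conf = PostTrainingQuantConfig(approach="dynamic")
q_model = fit(model=model, conf=conf)

# Persist the INT8 model for lower-latency inference.
q_model.save("./quantized_model")
```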
@@ -39,46 +37,6 @@ For both training and inference, you can run the sample and scripts in Jupyter N
 
 ## Prepare the Environment
 
-### Download the CommonVoice Dataset
-
->**Note**: You can skip downloading the dataset if you already have a pretrained model and only want to run inference on custom data samples that you provide.
-
-Download the CommonVoice dataset for languages of interest from [https://commonvoice.mozilla.org/en/datasets](https://commonvoice.mozilla.org/en/datasets).
-
-For this sample, you will need to download the following languages: **Japanese** and **Swedish**. Follow Steps 1-6 below or you can execute the code.
-
-1. On the CommonVoice website, select the Version and Language.
-2. Enter your email.
-3. Check the boxes, and right-click on the download button to copy the link address.
-4. Paste this link into a text editor and copy the first part of the URL up to ".tar.gz".
-5. Use **GNU wget** on the URL to download the data to `/data/commonVoice` or a folder of your choice.
-
-Alternatively, you can use a directory on your local drive due to the large amount of data.
-
-6. Extract the compressed folder, and rename the folder with the language (for example, English).
-
-The file structure **must match** the `LANGUAGE_PATHS` defined in `prepareAllCommonVoice.py` in the `Training` folder for the script to run properly.
-
-These commands illustrate Steps 1-6. Notice that it downloads Japanese and Swedish from CommonVoice version 11.0.
-```
-# Create the commonVoice directory under 'data'
-sudo chmod 777 -R /data
-cd /data
-mkdir commonVoice
-cd commonVoice
-
-# Download the CommonVoice data
-wget \
-https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-11.0-2022-09-21/cv-corpus-11.0-2022-09-21-ja.tar.gz \
-https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-11.0-2022-09-21/cv-corpus-11.0-2022-09-21-sv-SE.tar.gz
-
-# Extract and organize the CommonVoice data into respective folders by language
-tar -xf cv-corpus-11.0-2022-09-21-ja.tar.gz
-mv cv-corpus-11.0-2022-09-21 japanese
-tar -xf cv-corpus-11.0-2022-09-21-sv-SE.tar.gz
-mv cv-corpus-11.0-2022-09-21 swedish
-```
-
 ### Create and Set Up Environment
 
 1. Create your conda environment by following the instructions on the Intel [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). You can follow these settings:
@@ -114,6 +72,35 @@ cd oneAPI-samples/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification
 source initialize.sh
 ```
 
+### Download the CommonVoice Dataset
+
+>**Note**: You can skip downloading the dataset if you already have a pretrained model and only want to run inference on custom data samples that you provide.
+
+First, change to the `Dataset` directory.
+```
+cd ./Dataset
+```
+
+The `get_dataset.py` script downloads the Common Voice dataset by doing the following:
+
+- Gets the train set of the [Common Voice dataset from Hugging Face](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) for Japanese and Swedish
+- Downloads each MP3 file and moves it to the `output_dir` folder
+
+1. If you want to add additional languages, modify the `language_to_code` dictionary in the file to reflect the languages to be included in the model.
+
+2. Run the script with options:
+```bash
+python get_dataset.py --output_dir ${COMMON_VOICE_PATH}
+```
+| Parameters | Description |
+| :--- | :--- |
+| `--output_dir` | Base output directory for saving the files. Default is `/data/commonVoice`. |
+
+Once the dataset is downloaded, navigate back to the parent directory:
+```
+cd ..
+```
+
 ## Train the Model with Languages
 
 This section explains how to train a model for language identification using the CommonVoice dataset. It includes steps to preprocess the data, train the model, and prepare the output files for inference.
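
For readers who want a feel for what the new `get_dataset.py` step does, here is a hypothetical sketch of that kind of download logic, assuming the Hugging Face `datasets` library and authenticated access to the gated `mozilla-foundation/common_voice_11_0` dataset. The real script in the sample may be structured differently.

```python
# Hypothetical sketch of the download logic described above -- not the actual
# get_dataset.py shipped with the sample. Assumes the Hugging Face `datasets`
# library and that you have accepted the Common Voice terms and logged in via
# `huggingface-cli login` (the dataset is gated).
import os
import shutil

from datasets import load_dataset

# Mirrors the language_to_code idea mentioned in the README (names assumed).
language_to_code = {"japanese": "ja", "swedish": "sv-SE"}
output_dir = "/data/commonVoice"  # matches the --output_dir default

for language, code in language_to_code.items():
    # Fetch the train split for this language from Hugging Face.
    train_set = load_dataset(
        "mozilla-foundation/common_voice_11_0", code, split="train"
    )
    target = os.path.join(output_dir, language)
    os.makedirs(target, exist_ok=True)
    for sample in train_set:
        # Each record's "path" field points at the locally cached MP3 clip.
        shutil.copy(sample["path"], target)
```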
