Phoneme Detection and Classifier Model Codes #238
Open
AnirudhBHarish wants to merge 8 commits into microsoft:master from AnirudhBHarish:kws_phoneme
c43e777 Phoneme detection and classifier model codes (AnirudhBHarish)
7203332 Add license (AnirudhBHarish)
43a9b37 Remove redundant functions (AnirudhBHarish)
113ab23 finish documenting kwscnn (AnirudhBHarish)
35e0159 Fix typos (AnirudhBHarish)
11b718e Fix typos and punctuation (AnirudhBHarish)
ecd1d09 Minor modifications to comments and punctuation (AnirudhBHarish)
8c78dad Incorporate reviewer comments (AnirudhBHarish)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# Phoneme-based Keyword Spotting (KWS) | ||
|
||
# Project Description | ||
Existing KWS systems have two major issues: (a) they are not robust to heavy background noise and random utterances, and (b) they require collecting a lot of data, which makes it hard to add a new keyword. To tackle these issues from a different perspective, we propose a new two-stage scheme: a model first predicts phonemes, and the predicted phonemes are then used for phoneme-based keyword classification. | ||
|
||
First, we train a phoneme classification model that produces the phoneme transcription of the input speech snippet. To train this phoneme classifier, we use a large public speech dataset such as LibriSpeech. The public dataset can be aligned (that is, we can obtain the phoneme labels for each speech snippet in the data) using the Montreal Forced Aligner. We also add reverberation and additive noise to the speech samples from the public dataset to make the phoneme classifier robust to various accents, background noise and varied environments. In this project, we predict a phoneme every 10 ms, which is standard practice. You can find the aligned LibriSpeech dataset we used for training here. | ||
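As a rough illustration of the 10 ms prediction rate, the sketch below counts how many phoneme predictions a snippet would yield. It assumes 16 kHz audio and ignores any windowing or padding at the snippet edges; the function names are hypothetical and are not taken from the training code.

```python
SAMPLE_RATE_HZ = 16000  # assumed sampling rate of the audio
FRAME_STEP_MS = 10      # one phoneme prediction every 10 ms


def samples_per_frame_step(sample_rate_hz=SAMPLE_RATE_HZ, step_ms=FRAME_STEP_MS):
    """Number of audio samples covered by one prediction step."""
    return sample_rate_hz * step_ms // 1000


def num_phoneme_predictions(num_samples, sample_rate_hz=SAMPLE_RATE_HZ, step_ms=FRAME_STEP_MS):
    """Approximate number of phoneme predictions for a snippet (edge effects ignored)."""
    return num_samples // samples_per_frame_step(sample_rate_hz, step_ms)


one_second = num_phoneme_predictions(SAMPLE_RATE_HZ)
print(samples_per_frame_step(), one_second)
```

Under these assumptions, each step covers 160 samples and a one-second snippet yields 100 predictions.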
|
||
In the second stage, we use the phoneme outputs predicted by the phoneme classifier to predict the input keyword. We train a single-layer FastGRNN classifier that takes the phoneme transcription as input and predicts the keyword. Since the phoneme classifier has been trained to account for diverse accents, background noise and environments, the keyword classifier can be trained using a small number of Text-To-Speech (TTS) samples generated with any standard TTS API from cloud services such as Azure, Google Cloud or AWS. | ||
|
||
This gives two advantages: (a) the phoneme model is trained to account for diverse accents and background noise settings, so training the flexible keyword classifier requires only a small number of keyword samples, and (b) empirically, this method was able to detect keywords from as far as 9 ft away. Further, the phoneme model is small, at around 250k parameters, and can fit on a Cortex M7 microcontroller. | ||
|
||
# Training the Phoneme Classifier | ||
1) Train a phoneme classification model on a public speech dataset such as LibriSpeech. | ||
2) The training speech dataset can be labelled using the Montreal Forced Aligner. | ||
3) Speech snippets are convolved with reverberation files, and additive noise from YouTube or other open sources is added. | ||
4) We also add white Gaussian noise at various SNRs. | ||
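The augmentation steps above can be sketched as follows. This is a minimal illustration using NumPy, not the data-loader code from this PR: the helper names, the toy sine wave standing in for speech, and the two-tap impulse response are all assumptions made for the example.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR (in dB), then add it."""
    noise = np.resize(noise, speech.shape)  # repeat or trim the noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against all-zero noise
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s stand-in for speech
white = rng.standard_normal(16000)                           # white Gaussian noise

noisy = add_noise_at_snr(speech, white, snr_db=10)
reverberant = add_reverb(speech, np.array([1.0, 0.5]))       # toy two-tap impulse response

# Check that the mixture really sits at the requested SNR.
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(round(measured_snr_db, 3))
```

Scaling the noise rather than the speech keeps the speech level fixed across SNRs, which is one common choice; the actual pipeline may normalize differently.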
|
||
# Training the KWS Model | ||
1) Our method takes a speech snippet as input and passes it through the phoneme classifier. | ||
2) Keywords are detected by training a keyword classifier over the detected phonemes. | ||
3) For training the keyword classifier, we use the Azure and Google Text-To-Speech APIs to get the training data (keyword snippets). | ||
4) For example, if you want to train a keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named after the keyword. More details about how the generated TTS data should be stored are given below in the sample use case for classifier model training. | ||
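To make the one-folder-per-keyword layout concrete, here is a small sketch that walks such a folder and assigns a label index to each keyword. The function name and the alphabetical label order are assumptions for illustration; the data-loader in this PR may index keywords differently.

```python
import os
import tempfile


def list_keyword_samples(data_folder):
    """Return (wav_path, label_index) pairs and the sorted keyword list."""
    keywords = sorted(
        d for d in os.listdir(data_folder)
        if os.path.isdir(os.path.join(data_folder, d))
    )
    samples = []
    for label, keyword in enumerate(keywords):
        keyword_dir = os.path.join(data_folder, keyword)
        for name in sorted(os.listdir(keyword_dir)):
            if name.lower().endswith(".wav"):
                samples.append((os.path.join(keyword_dir, name), label))
    return samples, keywords


# Build a tiny throwaway layout (two keywords, one empty .wav each) to show the idea.
with tempfile.TemporaryDirectory() as root:
    for keyword in ("yes", "no"):
        os.makedirs(os.path.join(root, keyword))
        open(os.path.join(root, keyword, "sample_0.wav"), "wb").close()
    samples, keywords = list_keyword_samples(root)

print(keywords, [label for _, label in samples])
```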
|
||
# Sample Use Cases | ||
|
||
## Phoneme Model Training | ||
The following command can be used to instantiate and train the phoneme model. | ||
``` | ||
python train_phoneme.py --base_path=/path/to/librispeech_data/ --rir_base_path=/path/to/reverb_files/ --additive_base_path=/path/to/additive_noises/ --snr_samples="0,5,10,25,100,100" --rir_chance=0.5 | ||
``` | ||
Some important command line arguments: | ||
1) base_path : Path to the speech data folder. The data in this folder should be organized as expected by the data-loader code written here. | ||
2) rir_base_path, additive_base_path : Paths to the reverb and additive noise files. | ||
3) snr_samples : List of SNRs at which the additive noise is to be added. | ||
4) rir_chance : Probability with which reverberation is applied to a given speech sample. | ||
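The sketch below shows one plausible way these two arguments could drive the augmentation. It assumes that an SNR is drawn uniformly from the parsed list for each sample (so a value listed twice, like 100 above, would be drawn twice as often) and that reverberation is applied when a uniform random draw falls below rir_chance. The helper names are hypothetical, and the actual training script may sample differently.

```python
import random


def parse_snr_samples(arg):
    """Turn a comma-separated string such as "0,5,10" into a list of floats."""
    return [float(s) for s in arg.split(",") if s.strip()]


def choose_augmentation(snr_levels, rir_chance, rng):
    """Decide, for one speech sample, whether to reverberate and which SNR to use."""
    apply_reverb = rng.random() < rir_chance
    snr_db = rng.choice(snr_levels)
    return apply_reverb, snr_db


snr_levels = parse_snr_samples("0,5,10,25,100,100")
rng = random.Random(0)
draws = [choose_augmentation(snr_levels, 0.5, rng) for _ in range(1000)]
reverb_fraction = sum(reverb for reverb, _ in draws) / len(draws)
print(snr_levels, reverb_fraction)
```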
|
||
## Classifier Model Training | ||
The following command can be used to instantiate and train the classifier model. | ||
``` | ||
python train_classifier.py --base_path=/path/to/train_and_test_data_folders/ --train_data_folders=google30_azure_tts,google30_google_tts --test_data_folders=google30_test --phoneme_model_load_ckpt=/path/to/checkpoint/x.pt --rir_base_path=/mnt/reverb_noise_sampled/ --additive_base_path=/mnt/add_noises_sampled/ --synth | ||
``` | ||
Some important command line arguments: | ||
|
||
1) base_path : Path to the train and test data folders. | ||
2) train_data_folders, test_data_folders : These folders should contain the .wav files for each keyword in a separate subfolder, as expected by the data-loader here. | ||
3) phoneme_model_load_ckpt : The full path of the checkpoint file used to load the weights into the instantiated phoneme model. | ||
4) rir_base_path, additive_base_path : Paths to the reverb and additive noise files. | ||
5) synth : Boolean flag specifying whether reverberation and noise addition are to be applied. | ||
|
||
Copyright (c) Microsoft Corporation. All rights reserved. | ||
Licensed under the MIT license. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Auxiliary Files to help Download and Prepare the Data | ||
|
||
## YouTube Additive Noise | ||
Run the following command to download the CSV file that lists the YouTube additive noise data: | ||
|
||
``` | ||
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv | ||
``` | ||
Then run the extraction script to download the actual data: | ||
``` | ||
python download_youtube_data.py --csv_file=/path/to/csv_file.csv --target_folder=/path/to/target/folder/ | ||
``` | ||
|
||
Please check [Google's Audioset data page](https://research.google.com/audioset/download.html) for further details. | ||
|
||
The downloaded files need to be converted to 16 kHz for our pipeline. Run the following command to do so: | ||
``` | ||
python convert_sampling_rate.py --source_folder=/path/to/source/folder/ --target_folder=/path/to/target/16KHz_folder/ --fs=16000 --log_rate=100 | ||
``` | ||
The script can convert the sampling rate of any .wav file to the specified --fs, but for our applications we use 16 kHz only. | ||
ShikharJ marked this conversation as resolved.
|
||
Use --log_rate to choose how often progress is logged during the sample rate conversion: a log line is printed once every log_rate files. | ||
|
||
Copyright (c) Microsoft Corporation. All rights reserved. | ||
Licensed under the MIT license. |
45 changes: 45 additions & 0 deletions
applications/KWS_Phoneme/auxiliary_files/convert_sampling_rate.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
# Copyright (c) Microsoft Corporation. All rights reserved. | ||
# Licensed under the MIT license. | ||
|
||
import os | ||
import librosa | ||
import numpy as np | ||
import soundfile as sf | ||
import argparse | ||
|
||
parser = argparse.ArgumentParser() | ||
parser.add_argument('--source_folder', default=None, required=True) | ||
parser.add_argument('--target_folder', default=None, required=True) | ||
parser.add_argument('--fs', type=int, default=16000) | ||
parser.add_argument('--log_rate', type=int, default=1000) | ||
args = parser.parse_args() | ||
|
||
source_folder = args.source_folder | ||
target_folder = args.target_folder | ||
fs = args.fs | ||
log_rate = args.log_rate | ||
print(f'Source Folder :: {source_folder}\nTarget Folder :: {target_folder}\nSampling Frequency :: {fs}', flush=True) | ||
|
||
source_files = [] | ||
target_files = [] | ||
list_completed = [] | ||
|
||
# Get the list of wav files from the source folder and create target file names (full paths) | ||
for f in os.listdir(source_folder): | ||
    if f[-4:].lower() == '.wav': | ||
        source_files.append(os.path.join(source_folder, f)) | ||
        target_files.append(os.path.join(target_folder, f)) | ||
print(f'Saved all the file paths, Number of files = {len(source_files)}', flush=True) | ||
|
||
# Convert the files to args.fs | ||
# Read with librosa and write the mono channel audio using soundfile | ||
os.makedirs(target_folder, exist_ok=True)  # create the target folder if it is missing | ||
print(f'Converting all files to {fs/1000} kHz', flush=True) | ||
for i, file_path in enumerate(source_files): | ||
    y, sr = librosa.load(file_path, sr=fs, mono=True) | ||
    sf.write(target_files[i], y, sr) | ||
    list_completed.append(target_files[i]) | ||
    if i % log_rate == 0: | ||
        print(f'File Number {i+1}, Shape of Audio {y.shape}, Sampling Frequency {sr}', flush=True) | ||
|
||
print(f'Number of Files saved {len(list_completed)}') | ||
print('Done', flush=True) |
42 changes: 42 additions & 0 deletions
applications/KWS_Phoneme/auxiliary_files/download_youtube_data.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Copyright (c) Microsoft Corporation. All rights reserved. | ||
# Licensed under the MIT license. | ||
|
||
import csv | ||
import os | ||
import argparse | ||
|
||
parser = argparse.ArgumentParser() | ||
parser.add_argument('--csv_file', default=None, required=True) | ||
parser.add_argument('--target_folder', default=None, required=True) | ||
args = parser.parse_args() | ||
|
||
with open(args.csv_file, 'r') as csv_f: | ||
    reader = csv.reader(csv_f, skipinitialspace=True) | ||
    # Skip 3 lines ; Header | ||
    next(reader) | ||
    next(reader) | ||
    next(reader) | ||
    for row in reader: | ||
        # Logging | ||
        print(row, flush=True) | ||
        # Link for the Youtube Video | ||
        YouTube_ID = row[0] # "-0RWZT-miFs" | ||
        start_time = int(float(row[1])) # 420 | ||
        end_time = int(float(row[2])) # 430 | ||
        # Construct downloadable link | ||
        YouTube_link = "https://youtu.be/" + YouTube_ID | ||
        # Output Filename | ||
        output_file = f"{args.target_folder}/ID_{YouTube_ID}.wav" | ||
        # Start time in hrs:min:sec format | ||
        start_sec = start_time % 60 | ||
        start_min = (start_time // 60) % 60 | ||
        start_hrs = start_time // 3600 | ||
        # End time in hrs:min:sec format | ||
        end_sec = end_time % 60 | ||
        end_min = (end_time // 60) % 60 | ||
        end_hrs = end_time // 3600 | ||
        # Start and End time args | ||
        time_args = f"-ss {start_hrs}:{start_min}:{start_sec} -to {end_hrs}:{end_min}:{end_sec}" | ||
        # Command Line Execution | ||
        os.system(f"youtube-dl -x -q --audio-format wav --postprocessor-args '{time_args}' {YouTube_link}" + " --exec 'mv {} " + f"{output_file}'") | ||
        print('', flush=True) |
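As a possible alternative to assembling one shell string, the sketch below builds the same youtube-dl invocation as an argument list and formats the timestamps with zero padding. It reuses only the flags that already appear in the script above. The helper names `to_hms` and `build_command` are hypothetical, the example target folder is made up, and the command is only constructed here, not executed.

```python
import shlex


def to_hms(total_seconds):
    """Format a second count as H:MM:SS."""
    hrs, rem = divmod(int(total_seconds), 3600)
    mins, secs = divmod(rem, 60)
    return f"{hrs}:{mins:02d}:{secs:02d}"


def build_command(youtube_id, start_time, end_time, target_folder):
    """Build the youtube-dl argument list for one row of the CSV file."""
    time_args = f"-ss {to_hms(start_time)} -to {to_hms(end_time)}"
    output_file = f"{target_folder}/ID_{youtube_id}.wav"
    return [
        "youtube-dl", "-x", "-q", "--audio-format", "wav",
        "--postprocessor-args", time_args,
        "https://youtu.be/" + youtube_id,
        # The --exec string is run by a shell, so the output path is quoted.
        "--exec", "mv {} " + shlex.quote(output_file),
    ]


cmd = build_command("-0RWZT-miFs", 420, 430, "/tmp/noise")
print(cmd)
```

A list like this could be handed to `subprocess.run(cmd, check=True)`, which avoids quoting problems in the outer command and raises an error when a download fails.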