Key Word Spotting Example
==============================================
This example shows an implementation of a small speech recognition use case for Key Word Spotting (KWS). The [TensorFlow speech commands tutorial](https://www.tensorflow.org/tutorials/sequences/audio_recognition) was used as the basis for neural network model training, with the following most notable changes:
1) Input features: FBANK features instead of MFCC
2) Model architecture: depthwise-separable convolutions + an LSTM cell.

The KWS modules are designed to process an audio stream. The rest of the example code wraps the modules to process a single WAV file and output performance measurements.

Quick Start
--------------

The example supports building with [MetaWare Development tools](https://www.synopsys.com/dw/ipdir.php?ds=sw_metaware) and running with the MetaWare Debugger on the [nSim simulator](https://www.synopsys.com/dw/ipdir.php?ds=sim_nSIM). Building with the ARC GNU toolchain isn't supported yet.
### Build with MetaWare Development tools

Below we consider building for the [/hw/em9d.tcf](/hw/em9d.tcf) template. This is the default template for this example; other templates can also be used.

0. The embARC MLI Library must be built for the required hardware configuration first. See [embARC MLI Library building and quick start](/README.md#building-and-quick-start).

1. Open a command line and change the working directory to `./examples/example_kws_speech`

2. The example uses functions from the DSP library and floating-point calculations for audio processing. For optimal performance, the C runtime libraries should be rebuilt for the target hardware template. Enter the following command in the console (building may take 10-15 minutes):

        buildlib em9d_rt_libs -bd ./ -tcf="../../hw/em9d.tcf"

3. Clean previous example build artifacts (optional):

        gmake clean

4. Build the example, pointing to the runtime libraries built in step 2:

        gmake TCF_FILE=../../hw/em9d.tcf RT_LIB=./em9d_rt_libs

### Run example with MetaWare Debugger on nSim simulator

The example application requires a path to a WAV file (16 kHz sample rate, 16-bit samples, mono). The test sample inside the example directory may be used:

    gmake run TCF_FILE=../../hw/em9d.tcf RUN_ARGS=./test.wav

Expected console output:

    0.240 1.215 "on"(99.994%)
    1.440 2.415 "stop"(99.997%)
    3.600 4.575 "_unknown_"(99.960%)

The first and second columns are the start and end timestamps in seconds, respectively. The third column is the detected command with its probability in parentheses. For the full list of commands, see the corresponding [module explanation](#kws-modules-explanation).
### Build with ARC GNU toolchain
Building with the ARC GNU toolchain isn't supported yet.


Example Structure
--------------------
The structure of the example application can be logically divided into the following parts:

* **Application.** Implements allocation of fast memory and module initialization. Reading the input WAV file and feeding the KWS module with samples also belong to this part (see the flow sketch below the list).
  * main.cc
  * wav_file_guard.cc
  * wav_file_guard.h
* **KWS results postprocessor.** As the KWS modules output arrays of probabilities for each Nth input frame, this code analyzes the result series and provides the final decision.
  * simple_kws_postprocessor.cc
  * simple_kws_postprocessor.h
* **Audio features extractor.** Transforms the input audio waveform into a more compact feature set that is easier to analyze further.
  * audio_features/audio_features.cc
  * audio_features/audio_features.h
* **Common interface for KWS modules.** Common interface declaration for KWS modules (including data types and a module builder).
  * kws/kws_factory.h
  * kws/kws_module.h
  * kws/kws_types.h
* **KWS module implementation.** Implements the logic of the common KWS interface for a specific NN graph. Uses the embARC MLI Library to perform the NN part of processing.
  * dsconv_lstm_nn/*

The example also contains a test WAV file consisting of 3 samples from the Google speech commands dataset:
* test.wav
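The flow sketch below shows how these parts may interact. It is illustrative pseudocode: every class and function name in it (`WavStream`, `create_kws_module`, `process_frame`, `Postprocessor`, and so on) is a hypothetical stand-in, not the actual API declared in the files listed above.

```cpp
// Hypothetical sketch of the per-frame processing loop; the real types live
// in wav_file_guard.h, kws/kws_factory.h, and simple_kws_postprocessor.h.
#include <cstdint>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    WavStream wav(argv[1]);           // hypothetical: 16 kHz / 16-bit mono WAV reader
    auto kws = create_kws_module();   // hypothetical factory call
    Postprocessor postproc;           // hypothetical result-series smoother

    std::vector<int16_t> frame(kws->frame_stride());
    while (wav.read(frame.data(), frame.size())) {
        // Each audio frame updates the module's feature buffer; once enough
        // frames are accumulated, the module runs NN inference and returns
        // per-keyword probabilities (null otherwise).
        if (const float* probs = kws->process_frame(frame.data()))
            postproc.handle(probs, kws->labels_num());
    }
    return 0;
}
```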
More Options on Building and Running
---------------------------------------
To see profiling measures (cycle counts and memory requirements) in the console output, pass the "-info" option after the WAV file path:

    gmake run TCF_FILE=../../hw/em9d.tcf RUN_ARGS="./test.wav -info"


KWS Modules explanation
----------------------------
### Depthwise-separable convolution with an LSTM backend layer (kws/dsconv_lstm_nn)

In short:
* The module distinguishes 10 keywords (the standard TF tutorial set including yes, no, up, etc.) plus "silence" and "unknown"
* Filter bank (FBANK) features extracted from input audio frames serve as the neural network input
* Layer 1: Convolution with Leaky ReLU activation (slope = 0.2)
* Layers 2-4: Depthwise + pointwise convolutions with Leaky ReLU activation (slope = 0.2; sketched below)
* Average pooling across the frequency dimension after layer 4
* Layer 5: LSTM sequential processing across the time dimension
* Layer 6: Fully connected layer + softmax.
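As a reference for what layers 2-4 compute, here is a minimal float sketch of a depthwise-separable convolution followed by Leaky ReLU (slope 0.2). It is illustrative only: the tensor sizes are made up, and the module itself uses the optimized fixed-point embARC MLI kernels instead of this reference code.

```cpp
// Naive depthwise-separable convolution on a CHW float tensor, followed by
// Leaky ReLU (slope 0.2). Valid padding, stride 1.
#include <cstdio>
#include <vector>

constexpr int C = 2, H = 4, W = 4;   // input channels and spatial size (made up)
constexpr int K = 3;                 // depthwise kernel size (3x3)
constexpr int OC = 3;                // output channels of the pointwise step

float leaky_relu(float x) { return x > 0.f ? x : 0.2f * x; }

int main() {
    std::vector<float> in(C * H * W, 1.f);     // dummy input
    std::vector<float> dw(C * K * K, 0.1f);    // one KxK filter per input channel
    std::vector<float> pw(OC * C, 0.5f);       // 1x1 filters mixing channels
    const int OH = H - K + 1, OW = W - K + 1;

    // Depthwise stage: each input channel is filtered independently.
    std::vector<float> mid(C * OH * OW, 0.f);
    for (int c = 0; c < C; ++c)
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        mid[(c * OH + y) * OW + x] +=
                            in[(c * H + y + ky) * W + x + kx] * dw[(c * K + ky) * K + kx];

    // Pointwise stage: a 1x1 convolution mixes channels at every position.
    std::vector<float> out(OC * OH * OW, 0.f);
    for (int oc = 0; oc < OC; ++oc)
        for (int p = 0; p < OH * OW; ++p) {
            float acc = 0.f;
            for (int c = 0; c < C; ++c)
                acc += mid[c * OH * OW + p] * pw[oc * C + c];
            out[oc * OH * OW + p] = leaky_relu(acc);
        }

    printf("out[0] = %.3f\n", out[0]);  // sanity check: prints 0.900
    return 0;
}
```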
List of commands:
+ "yes"
+ "no"
+ "up"
+ "down"
+ "left"
+ "right"
+ "on"
+ "off"
+ "stop"
+ "go"
All other commands outside this list are marked as "\_unknown\_".
The module takes a mono 16-bit PCM input audio stream with a 16 kHz sample rate in frames of 15 ms length (240 samples, which is the stride size). FBANK features are calculated over a 30 ms long window, extracting 13 values per frame.

When the input feature sequence is long enough to perform a complete NN inference (features for 65 frames), the module invokes the MLI-based NN implementation of the architecture described above.
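The framing arithmetic can be checked with the short program below. It only derives values from the parameters quoted above (16 kHz mono input, 15 ms stride, 30 ms window, 13 FBANK values per frame, 65 frames per inference); it does not read anything from the module.

```cpp
// Framing arithmetic for the KWS module input, derived from the README parameters.
#include <cstdio>

int main() {
    const int sample_rate_hz = 16000;
    const int stride_samples = sample_rate_hz * 15 / 1000;  // 240 samples per 15 ms frame
    const int window_samples = sample_rate_hz * 30 / 1000;  // 480 samples per 30 ms FBANK window
    const int fbank_per_frame = 13;                         // features extracted per frame
    const int frames_per_inference = 65;

    // Feature buffer the NN consumes: 65 frames x 13 FBANK values.
    const int feature_buf_len = frames_per_inference * fbank_per_frame;  // 845 values
    // Audio covered by one inference: 64 strides plus one full window.
    const double covered_ms = ((frames_per_inference - 1) * stride_samples + window_samples)
                              / (double)sample_rate_hz * 1000.0;         // 990 ms

    printf("stride: %d samples, window: %d samples\n", stride_samples, window_samples);
    printf("NN input: %d values, covering ~%.0f ms of audio\n", feature_buf_len, covered_ms);
    return 0;
}
```

So one inference covers roughly one second of audio, which matches the clip length of the speech commands dataset.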
The following table provides the accuracy of the trained model and its MLI-based versions for various bit depths of the quantized data:

| | TF Baseline (Float) | MLI FX16 | MLI FX8 | MLI <br/> DSCNN: FX8 <br/> LSTM: FX8w16d |
| :----------------------------------------------------: | :-------------------: | :-----------------: | :--------------------: | :----------------------------------------: |
| Test dataset accuracy <br/> [diff with baseline] | 95.89% | 95.89% <br/> [==] | 94.826% <br/> [-1.06%] | 95.481% <br/> [-0.41%] |
The version corresponding to the last column is implemented in the module. To keep the trained weights compact, all of them have been quantized to 8-bit signed values. To minimize the buffers for intermediate results, the inputs and outputs of the first 4 layers (convolutions) are also of 8-bit depth. Activations of the last convolution are transformed to 16-bit depth to reduce the effect of error accumulation in the recurrent LSTM cell.
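A minimal sketch of the symmetric fixed-point (Q-format) idea behind such 8-bit quantization is shown below. It is illustrative math only, with an assumed Q1.6 format for a sample weight; it is not the embARC MLI conversion routine.

```cpp
// Illustrative Q-format quantization: map float values into 8-bit signed
// fixed point with a chosen number of fractional bits.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int8_t quantize_fx8(float x, int frac_bits) {
    // Scale, round to nearest, and saturate to the int8 range.
    float scaled = std::round(x * (float)(1 << frac_bits));
    scaled = std::min(std::max(scaled, -128.0f), 127.0f);
    return (int8_t)scaled;
}

float dequantize_fx8(int8_t q, int frac_bits) {
    return (float)q / (float)(1 << frac_bits);
}

int main() {
    const int frac_bits = 6;   // assumed Q1.6 format: covers roughly [-2, 2)
    const float w = 0.73f;     // hypothetical example weight value
    const int8_t q = quantize_fx8(w, frac_bits);
    // Prints: 0.7300 -> 47 -> 0.7344 (quantization error ~0.004)
    printf("%.4f -> %d -> %.4f\n", w, q, dequantize_fx8(q, frac_bits));
    return 0;
}
```

The same scheme with 16-bit containers (FX16) leaves more headroom for accumulated error, which is why the LSTM activations use it.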
References
----------------------------
Simple audio recognition tutorial:
> TensorFlow - Simple Audio Recognition. https://www.tensorflow.org/tutorials/sequences/audio_recognition

Speech commands dataset:
> Pete Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209, 2018.

Explanation of FBANK and MFCC features:
> Haytham Fayek. Speech Processing for Machine Learning: Filter Banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between. Blog post. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html