
Inference Framework



Overview

wav2letter@anywhere is a multithreaded, multi-platform library that lets researchers, production engineers, and students quickly put together trained DNN modules for online inference. This document is a user guide with an in-depth description of the streaming library architecture and its tradeoffs.

Why Should You Use It?

The streaming inference DNN processing graph allows researchers, production engineers, and students to quickly:

  • Build a streaming speech recognition system using wav2letter++.
  • Easily load trained modules into memory and compose an efficient processing graph.
  • Run an unlimited number of concurrent processing streams.
  • Use compressed modules for efficient memory use while maintaining high throughput and very low latency.
  • Use the currently released version on hosts; versions supporting Android and iOS will follow.
  • Load existing trained modules, or your own, from the wav2letter training framework.

Features

Module Library

The inference streaming framework comes packed with ASR building-block modules such as fully connected (linear) layers, streaming convolutions, a decoder, activation functions, feature extraction, and more.

Trained modules

Trained English modules are freely available at: ...

Conversion tool

An easily extensible conversion tool imports trained modules into the inference format, which can use less memory. It currently supports a 16-bit floating-point internal representation; converting 32-bit modules to 16-bit reduces memory size while keeping inference quality high.

The tool currently supports input from wav2letter, but users are encouraged to extend this simple tool to any input format they need.

Serialization

The inference streaming framework supports serialization into binary files, as well as into JSON and XML formats and into any streamable destination. The user can later read such a file to create a fully functional streaming inference DNN.
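
As an illustrative sketch only (assuming the modules are serialized with the cereal library, as in the Serialize example at the end of this page, and that the module types are registered for text archives), saving to JSON looks just like the binary case with a different archive type:

#include <fstream>
#include <cereal/archives/json.hpp>

// Sketch only: whether every module type supports cereal's JSON archive is an
// assumption, not something stated on this page.
// sequence is a std::shared_ptr<Sequential>, as in the Serialize section below.
{
  std::ofstream myfile("dnn.json");
  cereal::JSONOutputArchive archive(myfile);
  archive(sequence);
}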

Multithreading

A single configured module supports any number of concurrent streams. This architecture maximizes efficiency while minimizing memory use.
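
As an illustration (this sketch is not from the source), a single module object can serve several streams at once: each thread creates its own ModuleProcessingState and drives its own stream through the shared module using the start()/run()/finish() API described in the Software architecture section below. AudioSource, source1, and source2 are hypothetical stand-ins for the caller's audio input.

// Sketch only: AudioSource is a hypothetical input type; InferenceModule and
// ModuleProcessingState are described later on this page.
void transcribeStream(std::shared_ptr<InferenceModule> module, AudioSource& source) {
  auto input = std::make_shared<ModuleProcessingState>(1);
  auto output = module->start(input);
  while (source.hasMore()) {
    std::vector<float> chunk = source.nextChunk();
    input->buffer(0)->write<float>(chunk.data(), chunk.size());
    output = module->run(input);
    // ... read the result from output->buffer(0) ...
  }
  module->finish(input);
}

// Each stream gets its own thread and its own state; the module object is shared.
// std::thread t1(transcribeStream, module, std::ref(source1));
// std::thread t2(transcribeStream, module, std::ref(source2));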

Multi Platform

The streaming inference architecture is easily extensible to new platforms. We currently release the FBGEMM backend for hosts and servers; Android and iOS backends for on-device inference will follow.

Free and open source

Wav2letter and wav2letter-inference are free and open-source projects supported by Facebook scientists and engineers and by the community.

Software architecture

This section describes the streaming library's main abstractions.

Inference Module

Inference module is the base abstraction for objects that process streaming input to output. It is the base class for DNN modules, activation functions, and composite modules. Composite modules chain modules, activation functions, and possibly other composite modules.


Composite modules are composed of simpler modules. For example, TDSBlock is a subclass of Sequential and is composed of other modules, including Residual modules and activation functions. Each module has a memory manager, which it uses to allocate temporary workspace for processing. Users can use the default memory manager or extend that class to optimize for specific cases, collect statistics, and so on.

class InferenceModule {
 public:
  using StatePtr = std::shared_ptr<ModuleProcessingState>;

  // Called once at the beginning of the stream.
  virtual StatePtr start(StatePtr input) = 0;

  // Called for each chunk of streaming input.
  virtual StatePtr run(StatePtr input) = 0;

  // Called once at the end of the stream.
  virtual StatePtr finish(StatePtr input) {
    return run(input);
  }

  void setMemoryManager(std::shared_ptr<MemoryManager> memoryManager);
};

An inference module processes a stream using three methods: start(), run(), and finish(). The user calls start() at the beginning of the stream, run() to process streaming input as it arrives, and finish() at the end of the stream. Each of these methods takes a shared pointer to a ModuleProcessingState as input and returns another object of the same type as output. The same input object always yields the same output object, regardless of which method is called and how many times.
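
For example (illustrative, with module and input standing in for any inference module and the caller's input state), the invariant means the caller can hold on to the output object returned by start():

// Illustrative: the output object is created once per input and then reused.
auto out1 = module->start(input);
auto out2 = module->run(input);
auto out3 = module->finish(input);
assert(out1 == out2 && out2 == out3);  // the same ModuleProcessingState each time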

Module Processing State

Stream processing requires keeping some intermediate state. We keep that state in a vector of buffers per stream, per module (only for modules that need it). ModuleProcessingState abstracts the complete state per stream as a linked list from the first input to the final output.


A ModuleProcessingState is a linked-list node. The user creates a ModuleProcessingState, writes the first input of the stream into one or more of its buffers, and calls start(). start() allocates an output ModuleProcessingState and sets it as the next node in the list. The user gets that output; its buffer(s) hold the result. Complex modules may create multiple links in the ModuleProcessingState list between the user's input and the returned output.

[Diagram: mps is short for ModuleProcessingState; each mps holds the state per stream.]

class ModuleProcessingState {
 public:
  std::shared_ptr<IOBuffer>& buffer(int index);

  std::vector<std::shared_ptr<IOBuffer>>& buffers();

  std::shared_ptr<ModuleProcessingState> next(bool createIfNotExists = false);

 private:
  std::vector<std::shared_ptr<IOBuffer>> buffers_;
  std::shared_ptr<ModuleProcessingState> next_;
};
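
For illustration only (not taken from the source), the chain can be walked with next(); this assumes next() returns an empty pointer once the end of the chain is reached:

// Sketch: walk the ModuleProcessingState chain from the user's input toward the
// final output, printing how many buffers each node holds.
auto state = input;  // the ModuleProcessingState passed to start()
while (state) {
  std::cout << "node with " << state->buffers().size() << " buffer(s)\n";
  state = state->next();
}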

IOBuffer

The IOBuffer is a simple, self-growing memory buffer. You'll mostly use it to write input and read output; a ModuleProcessingState is in fact a vector of IOBuffers. Users and modules may use an IOBuffer with different element types. To that end, templated methods allow accessing the buffer as a buffer of any type; the size in all these methods is in units of the specified type.

class IOBuffer {
 public:
  template <typename T> void write(const T* buf, int size);

  template <typename T> void consume(int size);

  template <typename T> int size() const;

  template <typename T> T* data();
};
 

Example use of member template methods:

vector<float> myInput = {..};
inputBuffer->write<float>(myInput.data(), myInput.size());
float* bufPtr = inputBuffer->data<float>();
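
Continuing the example above (illustrative), size<float>() and consume<float>() operate in units of the element type:

int numFloats = inputBuffer->size<float>();  // equals myInput.size()
// ... process bufPtr[0 .. numFloats - 1] ...
inputBuffer->consume<float>(numFloats);      // discard the samples we just processed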

Memory Manager

When modules need workspace for calculations, they ask the memory manager. Users can create specialized memory managers by subclassing MemoryManager. If the user does not set a memory manager, the DefaultMemoryManager is used; it simply calls malloc and free.
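
As an illustration only, a specialized manager might count allocations. The exact virtual interface of MemoryManager is not shown on this page, so the allocate()/deallocate() signatures below are assumptions; check MemoryManager.h for the real interface.

#include <atomic>
#include <cstdlib>

// Hypothetical sketch of a MemoryManager subclass that counts allocations.
class CountingMemoryManager : public MemoryManager {
 public:
  void* allocate(size_t sizeInBytes) {
    ++allocationCount_;
    return std::malloc(sizeInBytes);
  }

  void deallocate(void* ptr) {
    std::free(ptr);
  }

  size_t allocationCount() const { return allocationCount_; }

 private:
  std::atomic<size_t> allocationCount_{0};
};

// Attach it to a module, e.g.:
// sequence->setMemoryManager(std::make_shared<CountingMemoryManager>());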

Flexible Backend

Modules that can use backend acceleration, such as the Conv1d and Linear layers, are instantiated using a factory function. The factory function creates a subclass object that is accelerated and optimized for the current architecture. For example, createLinear() is declared in Linear.h in the module/nn directory.

using ParamPtr = std::shared_ptr<ModuleParameter>;
 
std::shared_ptr<Linear> createLinear(
   int nInput,
   int nOutput,
   ParamPtr weights,
   ParamPtr bias);

createLinear() returns a subclass of Linear that uses the best backend for the current architecture.

The backend is selected at build time by setting W2L_INFERENCE_BACKEND:

cmake -DW2L_INFERENCE_BACKEND=fbgemm

This will create a Makefile that picks the backend implemented in the fbgemm source directory.

inference
 ├── common
 ├── decoder
 └── module
     └── nn
          ├── Linear.h
          └── backend
               └── fbgemm
                    └── LinearFbGemm.cpp

The function createLinear() is implemented in LinearFbGemm.cpp

using ParamPtr = std::shared_ptr<ModuleParameter>;
 
std::shared_ptr<Linear> createLinear(
   int nInput,
   int nOutput,
   ParamPtr weights,
   ParamPtr bias) {
 return std::make_shared<LinearFbGemm>(nInput, nOutput, weights, bias);
}

It returns a LinearFbGemm, a subclass of Linear that uses the FBGEMM library, which is optimized for high-performance, low-precision (16-bit floating point) computation on x86 machines.

Examples

The supplied examples can be used directly to quickly bootstrap a demo. There are two simple utilities:

  • simple_wav2letter_example can be used to quickly transcribe a single audio file.
  • multithreaded_wav2letter_example can be used to quickly transcribe many audio files.

Download the example model from AWS S3:

~$>ls -sh ~/model/
total 270M
254M acoustic_model.bin  
1.0K arch.txt	 
512 decoder_options.json   
512 feature_extractor.bin   
13M language_model.bin	
4.0M lexicon.txt   
82K tokens.txt

Download audio samples from OpenSLR (LibriSpeech):

~$> wget -qO- http://www.openslr.org/resources/12/train-clean-100.tar.gz | tar xvz
~$> wget -qO- http://www.openslr.org/resources/12/dev-clean.tar.gz | tar xvz
~$> wget -qO- http://www.openslr.org/resources/12/test-clean.tar.gz | tar xvz


 
The ideal audio file length is about 15 seconds, in 16 kHz mono .WAV format.
Say we have our audio samples at ~/audio.
 
```Shell Session
~:> ls audio
sample0001.wav  sample0002.wav  sample0003.wav
 
~:> file audio/*
audio/sample0001.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
audio/sample0002.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
audio/sample0003.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
~:> git clone https://github.com/facebookresearch/wav2letter.git
~:> cd wav2letter && mkdir build && cd build
~/wav2letter/build:> cmake .. -DW2L_BUILD_LIBRARIES_ONLY=ON
~/wav2letter/build:> make -j$(nproc)
...
```
 
 
The examples come packed with a serialized Mel feature extraction model.
```Shell Session
wget https://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/

ls -sh ~/model/
total 270M
254M acoustic_model.bin  1.0K arch.txt	 512 decoder_options.json   512 feature_extractor.bin   13M language_model.bin	4.0M lexicon.txt   82K tokens.txt
~/wav2letter/build:> ../src/inference/examples/multithreaded_wav2letter_example \
  --input_audio_file_of_paths=~/input_files_paths.txt \
  --output_files_base_path=~/audio/output \
  --lexicon_file=~/lexicon.txt
$> cat audio.wav | simple_wav2letter_example --input_audio_file="" --lexicon_file=/home/avidov/lexicon.txt 2>&1 | tee /tmp/simple_wav2letter_example
Started features model file loading ...
Completed features model file loading elapsed time=25737microseconds
 
Started acustic model file loading ...
Completed acustic model file loading elapsed time=2783milliseconds
 
Started tokens file loading ...
Completed tokens file loading elapsed time=4650microseconds
 
Tokens loaded - 9998 tokens
Started create decoder ...
[Letters] 9998 tokens loaded.
[Words] 200001 words loaded.
Completed create decoder elapsed time=19625milliseconds
 
Started converting audio input from stdin to text... ...
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
1000,2000,
2000,3000,uncle julia said
3000,4000,and auntie
4000,5000,helen can't to finish
5000,6000,my toilet
6000,7000,while they were making theirs
7000,8000,
8000,9000,there now
9000,10000,you have nothing to complain
10000,11000,of in the way of
11000,12000,looks she remarked at
11601.1,12202.1,the completion
11601.1,11601.1,of the ceremony
Completed converting audio input from stdin to text... elapsed time=2395milliseconds
```

multithreaded_wav2letter_example transcribes any number of audio files into the equivalent text.

Code Example

Create a simple module

Creating a module is simple. All you need are the module's parameter values. We wrap the values in ModuleParameter objects and then create the module directly.

#include "inference/module/Module.h"
#include "inference/module/Conv1dCreate.h"
 
// Create or load the parameter raw data
std::vector<float> convWeights = {-0.02, 0.21, .. };
std::vector<float> convBias = { 0.1, -0.2 };
 
// Use the raw data to create inference parameter objects.
const auto convWeightParam = std::make_shared<ModuleParameter>(
   DataType::FLOAT, convWeights.data(), convWeights.size());
const auto convBiasParam = std::make_shared<ModuleParameter>(
   DataType::FLOAT, convBias.data(), convBias.size());
 
// Create a configured DNN module.
auto conv = Conv1dCreate(
   inputChannels,
   outputChannels,
   kernelSize,
   stride,
   {leftPadding, rightPadding},
   groups,
   convWeightParam,
   convBiasParam);
 

Assemble Complex Modules

Complex networks are assembled from simple layers using the Sequential module.

auto linear = LinearCreate(inputChannels,
                           outputChannels,
                           linearWeightParam,
                           linearBiasParam);

auto layerNorm = std::make_shared<LayerNorm>(channels, layerNormWeight, layerNormBias);
 
auto sequence = std::make_shared<Sequential>();
sequence->add(conv);
sequence->add(std::make_shared<Relu>(dataType));
sequence->add(layerNorm);
sequence->add(std::make_shared<Relu>(dataType));
sequence->add(linear);

Process Input

auto input = std::make_shared<ModuleProcessingState>(1);
auto output = sequence->start(input);

std::shared_ptr<IOBuffer> inputBuffer = input->buffer(0);
std::shared_ptr<IOBuffer> outputBuffer = output->buffer(0);

while (yourInputSource.hasMore()) {
  std::vector<float> yourInput = yourInputSource.nextChunk();
  inputBuffer->write<float>(yourInput.data(), yourInput.size());

  // Run the module on the next input.
  // The buffers of the output are updated in place. The output object is the same
  // one returned by start() and by every call of run().
  output = sequence->run(input);

  UseTheResult(outputBuffer->data<float>(), outputBuffer->size<float>());
}

output = sequence->finish(input);
UseTheResult(outputBuffer->data<float>(), outputBuffer->size<float>());

Serialize

// Save sequence to a binary file.
{
  std::ofstream myfile("dnn.bin", std::ios::binary);
  cereal::BinaryOutputArchive archive(myfile);
  archive(sequence);
}

// Load sequence from a binary file.
{
  std::ifstream myfile("dnn.bin", std::ios::binary);
  std::shared_ptr<Sequential> sequence;
  cereal::BinaryInputArchive archive(myfile);
  archive(sequence);

  … sequence->run(...)
}

Conclusion

wav2letter@anywhere is a high-performance, low-overhead, multithreaded, multi-platform framework for quickly assembling ASR inference for research and for embedding in products.
