Week 20.03 - 26.03.2017
Process Dataset

Fix dataset source files

- TIMIT stores wavs in a weird NIST (SPHERE) format; we need to convert them to normal WAV. Use transform.py for this (a minimal conversion sketch follows below):
  python processDataset/fixDataset/transform.py wavs -i /home/matthijs/TCDTIMIT/TIMIT/original/TIMIT/ -o /home/matthijs/TCDTIMIT/TIMIT/processed
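  For reference, a minimal sketch of what such a conversion does, assuming TIMIT's usual 1024-byte SPHERE header and 16-bit mono little-endian PCM (the hypothetical nist_to_wav helper below is not necessarily how transform.py does it):

  ```python
  import wave

  def nist_to_wav(nist_path, wav_path, sample_rate=16000):
      # NIST SPHERE files begin with an ASCII header; in TIMIT it is 1024 bytes
      with open(nist_path, 'rb') as f:
          header = f.read(1024)
          assert header.startswith(b'NIST_1A')
          pcm = f.read()  # remaining bytes: raw 16-bit little-endian PCM, as WAV expects
      with wave.open(wav_path, 'wb') as w:
          w.setnchannels(1)            # TIMIT is mono
          w.setsampwidth(2)            # 16-bit samples
          w.setframerate(sample_rate)  # TIMIT is recorded at 16 kHz
          w.writeframes(pcm)
  ```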
- TIMIT .PHN files need to be mapped from 61 to 39 phonemes (mapping: see phoneme_set.py). Use transform.py again (an excerpt of the folding is sketched below):
  python processDataset/fixDataset/transform.py phonemes -i /home/matthijs/TCDTIMIT/TIMIT/original/TIMIT/ -o /home/matthijs/TCDTIMIT/TIMIT/processed
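  For illustration, an excerpt of the standard 61 -> 39 folding (Lee & Hon 1989) and a hypothetical per-line helper; the authoritative table is in phoneme_set.py and may differ in details:

  ```python
  # Excerpt of the 61 -> 39 folding; see phoneme_set.py for the full table.
  fold = {'ao': 'aa', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'hv': 'hh',
          'ix': 'ih', 'el': 'l', 'em': 'm', 'en': 'n', 'nx': 'n',
          'zh': 'sh', 'ux': 'uw', 'pcl': 'sil', 'tcl': 'sil', 'kcl': 'sil',
          'bcl': 'sil', 'dcl': 'sil', 'gcl': 'sil', 'h#': 'sil', 'pau': 'sil'}

  def fold_phn_line(line):
      """Map one .PHN line ('start end phoneme') onto the reduced phoneme set."""
      start, end, phn = line.split()
      return ' '.join((start, end, fold.get(phn, phn)))
  ```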
- You can generate an MLF (Master Label File) containing all phoneme frame and label info (a minimal parser sketch follows below):
  python processDataset/fixDataset/createMLF.py /home/matthijs/TCDTIMIT/TIMIT/processed
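  For reference, an HTK-style MLF starts with '#!MLF!#', then holds one entry per utterance, each terminated by a '.' line. A minimal parser sketch, assuming the common 'start end label' line format (read_mlf is a hypothetical helper, not part of the repo):

  ```python
  def read_mlf(path):
      """Parse an HTK Master Label File into {utterance: [(start, end, label), ...]}."""
      labels, current = {}, None
      with open(path) as f:
          assert f.readline().strip() == '#!MLF!#'  # MLF magic header
          for line in f:
              line = line.strip()
              if line.startswith('"'):              # a new utterance entry
                  current = line.strip('"')
                  labels[current] = []
              elif line == '.':                     # end of the current entry
                  current = None
              elif current is not None:
                  start, end, label = line.split()[:3]
                  labels[current].append((int(start), int(end), label))
      return labels
  ```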
Specific for TCDTIMIT:

- Problems:
  - Resample the audio WAV files so they are all 16kHz (not 48kHz as shipped in TCDTIMIT): see processDataset/fixDataset/resample.py in the audio repo (it lives there because it might also be needed for other databases). Improvement: use the resampy library, as in the sketch below.
  - Audio label files should use sample frames, not times.
  - There is only an MLF file; we need to extract separate .PHN files: see getPhnFiles.py in the TCDTIMITprocessing repo. Also edit the MLF files in the TCDTIMITprocessing repo to make sure their paths are correct (use Search & Replace for that).
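A minimal resampling sketch with the resampy library, assuming 16-bit mono WAVs (resample_wav is a hypothetical helper and may differ from resample.py in details):

```python
import numpy as np
import resampy
from scipy.io import wavfile

def resample_wav(in_path, out_path, target_rate=16000):
    """Resample a wav (e.g. 48 kHz TCDTIMIT audio) down to 16 kHz."""
    rate, data = wavfile.read(in_path)
    if rate != target_rate:
        data = resampy.resample(data.astype(np.float64), rate, target_rate)
    wavfile.write(out_path, target_rate, data.astype(np.int16))
```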
- cd into TCDTIMITprocessing. All of the following functions are taken care of in extractTCDTIMITaudio.py: edit it to specify source and target directories, then run.
  - Extracting phonemes:
    python getPhnFiles.py "./MLFfiles/lipspeaker_labelfiles.mlf" ~/TCDTIMIT/TCDTIMITaudio
    python getPhnFiles.py "./MLFfiles/volunteer_labelfiles.mlf" ~/TCDTIMIT/TCDTIMITaudio
  - Extracting the TCDTIMIT wav files:
    python helpFunctions/copyFilesOfType.py /media/matthijs/TOSHIBA_EXT/TCDTIMIT/ /home/matthijs/TCDTIMIT/TCDTIMITaudio "wav"
  - The files are all under one directory per speaker (eg Lipspkr1/Clips/straightcam/sa1.wav). Place them in the TIMIT standard file structure (Lipspkr1/sa1/sa1.wav), as in the sketch below:
    python helpFunctions/fixTCDTIMITwavStructure.py /home/matthijs/TCDTIMIT/TCDTIMITaudio/processed/
- You'll then also still have to run the transform.py wavs ... step as with the TIMIT data. This will take a while, as the TCDTIMIT wavs need to be resampled to 16kHz.
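A sketch of that restructuring step, assuming exactly the Clips/straightcam layout described above (fix_wav_structure is a hypothetical stand-in for fixTCDTIMITwavStructure.py):

```python
import os
import shutil

def fix_wav_structure(root):
    """Move <speaker>/Clips/straightcam/<utt>.wav to <speaker>/<utt>/<utt>.wav."""
    for speaker in os.listdir(root):
        clips = os.path.join(root, speaker, 'Clips', 'straightcam')
        if not os.path.isdir(clips):
            continue
        for fname in os.listdir(clips):
            if not fname.lower().endswith('.wav'):
                continue
            utt = os.path.splitext(fname)[0]
            dest = os.path.join(root, speaker, utt)
            if not os.path.isdir(dest):
                os.makedirs(dest)
            shutil.move(os.path.join(clips, fname), os.path.join(dest, fname))
```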
Use fixed dataset to generate .pkl files

- load in data + phonemes
- convert wav to MFCC
- label to class number
- map label sample numbers to frame windows
- mean (+ std dev) normalization
- one-hot encoding of targets
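A sketch of these steps for a single utterance, assuming python_speech_features for the MFCCs and a 10 ms frame step (names and parameters here are illustrative, not the repo's actual code):

```python
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

def preprocess_utterance(wav_path, phn_path, phoneme_classes, winstep=0.01):
    """Turn one wav + .PHN pair into normalized MFCC frames and one-hot targets."""
    rate, signal = wavfile.read(wav_path)
    feats = mfcc(signal, samplerate=rate, winstep=winstep)    # wav -> MFCC frames
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # mean/std normalization

    targets = np.zeros(len(feats), dtype=np.int32)
    with open(phn_path) as f:
        for line in f:
            start, end, phn = line.split()     # sample indices + phoneme label
            cls = phoneme_classes[phn]         # label -> class number
            # map the sample range onto the frame windows it covers
            first = int(int(start) / (rate * winstep))
            last = min(int(int(end) / (rate * winstep)), len(feats) - 1)
            targets[first:last + 1] = cls

    one_hot = np.eye(len(phoneme_classes))[targets]  # one-hot encoding of targets
    return feats, one_hot
```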
Use .pkl files as input to the network

- Before running the driver command (see here, at the bottom), edit start.sh and then do source start.sh in the terminal. It's used for setting some environment variables.
- Other useful scripts:
  - You can use TCDTIMITprocessing/copyFilesOfType for finding and extracting all wavs, phns, npz models etc:
    python3 fileDirOps.py ~/Documents/Dropbox/_MyDocs/_ku_leuven/Master_2/Thesis/convNets/code /media/matthijs/TOSHIBA_EXT/TCDTIMIT/zzzNPZmodels ".npz"
  - In-place replacing of all phonemes + counting phoneme occurrences:
    processDataset/fixDataset/substitute_phones.py
  - Clean up a directory structure (remove empty directories) by calling helpFunctions/removeEmptyDirs.py. From this source.
  - Visualization script: processDataset/fixDataset/helpFunctions/wavToPng.py produces time-domain and freq-domain visualizations of a wav file (a matplotlib sketch follows below).
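A minimal matplotlib sketch of such a visualization (wav_to_png is a hypothetical stand-in for wavToPng.py):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def wav_to_png(wav_path, png_path):
    """Plot a wav in the time domain (waveform) and frequency domain (spectrogram)."""
    rate, data = wavfile.read(wav_path)
    t = np.arange(len(data)) / float(rate)
    fig, (ax_time, ax_freq) = plt.subplots(2, 1)
    ax_time.plot(t, data)            # time-domain waveform
    ax_time.set_xlabel('time (s)')
    ax_freq.specgram(data, Fs=rate)  # frequency-domain spectrogram
    ax_freq.set_xlabel('time (s)')
    fig.savefig(png_path)
```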
-
- reasonably written
- 61 phonemes -> needs mapping (I can use TimoVNiedek's script for that); maybe also some hardcoded stuff
- output = .klepto, needs converting to npz. Shouldn't be hard
- RecNet, not Lasagne. Based on Theano, so I guess it should be similar
-
- really well written, OOP. seems quite easily adaptable
- Keras, so quite high level. Will need some changes, but preprocessing could be useful.
- 61 labels => :(
- executing:
  - set proper paths and source start.sh
  - in the same terminal: python driver.py model speech2phonemes train --data 10000 --summarize True
-
- well written, easy to follow
- tensorflow, so not necessarily suited for Theano
- they are doing some weird batch-thing in the preprocessing that I don't understand.
- uses 39 output phoneme classes -> can use to convert TIMIT labels.
-
- not so well written
- Lasagne = nice :)
- 61 phonemes -> needs mapping
2. Performance comparison with HMM (using HTK): this repo
- Gathering phonemes + counting: https://github.com/syhw/timit_tools/blob/master/src/create_phonesMLF_list_labels.py
  python createMLF.py /home/matthijs/TCDTIMIT/TIMIT/original/timit/
- Converting to 39 phonemes: this script, using the mapping dict 'timit_foldings.json' (in the Preprocessing folder):
  python substitute_phones.py /home/matthijs/TCDTIMIT/TIMIT/fixed/TIMIT/ helpFunctions/timit_foldings.json
- A good reference file for a Lasagne LSTM implementation.
- From: Graves2005, Framewise Phoneme Classification with Bidirectional LSTM Networks:
We found that large networks, of around 200,000 weights, gave good performance.
However, it is possible that smaller nets would have generalised better to the test set.
With the aim of keeping the number of parameters roughly constant,
we chose the following topologies:
• A unidirectional net with a hidden LSTM layer containing
205 memory blocks, with one cell each.
• A bidirectional net with two hidden LSTM layers (one
forwards and one backwards) each containing 140 one
cell memory blocks.
• A unidirectional net with a hidden layer containing 410
sigmoidal units.
• A bidirectional net with two hidden layers (one forwards
and one backwards) containing 280 sigmoidal units each.
All nets contained an input layer of size 26 (an input for each
MFCC coefficient), and an output layer of size 43.
The input layers were fully connected to the hidden layers and the hidden
layers fully connected to themselves and the output layers.
All LSTM blocks had the following activation functions:
logistic sigmoids in the range [−2, 2] for the input and output
squashing functions of the cell (g and h in Figure 1), and in
the range [0, 1] for the gates.
The non-LSTM nets had logistic sigmoid activations in the range [0, 1] in the hidden layers.
B. Output Layers
For the output layers, we used the cross entropy objective
function and the softmax activation function, as is standard
for 1 of K classification [4]. The softmax function ensures
that the network outputs are all between zero and one, and
that they sum to one on every timestep. This allows them to
be interpreted as the posterior probabilities of the phonemes at
a given frame, given all the inputs up to the current one (with
unidirectional nets) or all the inputs in the whole sequence
(with bidirectional nets).
Several alternative objective functions have been studied for
this task [7]. One modification in particular has been shown to
have a positive effect on full speech recognition (though not
necessarily on framewise classification). This is to weight the
error according to the duration of the current phoneme, which
ensures that short phonemes are as significant to the training
as longer ones.
C. Network Training
All nets were trained with gradient descent (error gradient
calculated with BPTT), using a learning rate of 10^−5 and
a momentum of 0.9. At the end of each utterance, weight
updates were carried out and network activations were reset
to 0.
For the unidirectional nets a delay of 4 timesteps was
introduced between the target and the current input — i.e. the
net always tried to predict the phoneme it had seen 4 timesteps
ago.
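A minimal Lasagne sketch of the bidirectional topology quoted above, using our 39 phoneme classes instead of the paper's 43 (build_blstm is illustrative, not the thesis code):

```python
from lasagne.layers import (ConcatLayer, DenseLayer, InputLayer,
                            LSTMLayer, ReshapeLayer)
from lasagne.nonlinearities import softmax

N_FEATURES = 26  # one input per MFCC coefficient, as in the paper
N_HIDDEN = 140   # 140 one-cell blocks per direction
N_CLASSES = 39   # we use 39 phonemes rather than the paper's 43

def build_blstm(batch_size=None, seq_len=None):
    """Framewise bidirectional LSTM classifier in the Graves2005 style."""
    l_in = InputLayer((batch_size, seq_len, N_FEATURES))
    l_fwd = LSTMLayer(l_in, N_HIDDEN)                  # forward pass over time
    l_bwd = LSTMLayer(l_in, N_HIDDEN, backwards=True)  # backward pass over time
    l_cat = ConcatLayer([l_fwd, l_bwd], axis=2)        # join both directions per frame
    # flatten (batch, time, 2*hidden) so the softmax acts on every frame
    l_flat = ReshapeLayer(l_cat, (-1, 2 * N_HIDDEN))
    return DenseLayer(l_flat, N_CLASSES, nonlinearity=softmax)
```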
- In the lipspeaker part: map frame numbers to times in the video. The audio also works with times; we need them to sync properly.