Week 20.03 - 26.03.2017
Process Dataset

Fix dataset source files

- TIMIT stores wavs in a weird NIST (SPHERE) format; we need to convert them to normal WAV. Use transform.py for this (a minimal conversion sketch follows below):
  python processDataset/fixDataset/transform.py wavs -i /home/matthijs/TCDTIMIT/TIMIT/original/TIMIT/ -o /home/matthijs/TCDTIMIT/TIMIT/processed
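  For reference, a minimal sketch of what such a conversion does, assuming TIMIT's usual 1024-byte SPHERE header and 16-bit mono little-endian PCM (the hypothetical nist_to_wav helper below is not necessarily how transform.py does it):

  ```python
  import wave

  def nist_to_wav(nist_path, wav_path, sample_rate=16000):
      # NIST SPHERE files begin with an ASCII header; in TIMIT it is 1024 bytes
      with open(nist_path, 'rb') as f:
          header = f.read(1024)
          assert header.startswith(b'NIST_1A')
          pcm = f.read()  # remaining bytes: raw 16-bit little-endian PCM, as WAV expects
      with wave.open(wav_path, 'wb') as w:
          w.setnchannels(1)            # TIMIT is mono
          w.setsampwidth(2)            # 16-bit samples
          w.setframerate(sample_rate)  # TIMIT is recorded at 16 kHz
          w.writeframes(pcm)
  ```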
- TIMIT .PHN files need to be mapped from 61 to 39 phonemes (mapping: see phoneme_set.py). Use transform.py again (an excerpt of the folding is sketched below):
  python processDataset/fixDataset/transform.py phonemes -i /home/matthijs/TCDTIMIT/TIMIT/original/TIMIT/ -o /home/matthijs/TCDTIMIT/TIMIT/processed
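  For illustration, an excerpt of the standard 61 -> 39 folding (Lee & Hon 1989) and a hypothetical per-line helper; the authoritative table is in phoneme_set.py and may differ in details:

  ```python
  # Excerpt of the 61 -> 39 folding; see phoneme_set.py for the full table.
  fold = {'ao': 'aa', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'hv': 'hh',
          'ix': 'ih', 'el': 'l', 'em': 'm', 'en': 'n', 'nx': 'n',
          'zh': 'sh', 'ux': 'uw', 'pcl': 'sil', 'tcl': 'sil', 'kcl': 'sil',
          'bcl': 'sil', 'dcl': 'sil', 'gcl': 'sil', 'h#': 'sil', 'pau': 'sil'}

  def fold_phn_line(line):
      """Map one .PHN line ('start end phoneme') onto the reduced phoneme set."""
      start, end, phn = line.split()
      return ' '.join((start, end, fold.get(phn, phn)))
  ```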
- You can generate an MLF (Master Label File) containing all phoneme frame and label info (a minimal parser sketch follows below):
  python processDataset/fixDataset/createMLF.py /home/matthijs/TCDTIMIT/TIMIT/processed
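  For reference, an HTK-style MLF starts with '#!MLF!#', then holds one entry per utterance, each terminated by a '.' line. A minimal parser sketch, assuming the common 'start end label' line format (read_mlf is a hypothetical helper, not part of the repo):

  ```python
  def read_mlf(path):
      """Parse an HTK Master Label File into {utterance: [(start, end, label), ...]}."""
      labels, current = {}, None
      with open(path) as f:
          assert f.readline().strip() == '#!MLF!#'  # MLF magic header
          for line in f:
              line = line.strip()
              if line.startswith('"'):              # a new utterance entry
                  current = line.strip('"')
                  labels[current] = []
              elif line == '.':                     # end of the current entry
                  current = None
              elif current is not None:
                  start, end, label = line.split()[:3]
                  labels[current].append((int(start), int(end), label))
      return labels
  ```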
Specific for TCDTIMIT:

- Problems:
  - Resample the audio WAV files so they are all 16kHz (not 48kHz as shipped in TCDTIMIT): see processDataset/fixDataset/resample.py in the audio repo (it lives there because it might also be needed for other databases). Improvement: use the resampy library, as in the sketch below.
  - Audio label files should use sample frames, not times.
  - There is only an MLF file; we need to extract separate .PHN files: see getPhnFiles.py in the TCDTIMITprocessing repo. Also edit the MLF files in the TCDTIMITprocessing repo to make sure their paths are correct (use Search & Replace for that).
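A minimal resampling sketch with the resampy library, assuming 16-bit mono WAVs (resample_wav is a hypothetical helper and may differ from resample.py in details):

```python
import numpy as np
import resampy
from scipy.io import wavfile

def resample_wav(in_path, out_path, target_rate=16000):
    """Resample a wav (e.g. 48 kHz TCDTIMIT audio) down to 16 kHz."""
    rate, data = wavfile.read(in_path)
    if rate != target_rate:
        data = resampy.resample(data.astype(np.float64), rate, target_rate)
    wavfile.write(out_path, target_rate, data.astype(np.int16))
```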
- cd into TCDTIMITprocessing. All of the following functions are taken care of in extractTCDTIMITaudio.py: edit it to specify source and target directories, then run.
  - Extracting phonemes:
    python getPhnFiles.py "./MLFfiles/lipspeaker_labelfiles.mlf" ~/TCDTIMIT/TCDTIMITaudio
    python getPhnFiles.py "./MLFfiles/volunteer_labelfiles.mlf" ~/TCDTIMIT/TCDTIMITaudio
  - Extracting the TCDTIMIT wav files:
    python helpFunctions/copyFilesOfType.py /media/matthijs/TOSHIBA_EXT/TCDTIMIT/ /home/matthijs/TCDTIMIT/TCDTIMITaudio "wav"
  - The files are all under one directory per speaker (eg Lipspkr1/Clips/straightcam/sa1.wav). Place them in the TIMIT standard file structure (Lipspkr1/sa1/sa1.wav), as in the sketch below:
    python helpFunctions/fixTCDTIMITwavStructure.py /home/matthijs/TCDTIMIT/TCDTIMITaudio/processed/
- You'll then also still have to run the transform.py wavs ... step as with the TIMIT data. This will take a while, as the TCDTIMIT wavs need to be resampled to 16kHz.
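A sketch of that restructuring step, assuming exactly the Clips/straightcam layout described above (fix_wav_structure is a hypothetical stand-in for fixTCDTIMITwavStructure.py):

```python
import os
import shutil

def fix_wav_structure(root):
    """Move <speaker>/Clips/straightcam/<utt>.wav to <speaker>/<utt>/<utt>.wav."""
    for speaker in os.listdir(root):
        clips = os.path.join(root, speaker, 'Clips', 'straightcam')
        if not os.path.isdir(clips):
            continue
        for fname in os.listdir(clips):
            if not fname.lower().endswith('.wav'):
                continue
            utt = os.path.splitext(fname)[0]
            dest = os.path.join(root, speaker, utt)
            if not os.path.isdir(dest):
                os.makedirs(dest)
            shutil.move(os.path.join(clips, fname), os.path.join(dest, fname))
```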
Use fixed dataset to generate .pkl files

- load in data + phonemes
- convert wav to MFCC
- label to class number
- map label sample numbers to frame windows
- mean (+ std dev) normalization
- one-hot encoding of targets
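A sketch of these steps for a single utterance, assuming python_speech_features for the MFCCs and a 10 ms frame step (names and parameters here are illustrative, not the repo's actual code):

```python
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

def preprocess_utterance(wav_path, phn_path, phoneme_classes, winstep=0.01):
    """Turn one wav + .PHN pair into normalized MFCC frames and one-hot targets."""
    rate, signal = wavfile.read(wav_path)
    feats = mfcc(signal, samplerate=rate, winstep=winstep)    # wav -> MFCC frames
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # mean/std normalization

    targets = np.zeros(len(feats), dtype=np.int32)
    with open(phn_path) as f:
        for line in f:
            start, end, phn = line.split()     # sample indices + phoneme label
            cls = phoneme_classes[phn]         # label -> class number
            # map the sample range onto the frame windows it covers
            first = int(int(start) / (rate * winstep))
            last = min(int(int(end) / (rate * winstep)), len(feats) - 1)
            targets[first:last + 1] = cls

    one_hot = np.eye(len(phoneme_classes))[targets]  # one-hot encoding of targets
    return feats, one_hot
```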
Use .pkl files as input to the network

- Before running the driver command (see here, at the bottom), edit start.sh and then do source start.sh in the terminal. It's used for setting some environment variables.
- Other useful scripts:
  - You can use TCDTIMITprocessing/copyFilesOfType for finding and extracting all wavs, phns, npz models etc:
    python3 fileDirOps.py ~/Documents/Dropbox/_MyDocs/_ku_leuven/Master_2/Thesis/convNets/code /media/matthijs/TOSHIBA_EXT/TCDTIMIT/zzzNPZmodels ".npz"
  - In-place replacing of all phonemes + counting phoneme occurrences:
    processDataset/fixDataset/substitute_phones.py
  - Clean up a directory structure (remove empty directories) by calling helpFunctions/removeEmptyDirs.py. From this source.
  - Visualization script: processDataset/fixDataset/helpFunctions/wavToPng.py produces time-domain and freq-domain visualizations of a wav file (a matplotlib sketch follows below).
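A minimal matplotlib sketch of such a visualization (wav_to_png is a hypothetical stand-in for wavToPng.py):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def wav_to_png(wav_path, png_path):
    """Plot a wav in the time domain (waveform) and frequency domain (spectrogram)."""
    rate, data = wavfile.read(wav_path)
    t = np.arange(len(data)) / float(rate)
    fig, (ax_time, ax_freq) = plt.subplots(2, 1)
    ax_time.plot(t, data)            # time-domain waveform
    ax_time.set_xlabel('time (s)')
    ax_freq.specgram(data, Fs=rate)  # frequency-domain spectrogram
    ax_freq.set_xlabel('time (s)')
    fig.savefig(png_path)
```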
-
- reasonably written
- 61 phonemes -> needs mapping (I can use TimoVNiedek's script for that); maybe also some hardcoded stuff
- output = .klepto, needs converting to npz. Shouldn't be hard
- RecNet, not Lasagne. Based on Theano, so I guess it should be similar
-
- really well written, OOP. seems quite easily adaptable
- Keras, so quite high level. Will need some changes, but preprocessing could be useful.
- 61 labels => :(
- executing:
  - set proper paths and source start.sh
  - in the same terminal: python driver.py model speech2phonemes train --data 10000 --summarize True
-
- well written, easy to follow
- tensorflow, so not necessarily suited for Theano
- they are doing some weird batch-thing in the preprocessing that I don't understand.
- uses 39 output phoneme classes -> can use to convert TIMIT labels.
-
- not so well written
- Lasagne = nice :)
- 61 phonemes -> needs mapping
2. Performance comparison with HMM (using HTK): this repo
- Gathering phonemes + counting: https://github.com/syhw/timit_tools/blob/master/src/create_phonesMLF_list_labels.py
  python createMLF.py /home/matthijs/TCDTIMIT/TIMIT/original/timit/
- Converting to 39 phonemes: this script, using the mapping dict 'timit_foldings.json' (in the Preprocessing folder):
  python substitute_phones.py /home/matthijs/TCDTIMIT/TIMIT/fixed/TIMIT/ helpFunctions/timit_foldings.json
- A good reference file for a Lasagne LSTM implementation.
- From: Graves2005, Framewise Phoneme Classification with Bidirectional LSTM Networks:
We found that large networks, of around 200,000 weights, gave good performance.
However, it is possible that smaller nets would have generalised better to the test set.
With the aim of keeping the number of parameters roughly constant,
we chose the following topologies:
• A unidirectional net with a hidden LSTM layer containing
205 memory blocks, with one cell each.
• A bidirectional net with two hidden LSTM layers (one
forwards and one backwards) each containing 140 one
cell memory blocks.
• A unidirectional net with a hidden layer containing 410
sigmoidal units.
• A bidirectional net with two hidden layers (one forwards
and one backwards) containing 280 sigmoidal units each.
All nets contained an input layer of size 26 (an input for each
MFCC coefficient), and an output layer of size 43.
The input layers were fully connected to the hidden layers and the hidden
layers fully connected to themselves and the output layers.
All LSTM blocks had the following activation functions:
logistic sigmoids in the range [−2, 2] for the input and output
squashing functions of the cell (g and h in Figure 1), and in
the range [0, 1] for the gates.
The non-LSTM nets had logistic sigmoid activations in the range [0, 1] in the hidden layers.
B. Output Layers
For the output layers, we used the cross entropy objective
function and the softmax activation function, as is standard
for 1 of K classification [4]. The softmax function ensures
that the network outputs are all between zero and one, and
that they sum to one on every timestep. This allows them to
be interpreted as the posterior probabilities of the phonemes at
a given frame, given all the inputs up to the current one (with
unidirectional nets) or all the inputs in the whole sequence
(with bidirectional nets).
Several alternative objective functions have been studied for
this task [7]. One modification in particular has been shown to
have a positive effect on full speech recognition (though not
necessarily on framewise classification). This is to weight the
error according to the duration of the current phoneme, which
ensures that short phonemes are as significant to the training
as longer ones.
C. Network Training
All nets were trained with gradient descent (error gradient
calculated with BPTT), using a learning rate of 10^−5 and
a momentum of 0.9. At the end of each utterance, weight
updates were carried out and network activations were reset
to 0.
For the unidirectional nets a delay of 4 timesteps was
introduced between the target and the current input — i.e. the
net always tried to predict the phoneme it had seen 4 timesteps
ago.
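A minimal Lasagne sketch of the bidirectional topology quoted above, using our 39 phoneme classes instead of the paper's 43 (build_blstm is illustrative, not the thesis code):

```python
from lasagne.layers import (ConcatLayer, DenseLayer, InputLayer,
                            LSTMLayer, ReshapeLayer)
from lasagne.nonlinearities import softmax

N_FEATURES = 26  # one input per MFCC coefficient, as in the paper
N_HIDDEN = 140   # 140 one-cell blocks per direction
N_CLASSES = 39   # we use 39 phonemes rather than the paper's 43

def build_blstm(batch_size=None, seq_len=None):
    """Framewise bidirectional LSTM classifier in the Graves2005 style."""
    l_in = InputLayer((batch_size, seq_len, N_FEATURES))
    l_fwd = LSTMLayer(l_in, N_HIDDEN)                  # forward pass over time
    l_bwd = LSTMLayer(l_in, N_HIDDEN, backwards=True)  # backward pass over time
    l_cat = ConcatLayer([l_fwd, l_bwd], axis=2)        # join both directions per frame
    # flatten (batch, time, 2*hidden) so the softmax acts on every frame
    l_flat = ReshapeLayer(l_cat, (-1, 2 * N_HIDDEN))
    return DenseLayer(l_flat, N_CLASSES, nonlinearity=softmax)
```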
- In the lipspeaker part: map frame numbers to times in the video. The audio also works with times; we need them to sync properly.