week 31.10 06.11

This week's tasks:

Summarize TCD-TIMIT (number of phonemes, number and duration of videos, ...)
get labels at right times (middle of phoneme interval) -> write script
extract images at right time out of the videos -> write script
resize/compress images for faster processing in networks (possibly grayscale as well?)
determine training/validation split
use images as inputs and labels as targets for a pre-trained network (ResNet)
if training takes too long, make custom network based on Lasagne examples

Parametrize scripts!

Summary of TCD TIMIT

English: 39 phonemes used (out of 61 possible labels; very similar phonemes and all kinds of pauses are merged). Visemes are not well defined.
TCD-TIMIT: 62 speakers reading a total of 6913 sentences. The speech material is 2255 sentences from TIMIT.
The speakers include 3 lipspeakers. Volunteers say 98 sentences each, while the lipspeakers say 377 sentences each.
Visual and audio-visual baseline results on the non-lipspeakers were low overall. Results on the lipspeakers were significantly higher.
Views: frontal and 30° (lip protrusions vs. mouth width, etc.)
ROI extraction
TIMIT (audio only): 6300 sentences spoken by 630 speakers from 8 US dialect regions. Each sentence comes with a WAV file, a PHN file and a WRD file.

CTC decoding network: Maas et al., "Lexicon-Free Conversational Speech Recognition with Neural Networks", NAACL, 2015.

Get ROI (mouth)

TCD-TIMIT already did this (see the MAT files), based on L. Cappelletta's thesis (VidTIMIT), using nostril detection.

Get Labels

Time to sample: the middle of the phoneme interval (=> the image that best represents that phoneme). Parametrize the time!
Source file: lipspeaker_labelfiles.mlf
Fix the paths in the MLF file before running all the scripts!!

"Q:/Videos/TCD-TIMIT/lipspeakers/Lipspkr1/Clips/straightcam/sa1.rec"
0 10300000 sil
10300000 11800000 sh
11800000 12300000 iy
12300000 14000000 hh
14000000 14700000 ae

-> MLF times are in units of 100 ns (10300000 = 1.03 s), so sh gets (1.03 + 1.18)/2 = 1.105 s

Goal formatting:

"lipspeakers/Lipspkr1/Clips/straightcam/sa1.mp4"
1.105 sh
1.205 iy
...
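
The conversion from the MLF entries to the goal format could be sketched as follows. This is only a minimal sketch, not the actual script: the function name `parse_mlf` and its parameters (`position`, `strip_prefix`, `video_ext`, `skip`) are hypothetical. It assumes HTK-style times in units of 100 ns, an optional `#!MLF!#` header and `.` terminator lines, and that silence (`sil`) is skipped, as the goal example suggests.

```python
import re

HTK_UNITS_PER_SECOND = 10_000_000  # assumption: HTK label times are in units of 100 ns


def parse_mlf(lines, position=0.5, strip_prefix="Q:/Videos/TCD-TIMIT/",
              video_ext=".mp4", skip=("sil",)):
    """Return {video_path: [(time_in_seconds, phoneme), ...]}.

    position=0.5 samples the middle of each phoneme interval;
    0.0 would be the start and 1.0 the end.
    """
    result = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line in ("#!MLF!#", "."):
            continue  # header, terminator or blank line
        if line.startswith('"'):
            # New recording: fix the path and swap the extension.
            path = line.strip('"')
            if path.startswith(strip_prefix):
                path = path[len(strip_prefix):]
            path = re.sub(r"\.\w+$", video_ext, path)
            current = result.setdefault(path, [])
            continue
        if current is None:
            continue  # label line before any file name: ignore
        start, end, phoneme = line.split()[:3]
        if phoneme in skip:
            continue
        start, end = int(start), int(end)
        seconds = (start + position * (end - start)) / HTK_UNITS_PER_SECOND
        current.append((round(seconds, 4), phoneme))
    return result


sample = [
    '"Q:/Videos/TCD-TIMIT/lipspeakers/Lipspkr1/Clips/straightcam/sa1.rec"',
    "0 10300000 sil",
    "10300000 11800000 sh",
    "11800000 12300000 iy",
]
labels = parse_mlf(sample)
for video, entries in labels.items():
    print(f'"{video}"')
    for seconds, phoneme in entries:
        print(seconds, phoneme)
```

Keeping `position` and `strip_prefix` as parameters follows the "Parametrize the time!" and "fix the paths" notes above, so neither the sampling point nor the dataset location is hard-coded.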

Extract images

for 1 frame, this works: ffmpeg -ss 00:00:01.615 -i sa1.mp4 -frames:v 1 sa1.jpg (extracts the image at 1 second, 615 ms into the video)
get correct time from the labels
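
To drive the ffmpeg call above from the label times, something like the following could work. It is a sketch under assumptions: the helper names (`format_timestamp`, `ffmpeg_frame_command`, `extract_frame`) are made up for illustration, and `extract_frame` requires an `ffmpeg` executable on the PATH. The command itself uses only the flags from the one-frame example above.

```python
import subprocess


def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm for ffmpeg's -ss option."""
    total_ms = round(seconds * 1000)
    secs, ms = divmod(total_ms, 1000)
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return f"{hours:02d}:{mins:02d}:{secs:02d}.{ms:03d}"


def ffmpeg_frame_command(video, seconds, image):
    """Build the argument list that extracts one frame at the given time."""
    return ["ffmpeg", "-ss", format_timestamp(seconds),
            "-i", video, "-frames:v", "1", image]


def extract_frame(video, seconds, image):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(ffmpeg_frame_command(video, seconds, image), check=True)


# Only builds and prints the command; call extract_frame(...) to actually run it.
print(" ".join(ffmpeg_frame_command("sa1.mp4", 1.615, "sa1.jpg")))
```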

ffmpeg
-> not needed anymore because we're using the .mat files

Process images

lower the resolution, e.g. to 224x224x3 for ResNet, or even lower
possibly convert to grayscale to reduce complexity? (not if using pretrained resnet)
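
The two processing steps could look roughly like this in NumPy. This is an illustrative sketch, not the chosen pipeline: `resize_nearest` and `to_grayscale` are hypothetical helpers, the nearest-neighbour resize is the simplest possible method (an image library with proper interpolation would likely give better quality), and the grayscale weights are the common ITU-R BT.601 luma coefficients.

```python
import numpy as np


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return image[rows[:, None], cols]


def to_grayscale(rgb):
    """Weighted sum of the R, G, B channels; returns a float (H, W) array."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights


frame = np.full((480, 640, 3), 255, dtype=np.uint8)  # dummy white frame
small = resize_nearest(frame, 224, 224)
gray = to_grayscale(small)
print(small.shape, gray.shape)
```

Note that a grayscale image has one channel, so it would not fit the 3-channel input of a pretrained ResNet without replicating the channel.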

Train vs Val

90% - 10% ?
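
A reproducible 90/10 split could be sketched like this. The function `split_train_val` and its parameters are hypothetical. Splitting at the level of videos rather than individual frames is an assumption made here, so that frames from one video never end up in both sets.

```python
import random


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, val) lists."""
    items = sorted(items)  # fixed starting order, independent of input order
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]


videos = [f"video{i:03d}.mp4" for i in range(100)]  # placeholder names
train, val = split_train_val(videos)
print(len(train), len(val))
```

The fixed seed keeps the split identical between runs, which makes validation results comparable across experiments.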

Pretrained network

ResNet-152
ResNet

Custom network

Lasagne Model
Remove layers to make it smaller, see how well it does.
Use eg 5-6 layers instead of 19
