general idea

Matthijs Van keirsbilck edited this page Mar 29, 2017 · 1 revision

General idea

The idea is to combine speech recognition and image processing:

  • speech recognition: processes audio and tries to guess the pronounced words based on the perceived frequencies. We get decent performance with convolutional NNs combined with an RNN for classification, but it is not robust to background noise.
  • image processing: processes a video/sequence of images and tries to guess the pronounced words based on mouth movements. Performance on its own is only okay, but it can complement audio processing because it is unaffected by background noise. It uses more bandwidth and processing power, though.

=> Use speech recognition by default because of its lower energy cost; when its performance drops because of noise, mix in some image processing to increase robustness to bad audio.
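
One way this switching could look is sketched below. This is only an illustration under assumptions, not the project's actual method: it assumes both recognizers output a probability distribution over the same word list, that some estimate of the audio signal-to-noise ratio (SNR, in dB) is available, and that a simple linear weighting is good enough. The names `audio_weight`, `recognize`, `get_video_probs` and the 0 dB / 20 dB thresholds are all hypothetical.

```python
def audio_weight(snr_db, low_db=0.0, high_db=20.0):
    """Map an estimated SNR (dB) to a weight in [0, 1] for the audio stream.

    The thresholds are placeholder assumptions: at or below low_db the audio
    is ignored, at or above high_db the audio is trusted completely.
    """
    if snr_db <= low_db:
        return 0.0
    if snr_db >= high_db:
        return 1.0
    return (snr_db - low_db) / (high_db - low_db)


def recognize(audio_probs, get_video_probs, snr_db):
    """Combine audio and video word probabilities, running video only if needed.

    audio_probs     -- list of word probabilities from the audio model
    get_video_probs -- zero-argument callable returning the video model's
                       probabilities; it is only called when the audio is
                       noisy, which is where the energy saving comes from
    snr_db          -- estimated signal-to-noise ratio of the audio
    """
    w = audio_weight(snr_db)
    if w >= 1.0:
        # Clean audio: skip the expensive image processing entirely.
        return list(audio_probs)

    video_probs = get_video_probs()
    if len(video_probs) != len(audio_probs):
        raise ValueError("both models must score the same word list")

    # Weighted average of the two distributions, renormalized to sum to 1.
    mixed = [w * a + (1.0 - w) * v for a, v in zip(audio_probs, video_probs)]
    total = sum(mixed)
    return [p / total for p in mixed]
```

With clean audio the video model is never invoked, so the extra bandwidth and processing cost is only paid when the audio is actually degraded. A learned fusion layer could replace the fixed linear weighting; the linear rule is used here only to keep the sketch short.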

If this works reasonably well, we can start mapping it to dedicated hardware.