Hi,
I have a question about how you reconstruct the audio from the latent space (https://github.com/swasun/VQ-VAE-Speech/blob/master/src/models/wavenet_vq_vae.py#L114). It seems that the decoder is conditioned on x_dec, which is presumably the original audio signal. Have you done any kind of study on reconstructing using only the encoded discrete variables? What would the performance be?
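To clarify what I'm asking, here is a toy sketch of the two decoding modes I have in mind. The names (`decoder_step`, `codes`) are purely illustrative and not from your repository; the point is just the difference between feeding back the ground-truth signal versus the model's own output.

```python
# Hypothetical sketch: teacher-forced reconstruction (conditioned on the
# original audio, like x_dec) vs. free-running generation from codes only.

def decoder_step(prev_sample, code):
    # Stand-in for one autoregressive decoder step: the real WaveNet
    # predicts the next sample from past audio plus the conditioning code.
    return 0.5 * prev_sample + 0.1 * code

def teacher_forced(original_audio, codes):
    # Each step sees the *ground-truth* previous sample, so prediction
    # errors never accumulate.
    out, prev = [], 0.0
    for x, c in zip(original_audio, codes):
        out.append(decoder_step(prev, c))
        prev = x  # feed the original signal back in
    return out

def free_running(codes):
    # Each step feeds back the model's own previous output, so the
    # reconstruction depends only on the discrete codes.
    out, prev = [], 0.0
    for c in codes:
        prev = decoder_step(prev, c)
        out.append(prev)
    return out
```

I'm curious how much reconstruction quality degrades when going from the first mode to the second.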
Thanks!