End-to-end speech enhancement models using deep learning
Suppressing background noise in recorded speech (also known as speech enhancement) has widespread applications in mobile communication, voice-activated systems and hearing aids. Recently, deep neural network (DNN)-based approaches have gained success in speech enhancement because they can efficiently learn speech and noise statistics from a training dataset of paired noisy and clean speech. Popular neural-network-based speech enhancement systems operate on the magnitude spectrogram and ignore the phase mismatch between the noisy and clean speech signals. However, it is well established that speech quality improves significantly when the clean phase spectrum is known. This talk therefore concentrates on recent approaches to DNN-based speech enhancement in the time domain. Such models are also known as end-to-end models, since the time-domain speech signal is not transformed into any other domain before being fed to the DNN.
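The phase-mismatch point can be illustrated with a minimal numpy sketch (not one of the talk's models; the synthetic signal, noise level and frame length are all illustrative). Even with an oracle clean-magnitude estimate, pairing it with the noisy phase limits reconstruction quality, whereas the clean phase recovers the signal essentially exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one analysis frame of clean speech, plus noise.
n = 512
t = np.arange(n)
clean = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 13 * t / n)
noisy = clean + 0.4 * rng.standard_normal(n)

# Oracle magnitude enhancement: keep the *clean* magnitude spectrum, but
# pair it with either the noisy phase (what magnitude-domain systems do)
# or the clean phase (the ideal case).
C, N = np.fft.rfft(clean), np.fft.rfft(noisy)
with_noisy_phase = np.fft.irfft(np.abs(C) * np.exp(1j * np.angle(N)), n)
with_clean_phase = np.fft.irfft(np.abs(C) * np.exp(1j * np.angle(C)), n)

def snr_db(ref, est):
    # Signal-to-noise ratio of a reconstruction, in dB.
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est) ** 2))

# A perfect magnitude estimate is still limited by the noisy phase; the
# clean phase reconstructs the frame up to floating-point rounding.
print(snr_db(clean, with_noisy_phase))
print(snr_db(clean, with_clean_phase))
```

Running this shows a large SNR gap between the two reconstructions, which is the motivation for operating directly on the time-domain waveform.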
In particular, we will discuss end-to-end models using deep convolutional neural networks (DeepCNNs) that map the time-domain noisy speech signal to the underlying clean speech signal. In addition, conditional generative adversarial networks (cGANs) have recently shown promise in improving DeepCNN-based speech enhancement. This talk will introduce the DeepCNN approaches and their GAN variants, covering model architectures, loss functions and training strategies.
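As a rough sketch of the kind of loss functions involved, the following numpy snippet writes out a least-squares cGAN objective with an added L1 fidelity term, one common choice for GAN-based enhancement; the weighting `lambda_l1 = 100.0` and the convention that the discriminator scores (signal, noisy-condition) pairs are illustrative assumptions, not the talk's specific formulation:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Least-squares discriminator objective: push scores of (clean, noisy)
    # pairs toward 1 and scores of (enhanced, noisy) pairs toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake**2)

def g_loss(d_fake, enhanced, clean, lambda_l1=100.0):
    # Generator objective: fool the discriminator (adversarial term) while
    # keeping the enhanced waveform close to the clean target (L1 term).
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    fidelity = np.mean(np.abs(enhanced - clean))
    return adv + lambda_l1 * fidelity

# Toy usage with hypothetical discriminator scores and waveform batches.
d_fake = np.array([0.3, 0.4])          # D's scores on enhanced speech
d_real = np.array([0.9, 0.8])          # D's scores on clean speech
enhanced = np.zeros((2, 8))            # batch of enhanced waveforms
clean = np.zeros((2, 8))               # batch of clean targets
print(d_loss(d_real, d_fake))
print(g_loss(d_fake, enhanced, clean))
```

In practice the L1 term dominates early training and anchors the generator to the paired data, while the adversarial term sharpens the output distribution.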