Prediction of speech intelligibility with deep neural networks and automatic speech recognition: Influence of training noise on model predictions
Accurate models of speech intelligibility (SI) can help to optimize speech enhancement algorithms, and reference-free SI models could potentially also serve as a model-in-the-loop for real-time monitoring of SI in listening devices. Such models have to work without a speech or a noise reference. Spille et al. [(2018) Comp. Speech & Lang. doi:10.1016/j.csl.2017.10.004] created a model for predicting the speech reception threshold in noise based on a deep neural network (DNN) and automatic speech recognition (ASR). This model was blind to the speech signals (since it was based on a speaker-independent ASR system), but it used the same noise signals for training and testing, which bears the risk of overfitting the model to the specific noise signals.
To investigate whether overfitting plays a major role in this context, we modified the training procedure of the original model: instead of using the same noise signal for training and testing, we used the same noise source but different noise signals, which should be especially challenging when the noise source is a competing talker. This modification is one step toward SI models that require neither a speech nor a noise reference. To test the DNN-based model, Spille et al. (2018) used eight different noise types, ranging from speech-shaped noise to a single talker. Six of the noises were derived from the international speech test signal (ISTS), which has a length of 60 seconds. To obtain a sufficient amount of noise material in the current study, we created a new noise signal that resembles the ISTS. On this basis, new noises were generated, each with a length of 11 hours. For training and testing the DNN, each noise signal was split into two parts: approximately 80% was used for training and 20% for testing. The results of our modified model are similar to those of Spille et al. (2018): the predictions of the 50% speech reception threshold, with an RMS error below 2.5 dB, are close to those of the original model (RMS error of 1.9 dB). Both models outperform baseline models such as the SII (RMS error of 7.9 dB), the ESII (5.6 dB), the STOI (9.2 dB), and the mr-sEPSM (3.5 dB).
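The evaluation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the sample rate, and the SRT values are hypothetical, and the actual model predicts speech reception thresholds from ASR performance rather than from the raw noise alone.

```python
import numpy as np

def split_noise(noise, train_fraction=0.8):
    """Split one long noise signal into disjoint training and test parts,
    so that training and testing never share the same noise samples."""
    cut = int(len(noise) * train_fraction)
    return noise[:cut], noise[cut:]

def rms_error(predicted_srt, measured_srt):
    """Root-mean-square prediction error in dB across noise conditions."""
    diff = np.asarray(predicted_srt, dtype=float) - np.asarray(measured_srt, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative stand-in for one long ISTS-like noise signal
# (shortened here; the study used 11-hour signals).
rng = np.random.default_rng(0)
noise = rng.standard_normal(16000 * 60)  # 60 s at a 16 kHz sample rate
train_noise, test_noise = split_noise(noise)  # ~80% / ~20%

# Hypothetical predicted and measured SRTs (dB SNR) for eight noise types.
predicted = [-7.1, -5.3, -9.8, -4.2, -12.5, -6.0, -8.4, -3.1]
measured = [-6.0, -6.1, -8.0, -5.5, -14.0, -7.2, -7.0, -4.4]
print(rms_error(predicted, measured))
```

Splitting a single long recording by time, rather than reusing identical noise samples, is what keeps the test noise unseen while preserving the noise source.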