A new measure to predict the a priori performance of automatic transcription systems on reverberated speech
Advances in Automatic Speech Recognition (ASR) make it possible to use this technology in increasingly uncontrolled environments. However, reverberation remains a challenge. The purpose of this study is to predict, a priori, the impact of reverberation on ASR performance. We analysed the statistical behaviour of the vocal excitation of vowels and derived a measure from this observation, called Excitation Behaviour (EB). To link the EB to the Word Error Rate (WER) obtained by the transcription system, we fitted a regression model and evaluated it by its mean prediction error. We applied the same protocol to two other measures: the Speech-to-Reverberation Modulation energy Ratio (SRMR) and the Spectral Decay Distribution (SDD). The speech corpus is the Wall Street Journal corpus (WSJ0 and WSJ1). Speech was artificially reverberated using Room Impulse Responses (RIRs) from the REVERB challenge corpus. The recorded RIRs make it possible to simulate 7 reverberation conditions: 3 rooms of different sizes (Small, Medium and Large), each with 2 distances between the speaker and the microphone array (near = 50 cm and far = 200 cm), plus one condition without reverberation. The ASR system was trained on the train_si284 subset, the regression model was trained on dev93, and eval92 was used to test the prediction. The results show that the EB obtains an average prediction error of 13.88, while the SRMR obtains 17.87 and the SDD 17.48. Using 20 utterances (approximately 2 min 20 s), the average prediction error decreases to 7.63 for the EB, 13.08 for the SRMR and 13.02 for the SDD. The EB measure is better correlated with ASR performance than the other reverberation measures tested.
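
The evaluation protocol described above (fit a regression from a reverberation measure to the WER on one set, then report the mean prediction error on another) can be illustrated with a minimal sketch. Several points are assumptions, not details given in this abstract: the regression is taken to be linear, the mean prediction error is taken to be the mean absolute difference between predicted and observed WER, and pooling over 20 utterances is modelled as a plain average of per-utterance values. All function names are hypothetical and the numbers in the demo are synthetic placeholders, not results from the study.

```python
import numpy as np


def fit_linear_regression(measure, wer):
    """Least-squares fit of wer ~ slope * measure + intercept.

    A linear model is an assumption of this sketch.
    """
    x = np.asarray(measure, dtype=float)
    y = np.asarray(wer, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


def predict_wer(model, measure):
    """Apply a fitted (slope, intercept) pair to new measure values."""
    slope, intercept = model
    return slope * np.asarray(measure, dtype=float) + intercept


def mean_prediction_error(predicted, observed):
    """Mean absolute difference between predicted and observed WER (assumed definition)."""
    p = np.asarray(predicted, dtype=float)
    o = np.asarray(observed, dtype=float)
    return float(np.mean(np.abs(p - o)))


def group_average(values, group_size):
    """Average consecutive groups of `group_size` values, dropping any remainder.

    Stands in for pooling a measure over several utterances (assumed scheme).
    """
    v = np.asarray(values, dtype=float)
    usable = (len(v) // group_size) * group_size
    return v[:usable].reshape(-1, group_size).mean(axis=1)


if __name__ == "__main__":
    # Synthetic placeholder data: one measure value and one WER per utterance.
    rng = np.random.default_rng(0)
    train_measure = rng.uniform(0.0, 1.0, size=200)
    train_wer = 60.0 * train_measure + 10.0 + rng.normal(0.0, 5.0, size=200)
    test_measure = rng.uniform(0.0, 1.0, size=100)
    test_wer = 60.0 * test_measure + 10.0 + rng.normal(0.0, 5.0, size=100)

    # Fit on the "training" set, evaluate on the held-out "test" set.
    model = fit_linear_regression(train_measure, train_wer)
    per_utterance = mean_prediction_error(predict_wer(model, test_measure), test_wer)

    # Same evaluation after pooling measure and WER over groups of 20 utterances.
    pooled_measure = group_average(test_measure, 20)
    pooled_wer = group_average(test_wer, 20)
    pooled = mean_prediction_error(predict_wer(model, pooled_measure), pooled_wer)

    print(f"per-utterance error: {per_utterance:.2f}")
    print(f"20-utterance error:  {pooled:.2f}")
```

In the setup described above, the training arrays would hold the measure (EB, SRMR or SDD) and the WER obtained on dev93, and the test arrays those obtained on eval92.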