Comparing approaches towards robust voice activity detection in noise
We examine several approaches to robust Voice Activity Detection (VAD) for automatic speech recognition (ASR) applications at low signal-to-noise ratios (SNRs). The aim is to identify the best-suited VAD component for real-world speech interaction systems (e.g., human-machine interaction or home automation). Generally speaking, the VAD component of a speech interaction system decides which segments of the incoming audio stream are forwarded to the ASR component for analysis.
We evaluate three distinct VAD methods in various low-SNR scenarios, examine their combination, and analyze how context (prior speaker information as well as changing noise characteristics) influences each method. The first method is based on an energy threshold in the frequency domain (subSNR), the second evaluates a model of spectral peak frequencies extracted from isolated vowel segments (peakSig), and the third is a supervised machine learning approach using a Support Vector Machine (svmVAD).
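To illustrate the general idea behind an energy threshold in the frequency domain, the following minimal sketch flags a frame as speech when its spectral energy exceeds that of a stored noise print by a fixed margin. It is not the subSNR implementation evaluated here: the function names, the full-band energy sum, and the 3 dB default threshold are assumptions chosen for illustration only.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum of a frame via a naive DFT (illustrative, not optimized)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def frame_snr_db(frame, noise_print, eps=1e-12):
    """Ratio (in dB) of the frame's spectral energy to the energy of the noise print."""
    signal_energy = sum(power_spectrum(frame))
    noise_energy = sum(noise_print)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def energy_vad(frames, noise_print, threshold_db=3.0):
    """Mark each frame as speech (True) if it exceeds the noise print by threshold_db.

    threshold_db is a hypothetical default, not a value taken from the study.
    """
    return [frame_snr_db(frame, noise_print) > threshold_db for frame in frames]


# Toy data: a low-level alternating "noise" frame and a louder sinusoidal "speech" frame.
noise_frame = [0.01 * (-1) ** t for t in range(16)]
speech_frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
noise_print = power_spectrum(noise_frame)
decisions = energy_vad([noise_frame, speech_frame], noise_print)
```

In such a scheme, re-estimating `noise_print` from recent non-speech frames would be one way to follow changing noise characteristics.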
The speech data consist of clean speech from 52 subjects, each of whom contributes the same 64 utterances; as additive noise, we used 8 different industrial noise recordings. For each speaker, we reserve one quarter of the utterances for testing and build individual training data, depending on the contextual information, from the remaining data of all speakers. With 8 noise types, 4 SNRs (-5, 0, 5, and 10 dB), and two different sets of training utterances for each speaker, we created 64 datasets (8 x 4 x 2). We then averaged the resulting evaluation metrics (F1 score for VAD performance and word error rate, WER, for ASR performance) over all speakers to estimate the performance of the system under test.
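The experimental grid and the VAD metric can be sketched as follows. The snippet enumerates the 8 x 4 x 2 conditions, computes the noise gain needed to reach a target SNR, and evaluates a frame-level F1 score. The noise labels, the training-set names `set_A` and `set_B`, and all function names are placeholders assumed for illustration, as is the frame-level scoring; none of them is taken from the actual experimental code.

```python
import itertools
import math

NOISE_TYPES = [f"noise_{i}" for i in range(8)]  # placeholder labels for the 8 recordings
SNRS_DB = [-5, 0, 5, 10]
TRAINING_SETS = ["set_A", "set_B"]  # placeholder names for the two training sets


def conditions():
    """All combinations of noise type, SNR, and training set."""
    return list(itertools.product(NOISE_TYPES, SNRS_DB, TRAINING_SETS))


def noise_gain(speech, noise, snr_db):
    """Gain g such that 10*log10(P_speech / P_(g*noise)) equals snr_db."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))


def mix(speech, noise, snr_db):
    """Add scaled noise to clean speech at the requested SNR."""
    g = noise_gain(speech, noise, snr_db)
    return [s + g * n for s, n in zip(speech, noise)]


def f1_score(reference, hypothesis):
    """Frame-level F1 of binary speech/non-speech decisions."""
    tp = sum(1 for r, h in zip(reference, hypothesis) if r and h)
    fp = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    fn = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


grid = conditions()
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
```

Averaging `f1_score` over all speakers for each entry of `grid` would give one performance estimate per condition.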
Results show that svmVAD achieves the best performance on both metrics across all SNRs and noise types. SubSNR performs worst, especially at very low SNRs. Contextual information (i.e., updating the noise prints over time) is most beneficial for subSNR, since this method directly incorporates the noise properties. Prior speaker information (i.e., training and test data from the same speaker) does not improve performance; hence, the properties used by peakSig and svmVAD appear to be speaker-independent. Finally, combining subSNR and peakSig improves performance but still does not reach that of svmVAD. In conclusion, svmVAD outperforms the signal processing methods in all regards, which suggests using this method in real-world speech interaction systems.