Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation

Vesperini, Fabio; Vecchiotti, Paolo; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco

doi:10.1109/IJCNN.2016.7727633

This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Notable improvements are observed with respect to a previous work. An extensive comparative analysis is conducted among four different neural networks (NN). In particular, we exploit Deep Belief Network (DBN), Multi-Layer Perceptron (MLP), Bidirectional Long Short-Term Memory recurrent neural network (BLSTM) and Convolutional Neural Network (CNN). The latter has recently met with great success in the computational audio processing field and has been successfully employed in our task. Two home-recorded datasets are used in order to approximate real-life scenarios. They contain audio files from several microphones arranged in various rooms, from which six features are extracted and used as input for the deep neural classifiers. The output stage has been redesigned with respect to the authors' previous contribution, in order to take advantage of the networks' discriminative ability. Our study consists of a multi-stage analysis focusing on the selection of the features, the network size and the input microphones. Results are evaluated in terms of Speech Activity Detection (SAD) error rate. As a result, the best SAD values reached are 5.8% and 2.6% on the two considered datasets, respectively. In addition, the CNN shows significant robustness with respect to microphone positioning.

Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation / Vesperini, Fabio; Vecchiotti, Paolo; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco. - ELECTRONIC. - (2016), pp. 3391-3398. (Paper presented at the IJCNN 2016 conference, held in Vancouver, Canada, 24-29 July 2016) [10.1109/IJCNN.2016.7727633].