A Deep Neural Network approach for Voice Activity Detection in multi-room domestic scenarios / Ferroni, Giacomo; Bonfigli, Roberto; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco. - Volume 2015 (2015). (Paper presented at the International Joint Conference on Neural Networks, IJCNN 2015, held in Killarney, Ireland, from 12 July 2015 through 17 July 2015) [10.1109/IJCNN.2015.7280510].
A Deep Neural Network approach for Voice Activity Detection in multi-room domestic scenarios
Ferroni, Giacomo; Bonfigli, Roberto; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco
2015-01-01
Abstract
This paper presents a Voice Activity Detector (VAD) for multi-room domestic scenarios. A multi-room VAD (mVAD) simultaneously detects the time boundaries of a speech segment and determines the room where it was generated. The proposed approach is fully data-driven and is based on a Deep Neural Network (DNN) pre-trained as a Deep Belief Network (DBN) and fine-tuned by standard error back-propagation. Six different types of feature sets are extracted from multiple microphone signals and combined in order to perform the classification. The proposed DBN-DNN multi-room VAD (referred to simply as DBN-mVAD) is compared to two other NN-based mVADs: a Multi-Layer Perceptron (MLP-mVAD) and a Bidirectional Long Short-Term Memory recurrent neural network (BLSTM-mVAD). A large multi-microphone dataset, recorded in a home, is used to assess performance through a multi-stage analysis strategy in which feature selection stages alternate with network size selection and input microphone selection. The proposed approach notably outperforms the alternative algorithms in the first feature selection stage and in the network selection stage. In terms of area under the precision-recall curve (AUC), the absolute increment is 5.55% with respect to the BLSTM-mVAD and 2.65% with respect to the MLP-mVAD. Hence, only the proposed approach undergoes the remaining selection stages. In particular, the DBN-mVAD achieves significant improvements: the absolute increments in AUC and F-measure are 10.41% and 8.56%, respectively, with respect to the first stage of the DBN-mVAD.
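The two metrics quoted in the abstract, area under the precision-recall curve and F-measure, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `precision_recall_f` and `pr_auc`, the frame-level 0/1 labels, and the use of trapezoidal integration over recall are all assumptions made for illustration, and the record does not state how the paper computes the area.

```python
def precision_recall_f(labels, scores, threshold):
    """Precision, recall and F-measure at one decision threshold.

    labels: 0/1 ground truth per frame (1 = speech in the target room).
    scores: detector output per frame; a frame counts as speech when
            its score is >= threshold. (Hypothetical interface.)
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def pr_auc(labels, scores):
    """Area under the precision-recall curve.

    Assumption: trapezoidal integration over recall, sweeping every
    distinct score as a threshold, anchored at (recall=0, precision=1).
    """
    points = [(0.0, 1.0)]  # (recall, precision)
    for t in sorted(set(scores), reverse=True):
        p, r, _ = precision_recall_f(labels, scores, t)
        points.append((r, p))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Toy frame-level example (made-up numbers, not data from the paper).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
p, r, f = precision_recall_f(labels, scores, threshold=0.5)
auc = pr_auc(labels, scores)
```

An "absolute increment" between two systems is then the plain difference of their metric values expressed in percentage points, for example `100 * (auc_a - auc_b)`, rather than a relative change.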