This paper investigates Convolutional Neural Networks (CNNs) with 3-D kernels for Voice Activity Detection in multi-room domestic scenarios (mVAD). The approach is compared with a Multi-Layer Perceptron (MLP), and improvements are observed with respect to the authors' previous work. To approximate real-life scenarios, the DIRHA dataset is used; it was recorded in a home environment with several microphones placed in various rooms. The study consists of a multi-stage analysis covering the selection of the network size and of the input microphones, in terms of both their number and position. Results are evaluated in terms of the Speech Activity Detection (SAD) error rate. The CNN-mVAD outperforms the MLP-based method with notably robust performance statistics, achieving a SAD of 7.0% in the best overall case.
Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment / Vecchiotti, Paolo; Vesperini, Fabio; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco. - Electronic. - 69:(2017), pp. 161-170. [10.1007/978-3-319-56904-8_16]
Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment
Vecchiotti, Paolo; Vesperini, Fabio; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco
2017-01-01