Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment / Vecchiotti, Paolo; Vesperini, Fabio; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco. - Electronic. - 69:(2017), pp. 161-170. [10.1007/978-3-319-56904-8_16]

Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment

Vecchiotti, Paolo; Vesperini, Fabio; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco
2017-01-01

Abstract

This paper focuses on employing Convolutional Neural Networks (CNN) with 3-D kernels for Voice Activity Detectors in multi-room domestic scenarios (mVAD). The approach is compared with the Multi Layer Perceptron (MLP), and improvements are observed with respect to the authors' previous work. To approximate real-life scenarios, the DIRHA dataset is used; it was recorded in a home environment by means of several microphones arranged in various rooms. The study consists of a multi-stage analysis focusing on the selection of the network size and of the input microphones, in relation to their number and position. Results are evaluated in terms of Speech Activity Detection error rate (SAD). The CNN-mVAD outperforms the MLP with notably robust performance statistics, achieving a SAD of 7.0% in the best overall case.
2017
Multidisciplinary Approaches to Neural Computing

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/241540

Citations
  • Scopus 9