Hierarchic convnets framework for rare sound event detection / Vesperini, F.; Droghini, D.; Principi, E.; Gabrielli, L.; Squartini, S. - (2018), pp. 1497-1501. (Paper presented at the 26th European Signal Processing Conference, EUSIPCO 2018, held in Rome, Italy, in 2018) [10.23919/EUSIPCO.2018.8553089].
Hierarchic convnets framework for rare sound event detection
Vesperini F.; Droghini D.; Principi E.; Gabrielli L.; Squartini S.
2018-01-01
Abstract
In this paper, we propose a system for rare sound event detection using a hierarchical and multi-scaled approach based on Convolutional Neural Networks (CNN). The task consists in the detection of event onsets from artificially generated mixtures. Spectral features are extracted from frames of the acoustic signals; a first event detection stage then operates as a binary classifier at frame rate and proposes to the second stage contiguous blocks of frames which are assumed to contain a sound event. The second stage refines the event detection of the prior network, discarding blocks that contain background sounds wrongly classified by the first stage. Finally, the effective onset time of the active event is obtained. The performance of the algorithm has been assessed with the material provided for the second task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The achieved overall error rate of 0.22 and F-measure of 88.50% on the evaluation dataset significantly outperform the challenge baseline, and the system offers improved generalization performance with a reduced number of free network parameters with respect to other competitive algorithms.
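The hierarchical pipeline described in the abstract can be illustrated with a minimal sketch. The two CNN stages are not reproduced here; instead, the first-stage frame-level scores and the second-stage block verifier are stand-ins (the function names `propose_blocks`, `detect_onset`, the score threshold, and the hop size are all illustrative assumptions, not the authors' implementation). The sketch shows only the glue logic: grouping contiguous positive frames into candidate blocks, letting a refinement function reject false alarms, and converting the first accepted block into an onset time.

```python
import numpy as np

def propose_blocks(frame_scores, threshold=0.5):
    """First-stage output: group contiguous frames whose score exceeds
    the (assumed) threshold into candidate blocks (start, end) given
    as half-open frame-index ranges."""
    active = frame_scores > threshold
    blocks, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(active)))
    return blocks

def detect_onset(frame_scores, refine_fn, threshold=0.5, hop_s=0.02):
    """Two-stage detection: candidate blocks proposed at frame rate are
    passed to a second-stage verifier refine_fn(start, end) -> bool,
    which discards blocks of wrongly classified background; the onset of
    the first accepted block is returned in seconds (None if no event).
    The 20 ms hop is an illustrative assumption."""
    for start, end in propose_blocks(frame_scores, threshold):
        if refine_fn(start, end):
            return start * hop_s
    return None

# Toy usage: a short false-alarm burst at frame 2 and a true event at
# frames 5-8; the mock second stage rejects blocks shorter than 3 frames.
scores = np.array([0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.9, 0.95, 0.9, 0.2])
onset = detect_onset(scores, refine_fn=lambda s, e: (e - s) >= 3)
print(onset)  # onset of the accepted block, in seconds
```

In the paper the verifier is itself a CNN operating on the proposed block of spectral frames; the duration heuristic above merely stands in for it to keep the sketch self-contained.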