This paper presents an effective technique for training an acoustic model from large, unsynchronized audio and text chunks. Given such a speech corpus, an algorithm is defined that automatically segments each chunk into smaller fragments and synchronizes them with the corresponding text. These smaller fragments are better suited to the standard model training algorithms used in automatic speech recognition systems. The proposed approach is particularly suitable for bootstrapping language models without relying on specialized training material or borrowing from models trained for other, similar languages. Extensive experimentation with the CMU Sphinx 4 recognizer and the SphinxTrain model generator, in a setting designed for large-vocabulary continuous speech recognition, shows the effectiveness of the approach.
Semi-automatic acoustic model generation from large unsynchronized audio and text chunks / Alessandrini, Michele; Biagetti, Giorgio; Curzi, Alessandro; Turchetti, Claudio. - (2011), pp. 1681-1684. (Paper presented at the Interspeech 2011 conference, held in Florence, Italy, 27-31 August 2011).
Semi-automatic acoustic model generation from large unsynchronized audio and text chunks
Alessandrini, Michele; Biagetti, Giorgio; Curzi, Alessandro; Turchetti, Claudio
2011-01-01