Hidden Markov model (HMM) based text-to-speech (TTS) has become one of the most promising approaches, as it has proven to be a particularly flexible and robust framework to generate synthetic speech. However, several factors such as mel-cepstral vocoder and over-smoothing are responsible for causing quality degradation of synthetic speech. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation to cope with these two issues. To this end, we use an analysis/synthesis technique based on MDCT that guarantees a perfect reconstruction of the signal frame from feature vectors and allows for a 50% overlap between frames without increasing the data vector, in contrast to the conventional mel-cepstral spectral parameters that do not ensure direct speech waveform reconstruction. Experimental results show that a sound of good quality, conveniently evaluated using both objective and subjective tests, is obtained.
HMM speech synthesis based on MDCT representation / Biagetti, Giorgio; Crippa, Paolo; Falaschetti, Laura; Turchetti, Claudio. - In: INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY. - ISSN 1381-2416. - 21:4(2018), pp. 1045-1055. [10.1007/s10772-018-09571-9]
HMM speech synthesis based on MDCT representation
Biagetti, Giorgio;Crippa, Paolo;Falaschetti, Laura
;Turchetti, Claudio
2018-01-01
Abstract
Hidden Markov model (HMM) based text-to-speech (TTS) has become one of the most promising approaches, as it has proven to be a particularly flexible and robust framework to generate synthetic speech. However, several factors such as mel-cepstral vocoder and over-smoothing are responsible for causing quality degradation of synthetic speech. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation to cope with these two issues. To this end, we use an analysis/synthesis technique based on MDCT that guarantees a perfect reconstruction of the signal frame from feature vectors and allows for a 50% overlap between frames without increasing the data vector, in contrast to the conventional mel-cepstral spectral parameters that do not ensure direct speech waveform reconstruction. Experimental results show that a sound of good quality, conveniently evaluated using both objective and subjective tests, is obtained.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.