
Learning HMM State Sequences from Phonemes for Speech Synthesis / Biagetti, Giorgio; Crippa, Paolo; Falaschetti, Laura; Orcioni, Simone; Turchetti, Claudio. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - Electronic. - 96:(2016), pp. 1589-1596. (Paper presented at the 20th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES-2016), held in York, UK, 5-7 September 2016) [10.1016/j.procs.2016.08.206].

Learning HMM State Sequences from Phonemes for Speech Synthesis

Biagetti, Giorgio; Crippa, Paolo; Falaschetti, Laura; Orcioni, Simone; Turchetti, Claudio
2016-01-01

Abstract

This paper presents a technique for learning hidden Markov model (HMM) state sequences from phonemes that, combined with the modified discrete cosine transform (MDCT), is useful for speech synthesis. Mel-cepstral spectral parameters, currently adopted as features for HMM acoustic modeling in conventional methods, do not allow direct reconstruction of the speech waveform. In contrast to these approaches, we use an analysis/synthesis technique based on the MDCT that guarantees perfect reconstruction of the signal frame feature vectors and allows for a 50% overlap between frames without increasing the data rate. Experimental results show that the spectrograms obtained with the suggested technique closely match the original spectrograms, and the quality of the synthesized speech is conveniently evaluated using the well-known Itakura-Saito measure.
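The abstract's key property — 50% frame overlap with perfect reconstruction and no data-rate increase — follows from the time-domain aliasing cancellation of the MDCT: each frame of 2N windowed samples yields only N coefficients, and the aliasing introduced by inversion cancels between adjacent overlapped frames. The following is a minimal numerical sketch of that property, not the authors' implementation; the function names, frame layout, and choice of the Princen-Bradley sine window are illustrative assumptions.

```python
import numpy as np

def mdct_pair(N):
    """Build the MDCT analysis basis (N x 2N) and sine window for frame length 2N."""
    n = np.arange(2 * N)
    k = np.arange(N)[:, None]
    # Standard MDCT cosine basis: cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))
    C = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    # Sine window satisfying the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1
    w = np.sin(np.pi / (2 * N) * (n + 0.5))
    return C, w

def analyze(x, N):
    """MDCT analysis: 50%-overlapping frames of 2N samples -> N coefficients each.

    One length-N coefficient vector per N-sample hop, so the data rate
    is unchanged despite the 50% overlap.
    """
    C, w = mdct_pair(N)
    frames = [x[t:t + 2 * N] for t in range(0, len(x) - N, N)]
    return [C @ (w * f) for f in frames]

def synthesize(coeffs, N, length):
    """IMDCT plus windowed overlap-add; aliasing cancels between adjacent frames."""
    C, w = mdct_pair(N)
    y = np.zeros(length)
    for t, X in enumerate(coeffs):
        # Scale 2/N makes the overlap-add of windowed inverse frames sum to x
        y[t * N:t * N + 2 * N] += w * ((2.0 / N) * (C.T @ X))
    return y
```

Reconstruction is exact on all interior samples (the first and last N samples lack an overlapping partner frame, as is usual for lapped transforms).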
2016
Files for this item:
No files are associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/238344

Warning! The displayed data have not been validated by the university.

Citations
  • PMC: n/a
  • Scopus: 4
  • Web of Science: 2