An Evaluation Study on Speech Feature Densities for Bayesian Estimation in Robust ASR / Cifani, S.; Principi, Emanuele; Rotili, R.; Squartini, Stefano; Piazza, Francesco. - LNCS, vol. 6456 (2011), pp. 283-297. [10.1007/978-3-642-18184-9_23]

An Evaluation Study on Speech Feature Densities for Bayesian Estimation in Robust ASR

Principi, Emanuele; Squartini, Stefano; Piazza, Francesco
2011-01-01

Abstract

Bayesian estimators, especially the Minimum Mean Square Error (MMSE) and the Maximum A Posteriori (MAP) estimators, are widely used to estimate clean speech STFT coefficients. Recently, the same approach has been successfully applied to speech feature enhancement for robust Automatic Speech/Speaker Recognition (ASR) applications, in the Mel, log-Mel, or cepstral domain. The quality of the estimate depends directly on the assumptions made about the densities of the noise and speech coefficients. However, while these densities have been studied exhaustively for STFT coefficients, speech features have not received comparable attention. In this paper, we study the distribution of Mel, log-Mel, and MFCC coefficients obtained from speech segments. The histograms of the speech features are first fitted to several pdf models and assessed by means of the Chi-Square Goodness-of-Fit test; they are then modeled using a Gaussian Mixture Model (GMM). Computer simulations show that, from this perspective, log-Mel and MFCC coefficients are a more convenient choice than Mel coefficients.
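
The histogram-fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the candidate pdf models, so the Gaussian and Laplacian candidates, the synthetic samples standing in for one feature dimension, the bin count, and the helper names `chi_square_stat` and `laplace_cdf` are all assumptions made for illustration. The sketch computes only the Pearson chi-square statistic (a smaller value indicates a closer fit) and leaves out the p-value and the GMM stage.

```python
import math
import random
from statistics import NormalDist, fmean, median, pstdev


def chi_square_stat(samples, cdf, n_bins=20):
    """Pearson chi-square statistic between a histogram and a model CDF."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    # Observed counts per equal-width bin.
    observed = [0] * n_bins
    for x in samples:
        k = min(int((x - lo) / width), n_bins - 1)
        observed[k] += 1
    n = len(samples)
    stat = 0.0
    for k in range(n_bins):
        # The two outer bins absorb the model tails, so expected counts sum to n.
        left = -math.inf if k == 0 else lo + k * width
        right = math.inf if k == n_bins - 1 else lo + (k + 1) * width
        expected = n * (cdf(right) - cdf(left))
        if expected > 0:
            stat += (observed[k] - expected) ** 2 / expected
    return stat


def laplace_cdf(mu, b):
    """Return the CDF of a Laplacian with location mu and scale b."""
    def cdf(x):
        if x < mu:
            return 0.5 * math.exp((x - mu) / b)
        return 1.0 - 0.5 * math.exp(-(x - mu) / b)
    return cdf


# Synthetic stand-in for one feature dimension (assumption: not real speech data).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Maximum-likelihood parameter estimates for each candidate model.
mu = fmean(samples)
sigma = pstdev(samples, mu)
med = median(samples)
b = fmean([abs(x - med) for x in samples])

gauss_stat = chi_square_stat(samples, NormalDist(mu, sigma).cdf)
laplace_stat = chi_square_stat(samples, laplace_cdf(med, b))
print(f"Gaussian: {gauss_stat:.1f}  Laplacian: {laplace_stat:.1f}")
```

Because the samples here are drawn from a Gaussian, the Gaussian candidate should yield the smaller statistic. With real features, sparsely populated bins would normally be merged before the statistic is computed, since the chi-square approximation is unreliable when expected counts are small.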
Year: 2011
Published in: Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues
ISBN: 9783642181832

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/42013