Speaker identification plays a crucial role in biometric person identification as systems based on human speech are increasingly used for the recognition of people. Mel frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture the speech-specific characteristics with a reduced dimensionality. However, although their ability to decorrelate the vocal source and the vocal tract filter make them suitable for speech recognition, they greatly mitigate the speaker variability, a specific characteristic that distinguishes different speakers. This paper presents a theoretical framework and an experimental evaluation showing that reducing the dimension of features by applying the discrete Karhunen-Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance compared to conventional MFCC features. In particular with short sequences of speech frames, with typical duration of less than 2 s, the performance of truncated DKLT representation achieved for the identification of five speakers are always better than those achieved with the MFCCs for the experiments we performed. Additionally, the framework was tested on up to 100 TIMIT speakers with sequences of less than 3.5 s showing very good recognition capabilities.
An Investigation on the Accuracy of Truncated DKLT Representation for Speaker Identification With Short Sequences of Speech Frames / Biagetti, Giorgio; Crippa, Paolo; Falaschetti, Laura; Orcioni, Simone; Turchetti, Claudio. - In: IEEE TRANSACTIONS ON CYBERNETICS. - ISSN 2168-2267. - 47:12(2017), pp. 4235-4249. [10.1109/TCYB.2016.2603146]