One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition / Cornell, Samuele; Jung, Jee-Weon; Watanabe, Shinji; Squartini, Stefano. - (2024), pp. 11856-11860. (Paper presented at the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, held in Seoul, Korea, 14-19 April 2024) [10.1109/icassp48485.2024.10447957].
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition
Cornell, Samuele; Squartini, Stefano
2024
Abstract
This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined to obtain the final SD+ASR result by clustering the speaker embeddings into global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
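The abstract outlines the full inference pipeline: slide a window over the recording, run the E2E DAST model on each window to get local transcripts and speaker embeddings, then cluster all embeddings to assign global speaker identities. Below is a minimal Python sketch of that sliding-window-plus-clustering loop; the `e2e_dast` callable and its interface, the window/hop sizes, and the agglomerative-clustering settings are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the SLIDAR inference loop described in the abstract.
# `e2e_dast` stands in for the paper's E2E DAST model; its interface here
# (segments with window-local speaker labels, plus one embedding per local
# speaker) is an assumption for illustration.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def slidar_inference(audio, e2e_dast, sr=16000, win_s=10.0, hop_s=5.0):
    """Sliding-window SD+ASR: local decoding, then global speaker clustering."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    segments, embeddings = [], []
    for start in range(0, max(1, len(audio) - win + 1), hop):
        window = audio[start : start + win]
        # Assumed interface: local segments carry a window-local speaker
        # index `local_spk` that points into `local_embs`.
        local_segs, local_embs = e2e_dast(window, offset_s=start / sr)
        for seg in local_segs:
            # Map the window-local speaker to a global embedding index
            # (offset by the embeddings collected from earlier windows).
            seg["emb_idx"] = len(embeddings) + seg["local_spk"]
            segments.append(seg)
        embeddings.extend(local_embs)
    # Cluster all local speaker embeddings into global identities. The
    # abstract only says "clustering"; a cosine-distance agglomerative
    # clustering with a fixed threshold is a common choice, assumed here.
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.7,
        metric="cosine",
        linkage="average",
    ).fit_predict(np.stack(embeddings))
    for seg in segments:
        seg["speaker"] = int(labels[seg["emb_idx"]])
    return segments
```

Overlapping windows (hop smaller than window) mean the same speaker is typically seen by several windows, so their per-window embeddings fall into one cluster and the window-local labels can be stitched into consistent global identities.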
File: One_Model_to_Rule_Them_All__Towards_End-to-End_Joint_Speaker_Diarization_and_Speech_Recognition.pdf
Type: Publisher's version (published with the publisher's layout)
License: All rights reserved
Size: 1.01 MB
Format: Adobe PDF
Access: Restricted (archive managers only; copy available on request)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.