
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition / Cornell, Samuele; Jung, Jee-Weon; Watanabe, Shinji; Squartini, Stefano. - (2024), pp. 11856-11860. (Paper presented at the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, held in Seoul, Korea, 14-19 April 2024) [10.1109/icassp48485.2024.10447957].

One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Cornell, Samuele; Squartini, Stefano
2024-01-01

Abstract

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and handle any number of speakers, effectively solving “who spoke what, and when” concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and “Whisper-style” prompting. The local outputs are then combined to obtain the final SD+ASR result by clustering the speaker embeddings into global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
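The abstract's final stitching step — merging per-window local speaker labels into global identities by clustering their embeddings — can be illustrated with a minimal sketch. This is not the paper's code: the greedy average-linkage clustering, the cosine-distance threshold, and the function name are all illustrative assumptions; the paper does not specify this exact procedure.

```python
# Hypothetical sketch (not the paper's implementation): assign global speaker
# IDs to per-window local speaker embeddings via greedy online clustering
# with a cosine-distance threshold.
import numpy as np

def cluster_speakers(embeddings, threshold=0.5):
    """Cluster (n, d) speaker embeddings; returns one global ID per row.

    Each row is the embedding of one local speaker from one window.
    A new embedding joins the nearest existing cluster if its cosine
    distance to that cluster's mean is below `threshold` (assumed value),
    otherwise it starts a new global speaker.
    """
    # L2-normalize so cosine similarity reduces to a dot product.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids, counts, labels = [], [], []
    for x in X:
        if centroids:
            sims = [c @ x / np.linalg.norm(c) for c in centroids]
            best = int(np.argmax(sims))
            if 1.0 - sims[best] < threshold:
                # Merge into existing global speaker; update running mean.
                centroids[best] = (centroids[best] * counts[best] + x) / (counts[best] + 1)
                counts[best] += 1
                labels.append(best)
                continue
        # Too far from every existing cluster: new global speaker.
        centroids.append(x.copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels
```

For example, two nearly parallel embeddings followed by two nearly orthogonal ones yield two global speakers: `cluster_speakers(np.array([[1., 0.], [0.99, 0.1], [0., 1.], [0.05, 0.98]]))` returns `[0, 0, 1, 1]`.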
2024
ISBN: 979-8-3503-4485-1, 979-8-3503-4486-8
Files in this record:

One_Model_to_Rule_Them_All__Towards_End-to-End_Joint_Speaker_Diarization_and_Speech_Recognition.pdf

Access: Archive administrators only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/337392
Citations
  • PubMed Central: n/a
  • Scopus: 7
  • Web of Science: 2