
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition / Cornell, Samuele; Jung, Jee-Weon; Watanabe, Shinji; Squartini, Stefano. - (2024), pp. 11856-11860. (Paper presented at the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, held in Seoul, Korea, 14-19 April 2024) [10.1109/icassp48485.2024.10447957].

One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Cornell, Samuele; Squartini, Stefano
2024-01-01

Abstract

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and handle any number of speakers, effectively solving “who spoke what, and when” concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and “Whisper-style” prompting. The local outputs are then combined to obtain the final SD+ASR result by clustering the speaker embeddings into global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
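The abstract's final stitching step — merging per-window local speaker labels into global identities by clustering their embeddings — can be illustrated with a minimal sketch. This is not the paper's code: the greedy average-linkage clustering, the cosine-distance threshold, and the function name are all illustrative assumptions; the paper does not specify this exact procedure.

```python
# Hypothetical sketch (not the paper's implementation): assign global speaker
# IDs to per-window local speaker embeddings via greedy online clustering
# with a cosine-distance threshold.
import numpy as np

def cluster_speakers(embeddings, threshold=0.5):
    """Cluster (n, d) speaker embeddings; returns one global ID per row.

    Each row is the embedding of one local speaker from one window.
    A new embedding joins the nearest existing cluster if its cosine
    distance to that cluster's mean is below `threshold` (assumed value),
    otherwise it starts a new global speaker.
    """
    # L2-normalize so cosine similarity reduces to a dot product.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids, counts, labels = [], [], []
    for x in X:
        if centroids:
            sims = [c @ x / np.linalg.norm(c) for c in centroids]
            best = int(np.argmax(sims))
            if 1.0 - sims[best] < threshold:
                # Merge into existing global speaker; update running mean.
                centroids[best] = (centroids[best] * counts[best] + x) / (counts[best] + 1)
                counts[best] += 1
                labels.append(best)
                continue
        # Too far from every existing cluster: new global speaker.
        centroids.append(x.copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels
```

For example, two nearly parallel embeddings followed by two nearly orthogonal ones yield two global speakers: `cluster_speakers(np.array([[1., 0.], [0.99, 0.1], [0., 1.], [0.05, 0.98]]))` returns `[0, 0, 1, 1]`.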
2024
ISBN: 979-8-3503-4485-1, 979-8-3503-4486-8
Files in this record:

One_Model_to_Rule_Them_All__Towards_End-to-End_Joint_Speaker_Diarization_and_Speech_Recognition.pdf

Access: Archive administrators only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/337392
Citations
  • PubMed Central: n/a
  • Scopus: 7
  • Web of Science: 2