Cornell, S.; Wang, Z.-Q.; Masuyama, Y.; Watanabe, S.; Pariente, M.; Ono, N.; Squartini, S. (2023). Multi-Channel Speaker Extraction with Adversarial Training: The Wavlab Submission to The Clarity ICASSP 2023 Grand Challenge. Paper presented at the 48th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023), Rodos Palace Luxury Convention Resort, Greece, 2023. [10.1109/ICASSP49357.2023.10095961]
Multi-Channel Speaker Extraction with Adversarial Training: The Wavlab Submission to The Clarity ICASSP 2023 Grand Challenge
Cornell, S.; Squartini, S.
2023-01-01
Abstract
In this work we detail our submission to the Clarity ICASSP 2023 Grand Challenge, in which participants must develop a strong target-speech enhancement system for hearing-aid (HA) devices in noisy-reverberant environments. Our system builds on our submission to the Second Clarity Enhancement Challenge (CEC2), iNeuBe-X, which consists of an iterative neural/conventional beamforming enhancement pipeline guided by an enrollment utterance from the target speaker. This model, which won the CEC2 by a large margin, is an extension of the state-of-the-art TF-GridNet model for multi-channel, streamable, target-speaker speech enhancement. Here, we extend and further improve this approach by leveraging generative adversarial training, which we show to be especially useful when training data is limited. Using only the official 6,000 training scenes, our best model achieves a hearing-aid speech perception index (HASPI) score of 0.80 and a hearing-aid speech quality index (HASQI) score of 0.41 on the synthetic evaluation set. However, our model generalizes poorly on the semi-real evaluation set. This highlights that our community should focus more on real-world evaluation and less on fully synthetic datasets.
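The abstract credits generative adversarial training for the gains observed in the low-data regime. As a rough illustration of that generic technique (not the authors' iNeuBe-X implementation, which is not described here), the PyTorch sketch below adds an adversarial loss to a standard reconstruction objective: a small discriminator learns to separate clean references from enhanced outputs, and the enhancement model is rewarded for fooling it. The `Enhancer` placeholder, the tiny discriminator architecture, and the loss weighting are all hypothetical choices made for illustration.

```python
# Minimal sketch of adversarial fine-tuning for a speech enhancement model.
# NOTE: all names (enhancer, Discriminator, adv_weight) are hypothetical
# placeholders; this is NOT the paper's system, only the generic technique.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    """Tiny 1-D conv discriminator over raw waveforms (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, kernel_size=15, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=3),
        )

    def forward(self, wav):                  # wav: (batch, samples)
        return self.net(wav.unsqueeze(1))    # (batch, 1, frames) of logits


def train_step(enhancer, disc, opt_g, opt_d, noisy, clean, adv_weight=0.1):
    # Discriminator update: real = clean reference, fake = enhanced output.
    enhanced = enhancer(noisy).detach()      # detach so only disc is updated
    d_real, d_fake = disc(clean), disc(enhanced)
    loss_d = (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
        + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    )
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: pointwise reconstruction loss + adversarial loss.
    enhanced = enhancer(noisy)
    d_fake = disc(enhanced)
    loss_rec = F.l1_loss(enhanced, clean)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_g = loss_rec + adv_weight * loss_adv
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

The intuition behind the abstract's claim is visible in the sketch: with only 6,000 training scenes, the adversarial term acts as a learned regularizer, pushing outputs toward the distribution of clean speech rather than merely minimizing a pointwise loss on the limited paired data.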