A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets

IRIS

The missing data mechanism is a relevant problem in Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets comprise several missing values, thus revealing a high level of spatiotemporal sparsity in the predictors' matrix. Several approaches in the state-of-the-art tried to deal with this problem by proposing different data imputation strategies that (i) are often unrelated to the ML model, (ii) are not conceived for EHR data where laboratory exams are not prescribed uniformly over time and percentage of missing values is high (iii) exploit only univariate and linear information on the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN data imputation-based approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputing strategy to the observable values and those fully-annotated. We demonstrated the statistical significance of the ccGAN to other state-of-the-art approaches in terms of imputation (around 19.79% of gain to the best competitor) and predictive performance (up to 1.60% of gain to the best competitor) on a real multi-diabetic centers dataset. We also demonstrated its robustness across different missingness rates (up to 1.61% of gain to the best competitor in the highest missingness rates condition) on an additional benchmark EHR dataset.

A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets / Bernardini, Michele; Doinychko, Anastasiia; Romeo, Luca; Frontoni, Emanuele; Amini, Massih-Reza. - In: COMPUTERS IN BIOLOGY AND MEDICINE. - ISSN 0010-4825. - 163:(2023). [10.1016/j.compbiomed.2023.107188]

A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets

Bernardini, Michele;Doinychko, Anastasiia;Romeo, Luca;Frontoni, Emanuele;Amini, Massih-Reza

2023-01-01

Abstract

The missing data mechanism is a relevant problem in Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets comprise several missing values, thus revealing a high level of spatiotemporal sparsity in the predictors' matrix. Several approaches in the state-of-the-art tried to deal with this problem by proposing different data imputation strategies that (i) are often unrelated to the ML model, (ii) are not conceived for EHR data where laboratory exams are not prescribed uniformly over time and percentage of missing values is high (iii) exploit only univariate and linear information on the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN data imputation-based approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputing strategy to the observable values and those fully-annotated. We demonstrated the statistical significance of the ccGAN to other state-of-the-art approaches in terms of imputation (around 19.79% of gain to the best competitor) and predictive performance (up to 1.60% of gain to the best competitor) on a real multi-diabetic centers dataset. We also demonstrated its robustness across different missingness rates (up to 1.61% of gain to the best competitor in the highest missingness rates condition) on an additional benchmark EHR dataset.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2023
			
	Rivista su cui è pubblicata l'opera
	
				COMPUTERS IN BIOLOGY AND MEDICINE
			
	Codice DOI
	
				https://dx.doi.org/10.1016/j.compbiomed.2023.107188
			
	Parole chiave
	
				Data imputation; Electronic Health Record; Generative Adversarial Network; Machine Learning; Predictive medicine
			
	Appare nelle tipologie:
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
1-s2.0-S0010482523006534-main.pdf accesso aperto Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore) Licenza d'uso: Creative commons Dimensione 2.21 MB Formato Adobe PDF Visualizza/Apri	2.21 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11566/319512

Citazioni

ND

25

21

social impact