Abstract
Imputation is a key step in Electronic Health Records-mining as it can significantly affect the conclusions derived from the downstream analysis. There are three main categories that explain the missingness in clinical settings–incompleteness, inconsistency, and inaccuracy–and these can capture a variety of situations: the patient did not seek treatment, the health care provider did not enter the information, etc. We used EHR data from patients diagnosed with Inflammatory Bowel Disease from Geisinger Health System to design a novel imputation that focuses on a complex phenotype. Our approach is based on latent-based analysis integrated with clustering to group patients based on their comorbidities before imputation. IBD is a chronic illness of unclear etiology and without a complete cure. We have taken advantage of the complexity of IBD to pre-process the EHR data of 10,498 IBD patients and show that imputation can be improved using shared latent comorbidities. The R code and sample simulated input data will be available at a future time.