Abstract
Imputation is a key step in Electronic Health Records-mining as it can significantly affect the conclusions derived from the downstream analysis. There are three main categories that explain the missingness in clinical settings–incompleteness, inconsistency, and inaccuracy–and these can capture a variety of situations: the patient did not seek treatment, the health care provider did not enter the information, etc. We used EHR data from patients diagnosed with Inflammatory Bowel Disease from Geisinger Health System to design a novel imputation that focuses on a complex phenotype. Our approach is based on latent-based analysis integrated with clustering to group patients based on their comorbidities before imputation. IBD is a chronic illness of unclear etiology and without a complete cure. We have taken advantage of the complexity of IBD to pre-process the EHR data of 10,498 IBD patients and show that imputation can be improved using shared latent comorbidities. The R code and sample simulated input data will be available at a future time.
I. Introduction
Given the complexity and high-dimensionality of Electronic Health Records (EHR) the missingness and the need for imputation are an inevitable aspect in any study that attempts to use such data for downstream analysis. The EHR is not designed for research purposes, even though the breadth and depth of this information can be used to improve care at many levels, including developing predictive models of prognosis and designing in silico clinical trials [1-3]. Furthermore, the level and extent of the missing values in health care systems at large are not random. In fact, there are three main categories that explain the missingness in clinical settings [4, 5] – incompleteness, inconsistency, and inaccuracy – and these can capture a variety of situations, including the following: the patient could have been cared for outside of the health care system where the data are collected, the patient did not seek treatment, the health care provider did not enter the information, the patient has expired, the missing value was not needed for standard of care. Therefore, imputation can dramatically lead to biased results. Furthermore, excluding variables or patients with high-level of missingness can also introduce bias and reduce the scope of the study significantly. To address these challenges, different imputation techniques have been proposed, and each have their own advantages and limitations.
In a recent study, 12 different imputation techniques that were applied to laboratory measures from EHR were compared. In general, authors found that Multivariate Imputation by Chained Equations (MICE) and softImpute consistently imputed missing values with low error [6]; however, in that study, analysis was restricted to 28 most commonly available variables. In another study, authors assessed the different causes of missing data in the EHR data and identified these causes to be the source of unintentional bias [7]. Comparative analysis of three methods of imputation (a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average for DNA microarrays showed that in general KNN and SVD methods surpass the commonly accepted solutions of filling missing values with zeros or row average [8]. However, comparing imputation for clinical data with DNA microarray can be misleading. Missingness in DNA microarray is likely missing-at-random due to technical issues unlike missingness in EHR which is missing-not-at-random. In another study, fuzzy clustering was integrated with neural network to enhance the imputation process [9]. Furthermore, research has been done to evaluate imputation methods for non-normal data [10]. Using simulated data from a range of non-normal distributions and a level of missingness of 50% (missing completely at random or missing at random), it was found that the linearity between variables could be used to determine the need for transformation for non-normal variables. In the case of a linear relationship, transformation can introduce bias, while non-linear relationship between variables may require adequate transformation to accurately capture the non-linearity. Furthermore, many of the techniques are optimized for smaller levels of missingness (the most commonly available measurements), yet most clinical datasets (including the EHRs) have a significant percentage of missingness for many of their variables. To address this problem, machine learning methods have also been proposed [11]. In fact, the challenges of imputation for EHRs are unique, and if left unaddressed the utility of the data becomes limited [12]. Consequently, even though, for smaller targeted studies, it could be possible to integrate additional modalities, or perform analytical evaluation through chart review, to determine a likely cause of missingness; for larger studies this becomes infeasible. For examples, missingness level for many of very important variables (ex: inflammatory markers such as CRP, vitamin levels, or even A1C levels, a common biomarker for diabetes) can easily reach 50% or more in many realistic large datasets. Finally, given the complexity and the scale of the problem, in many studies, MICE [13] remains the method of choice. The MICE [13] algorithm is regression-based and assumes that the non-missing variables can be used for the predicting missing variables. However, this assumption does not always hold in EHR, yet, given the high-level of redundancy and presence of highly correlated variables in the EHR, imputation by MICE still performs relatively well for large clinical datasets. A comprehensive overview of handling missing data in EHR is presented in [12].
In this study, we have developed a novel imputation strategy based on latent-based analysis integrated with clustering to group patients based on their comorbidities before imputation. We apply this method on 10,498 patients diagnosed with Inflammatory Bowel Disease (IBD) and show its strength in analyzing laboratory measures with missingness levels up to 95%.
II. Methodology
Dataset
IBD is a complex disease with two main manifestations: Ulcerative Colitis (UC) and Crohn’s Disease (CD). Both UC and CD are highly significant public health problem in the U.S. and worldwide due to increasing incidence, morbidity, and mortality [14]. Moreover, IBD is a complex disease with no clear etiology or associated risk factors. Utilization of EHR can help facilitate better understanding this condition and continuous improvement in the design of treatment strategies for a more personalized care path.
Data Extraction and Processing
We have identified an IBD cohort from the EHR of Geisinger Health System. Inclusion criteria of this cohort were based on extraction of patient population based on diagnosis recorded for patients under their visits, admissions and currently active problems listed under problem list, based on ICD9 and ICD10 codes for CD and UC (see Table 1). In order to confirm this diagnosis on patients, qualifying criteria included either two or more outpatient encounters, or one or more inpatient admissions, or an entry into problem list with an active flag checked. We have extracted clinical laboratory measurements for this cohort using the Logical Observation Identifiers Names and Codes (LOINC) system (see Table S1). For comorbidities, we extracted all the diagnosis for all the patients based on the ICD9 as well as ICD10 codes. Comorbidity data included details from visits, admissions, and problem list. We identified 10,498 patients with both comorbidity and laboratory data in the EHR. A total of 129 laboratory values, and 976 ICD codes (using 2-digit roll-up) were used. We excluded laboratory values with more than 95% missingness.
Two versions of the laboratory values were created. In the first version, the actual laboratory values (continuous values) were kept and the median values for each measurement were calculated. In the second version, all the laboratory values were converted to binary values using the flag in the system prior to calculating the median values. For instance, if a value was above or below the threshold (abnormal flag), then that value was converted to one; while for a value within the normal range, a zero was recorded (see Fig. 1). As a pre-processing step, any value that deviated more than 3 standard deviation from the mean was removed. Furthermore, a 0.1% white noise (error noise) was added to the binary values. Four different datasets were created, each with a limit on the level of missingness, ranging from low (5%), to moderation (50% and 75%), to extreme (95%).
A binary comorbidity matrix for our cohort was also created using data from EHR. Data from primary or secondary diagnosis, as well as problem list were used for this purpose. During the first phase, ICD9/10 codes were compiled for each patient. During the second phase, the matrix was compressed using 3-digit roll up strategy, where ICD codes were combined if the first three digits were identical (ex: 289.52 and 289.53 --> 298). Furthermore, an ICD9/10 code was removed from the patient’s record if it was referenced only once. The resulting matrix was then converted to binary to represent the presence or absence of an ICD code for each patient. During the final stage, singular value decomposition (SVD) was applied to the matrix (Eq. 1) to compute the encoding of the dataset. Where APT_ICD is the matrix encompassing all the ICD9/10 codes (presence of absence) for all the patients; U is an mxm matrix, S is an mxn diagonal matrix, and V is an nxn matrix. The colums of V are eigenvectors of ATA, and colums of U are eigenvectors of AAT. The diagonal elements of S are the square root of the eigenvalues of ATA or AAT.
The encoding matrix was then used to create different level of data abstraction by retaining only 10, 15, 66, or 85% of the encoding using dimensionality reduction technique (Eq. 2). Note that the approximation matrix is referred to as the data abstraction. The finalized output is referred to as latent comorbidities. Figure 2 summarize these steps. Where g is the level of abstraction (10%, 15%, 66%, and 85%), corresponding to the level of reduced matrices. APT_ICD_g is an approximation of the initial matrix (APT_ICD).
Imputation Strategies
Latent-based model
Our imputation method is a hybrid method, referred to as “latent-based model", that is based upon concurrently applying dimensionality reduction and clustering strategy. This hybrid method, efficiently captures relationships among features (or variables) and reduces noise (through dimensionality reduction) while providing an adaptive mechanism to perform imputation for any complex phenotype or trait. Using latent comorbidity data, patients were clustered using k-mean clustering technique. The data was clustered into 2, 4, 8, 16, 32 and 64 clusters. Imputation, using MICE, was then applied to each sub-group independently to predict the missing values. Furthermore, different variations in implementation were also explored.
Latent-based model with reduced sparsity
using the information from the singular value decomposition, the comorbidity data were evaluated for sparsity. The SVD captures the encoding of the matrix and insignificant ICD9/10 codes are transformed to values close to zero. Therefore, removing ICD9/10 codes corresponding to low values will reduce the sparsity of the matrix without eliminating important information. Three levels were assessed and number of disease codes (ICD9/10) was reduced to build three different datasets with varying level of sparsity. High sparsity is referred to when the sum of all the values for a ICD9/10 code in latent comorbidity matrix is less than 1 (that is the final matrix is very sparse by including the majority of the ICD9/10 codes), medium when the sum is less than 10 and low sparsity is when the sum is below 100 (only highly used ICD9/10 codes are included in the dataset). This strategy will also reduce noise, while also decreasing the size of the matrix, which will slightly reduce the computational complexity.
Feature expansion
a feature-expanded model was developed to evaluate if addition of comorbidity data as added features to the lab measurements could enhance the imputation performance as compared to the implementation of the hybrid model. In this situation, the matrix with the binary laboratory measures was expanded with the addition of comorbidity matrix (see Fig 3). The MICE imputation was then performed to predict the values of missing lab measurements.
Evaluation Strategy
Model evaluation was performed by randomly selecting variables and predicting them using the different strategies. To ensure fair distribution of random values that were withheld for testing before imputation, the data were split into bins based on missingness. The data were split into 2 bins in case of 5% missingness and 4 bins in case of 50%, 75% and 95% missingness. Equal number of values were withheld from each of the bins. Probability is low for selecting values from labs with high sparsity, thus sampling from bins ensures random values chosen to be withheld for testing are picked from all the levels of missingness. Each run was then repeated 10 times and the P-value was calculated using standard t-test statistics. The root mean square error (RMSE) was also calculated and averaged over the 10 runs. Comparison was based on calculating the difference between running imputation using the latent-based model and its’ various derivations techniques, and the standard MICE algorithm.
III. Results
A rich clinical dataset for Inflammatory Bowel Disease facilitates the development of imputation techniques targeted for complex phenotypes
We have identified 10,498 IBD patients from the Geisinger EHRs, with rich longitudinal data, spanning an average of nine years (time between first time diagnosed and current age) with over 10% of the IBD patients having over 15 years of EHR data in the system capturing their treatment experience and co-morbidities. Furthermore, only less than 20% of our IBD patients have less than five years of clinical data for analysis, making this dataset rich. In general, we observed, that for a very small set of the variables, the percentage of missingness was very small; however, the majority of the laboratory values had a relatively larger number of missing level. Fig. 4 highlights this trend. These observations further motivate the need for development of a method that is robust for imputation of sparse datasets.
Comorbidity can improve imputation for complex phenotypes
Our novel method when compared with standard MICE algorithm shows significant improvement for variables with up to 95% missingness (Fig 5), even the comparison was disadvantageous for the hybrid method since the number of patients was larger when the data was not clustered. The goal of this study was to identify a combination of model parameters (number of clusters, data abstraction, and sparsity level), such that the imputation can be optimized for a given level of missingness in the data extracted from EHR. For instance, if low level of missingness is what is required, then for this specific dataset, optimal imputation can be performed by finding optimal region with that setting, considering both improved performance and significance level (P-value). However, if more lab measurements are of interest with possibly reaching a 75% missingness for some lab values, then panel A3,7,11 can be compared with their respective B3,7,11 P-value levels (see Fig. 5). Furthermore, if binary values are of interest as opposed to actual values, then one may consider evaluating sub-panels C1-12 and their corresponding p-values (D1-12). This strategy demonstrates the need to better understand the scope of the imputation technique prior to estimating the missing values from large and complex dataset retrieved from EHRs.
Increasing the feature space deteriorates the performance of imputation
By considering the feature expansion model alternative, we have increased the dimensionality of our feature space from 129 features (lab values) to about 1,000 features (129 labs + 976 ICD9 codes) for a fix population of 10K+ IBD patients. This alternative proved to be outperforming slightly the standard MICE, but for only the 5% missingness level (see Fig. 6). For higher level of missingness, the extended model did not outperform the standard MICE with the 129 variables.
IV. Discussion and Conclusion
The use of heterogeneous and large-scale clinical datasets, such as EHRs, provide an avenue for exploration of strategies to improve care at individualized levels, which include developing personalized models of response to therapy for complex diseases such as IBD. However, the data extracted from EHRs is noisy and have a large number of missing values, which requires a robust imputation strategy designed for large dataset of heterogeneous population. For realistic applications, It is not recommended to solely rely on redundancy of EHR data to conduct imputation, as our imputation strategy suggests. Our imputation strategy attempts to address this critical challenge using the IBD as a case study.
The IBD population is heterogeneous and a clear understanding of its risk factors is still lacking; therefore, treatment plan is usually still designed based on the average patient. Recent advances in knowledge of IBD pathogenesis have led to the implication of a complex interplay between metabolic reprogramming and immunity [15]. Furthermore, response to treatment in IBD varies significantly among individuals and disease subtypes based on demographic characteristics, diet, comorbidities, underlying immunological factors, and genetic polymorphisms. Thus, there is an urgent unmet clinical need to replace current approaches with personalized strategies that consider individual variability and diversity, and clinical data, if properly used, can provide a valuable resource in this field.
There has been a large number of studies developing novel imputation methods, and various studies assessing the use of different methods for large clinical studies. However, these methods do not tend to focus on a disease and more importantly on a complex disease or trait, rather they focus on the scale of the problem (number of cases, number of variables). It is only in some instances that studies are designed to investigate the possible correlation between variables and the effect of such correlation on the imputation outcome. However, majority of imputation methods for large clinical datasets (such as EHRs) will be used in a well-defined context, such as a specific disease or disease category (ex: infectious diseases, immune diseases, cancer, or cardiovascular diseases). In general, in a given study, the focus is rarely on all diseases at the same time, therefore, methods developed for imputation should also be designed with that in mind. Easy access to EHRs from large hospital systems could be one of the major limiting factors in the lack of development of novel imputation techniques for data from EHRs.
In this study, we have developed an intuitive method that can be applied for predicting missing values for complex diseases from EHRs. We have used the EHR s from Geisinger Health System, a pioneering institution in personalized and precision medicine. Geisinger has a national reputation for both high quality patient care and for being an innovative, integrated health care delivery system. This is especially true around electronic health information, being an early adopter of EPIC (1996) and building an enterprise-wide clinical data warehouse that contains comprehensive clinical and insurance claims data. We have reported demographic and clinical characteristics of the active patients in a recent publication [16].
The method presented here is an intuitive approach for any given complex disease, where bio-signatures or risk factors are only partially known and the relationship among the variables can be convoluted given the large dimensionality of the dataset. Furthermore, level of missingness can be high, even though the best results are typically obtained when the level of missingness is low or moderate. However, given the disease or condition of interest, the investigator will have the ability to include variables that may be crucial for the study. For instance, in the case of IBD, key variables of interest include CRP (missing at 56%), sedimentation rate (missing at 33%) which are not among the common variables in our dataset. Therefore, selecting the most commonly used variables will exclude important parameters for an investigation into the IBD cohort.
Using this approach, we were able to predict missingness of laboratory values that had missing level up to 95%. Our method was able to better predict the missing values than the standard algorithm. Furthermore, we have also shown that by simply integrating more features into the model and building very high-dimensional dataset (with about 1000 features) may not be an appropriate strategy, at least for a cohort of 10,000+ patients. Therefore, the hybrid-based strategy may be a better alternative in most situations.
Even though we had access to a large cohort of IBD population, our dataset was still limited if compared to more common conditions such as Diabetes with over 80,000 patients in our Geisinger cohort, or our COPD cohort of >49,000 patients. For those cases, more clusters could be formed to further improve the prediction outcome. Furthermore, we have only imputed median values; however, given a disease such as IBD, evaluating laboratory measures during the various stages of the disease may be more useful and also more predictive of the disease progression and relapse. However, imputing laboratory measures for each encounter is challenging at many levels, one such challenge is the significantly higher level of missingness in the dataset. As a future direction, we will investigate how to best impute longitudinal laboratory measures to better inform clinical studies. In addition, we will also explore integrating additional features, such as demographic information, age, gender, medication usage, as well as genetic information when available to further enhance the imputation outcome. Finally, we will evaluate various pre-processing and normalization strategies and evaluate if these manipulations can improve the outcome of our predictions, especially for variables with non-normal distributions. At last, even though MICE is one of the most popular imputation technique for EHRs, it would be interesting to compare our method with other algorithms.
To conclude, we have optimized the level of abstraction needed to improve the imputation for our IBD cohort of 10,498 patients. Our novel method when compared with standard “multiple imputation using chained equations” (MICE) shows significant improvement of up to 49% for variables with up to 95% missingness. We have shown that imputation can be improved, using shared latent comorbidities, and can facilitate generating a robust EHR dataset for further mining and modeling.
Funding
This work was supported by Geisinger Health System.