Abstract
The effects of inter-individual variability on disease treatment and prevention are important to the goals of “precision medicine” 1. In biomedical research, consideration of racial or ethnic differences allows generation and exploration of hypotheses about interactions among genetic and environmental factors responsible for differential medical outcomes. The US National Institutes of Health, therefore recommends adequate participation of subjects from ethnic minority groups in research studies. Nevertheless, considerable debate has focused on validity of race or ethnicity as biological construct 2. Inconsistent definition of race/ethnicity and insignificant genetic variations between ethnic groups have invited disregard to this construct 3. On the contrary, differences in prevalence, expression and outcomes of various diseases among ethnic groups argue for continued and focused attention to ethnicity as important predicting variable. In context of Alzheimer’s disease (AD), we have previously reported that ethnicity does moderates the proteomic markers of dementia 4. Here, we attempted to classify and predict selfreported ethnicity (Hispanic or non-Hispanic white, [NHW]) using a limited serum profile of 107 proteins. Random Forest (RF) classification method was able to discriminate those two ethnicities with 95% accuracy and could successfully predict ethnicity in an independent test-set (Area under ROC curve: 0.97). Variable selection method led to a condensed set of six proteins which yielded comparable classification and prediction accuracy. Our results provide preliminary evidence for proteomic variability between ethnic groups, and biological validity of ethnicity construct. Moreover, they also offer an opportunity to exploit these differences towards the objectives of precision medicine.
Main
A secondary objective of the Texas Alzheimer’s Research & Care Consortium (TARCC) project, an ongoing longitudinal cohort study aimed at examining the role of clinical and biological markers of AD, is to investigate ethnic variability with regard to clinical features and biomarkers. Majority of participants in TARCC belong to Hispanics (Mexican-Americans; MA) or non-Hispanic whites (NHW), two major ethnic groups of Texas. Like most research studies, ethnicity in TARCC was self-reported. Five hundred and thirty cognitively healthy elderly individuals (312 Hispanics and 218 NHWs) who were recruited as control subjects in TARCC study were used for this analysis. Two ethnic groups differed significantly in terms of age, education, prevalence of diabetes and obesity, and an “omnibus” dementia severity metric (i.e., “δ”) 4 (Table 1). Although nominally non-demented, these control subjects were recruited for a convenient sample study and two groups differed with regard to “δ“. After removing the variance related to available confounding variables, 107 serum proteins from baseline visit and RF algorithm classified two ethnicity classes in a training set (comprising 75% of total the sample) with 95% accuracy. In the test-set (the remaining 25% of total sample), this classification model provided good predictive accuracy with specificity and sensitivity of 93.10% and 97.33% respectively with area under Receiver Operating Characteristic (AUC-ROC) curve of 0.97. Twenty most discriminatory variables, identified by variable importance (VI) measure could predict ethnicity with AUC-ROC of 0.99, sensitivity of 97.33% and specificity of 93.10%. . Finally, in order to limit the number of classifying variables to facilitate future replicability, a variable selection method provided a set of six protein variables, which provided a comparable predictive accuracy with ROC AUC of 0.99, sensitivity of 94.87% and specificity of 94.55%. Results from all three models are summarized in Table 2.
These results support the potential validity and utility of ethnicity as a biological construct 5. Even though Hispanic group is a heterogeneous mixture of various ancestries 6 and not as mutually exclusive from NHWs as for example NHWs and east-Asians, serum protein’s ability to distinguish it from NHWs highlight importance of ethnicity in not only biomarker research but also disease pathophysiology. Previously, ethnicity related differences in expression of several blood-based biomarkers, including cytokines 7–9 have been reported from TARCC 10–13, Multi-Ethnic Study of Atherosclerosis (MESA) 14–21 and National Health and Nutrition Examination Survey (NHANES) 22–29. Combined with our results, these findings lay a common theme that supports our hypothesis that ethnicity does impact the circulating levels of proteins. The mechanism(s) that mediate those differences remain unclear. In the simplest application, the biomarkers of ethnicity can be viewed as biological parameters that are themselves significantly correlated with an ethnicity and not necessarily responsible for ethnic groupings. Therefore, the resulting protein profile may not be the biological determinants of ethnicity per se but a reflection instead of cross-ethnic differences in lifestyle or disease processes. These differences among ethnic groups may be responsible for disparate disease-susceptibility, treatment response and/or outcomes. Our results strongly emphasize the importance of self-reported ethnicity information in research and underlines the need for validation of research results across ethnicities.
Multiple prior attempts to measure the correspondence between self-reported ethnicity class and genetic clusters have provided mixed results 2,3,30,31. Human genome sequencing project and related studies concluded that ethnicity is a weak surrogate of various genetic and non-genetic factors in correlations with health status 32. In fact, individuals from different ethnicity were found to be genetically similar than individuals from their own ethnicity 3,33. Surprisingly, a limited proteomic assay which was not pre-selected to predict ethnicity or to develop ethnicity specific traits, could precisely differentiate & predict ethnicity. Complete human proteome may provide greater precision and possibly provide ethnicity specific proteomic compositions. Our results therefore emphasize importance of proteomic research and endorse the hypothesis that proteomics is not just the new genomics 34 and include phenotypic information from protein posttranslational modifications, protein interactions, metabolite abundance, and, protein stoichiometry 35.
Classification models using a reduced set of the six most important differentiating proteins resulted in comparable predictive accuracy. These six proteins were-Connective tissue growth factor (CTGF), interferon (IFN) gamma, interleukin (IL)-12p40, IL-13, transforming growth factor (TGF) alpha, and thrombospondin 1 (THBS1). Relative expression of each of these six proteins is shown in fig 3. IL13 levels were relatively higher in Hispanic group, although not statistically significant (p=0.12). All other proteins were over-expressed in NHW group, levels were statistically significant with exception of IL12p40 (p=0.83). These reflect mean differences, however, Figure 3 also suggests clear cross-group differences in these proteins’ distributions. Protein-protein interactions (PPI) network of identified proteins indicated that these selected proteins were enriched for various processes including regulation of protein phosphorylation, cell activation and intracellular signal transduction (Table 3). CTGF and TGF-alpha are epidermal growth factor receptor (EGFR; a member of tyrosine kinase family) ligands which controls cellular functions during development and homeostasis through protein phosphorylation 36. EGFRs have been found to be dysregulated in several epithelial cancers like lung, colon etc and known to vary among ethnic groups 37,38. Among these epithelial cancers, ethnicity influence response to treatment and mortality. For example, response to EGFR tyrosine kinase inhibitors (e.g. geftinib) treatments in lung cancer can be predicted by ECFR ligands levels in blood and ethnicity 39. Among ethnicities, Hispanic and Asian groups respond better to these treatments compared to NHWs 39. Lower levels of EGFR ligands like CTGF & TGF-alpha may play a role in comparatively better response to EGFR inhibitors in Hispanics. Our results do not provide any direct evidence to this hypothesis but support the arguments for further investigations into this direction. If serum levels of ligands like TGF alpha can predict response to treatments, we can anticipate rapid development in precision medicine based on serum proteomic assays. Also, if EGFR activity is proven to be significantly different among ethnic groups, ethnicity can provide individual variability to patients which is key to precision medicine.
All six selected proteins have been previously reported to associate with dementia phenotype “δ” in ethnicity-adjusted models from TARCC cohort 41,42. Details about concept and utility of δ is can be obtained from past publications 43–45,46. There are cross-ethnic differences in the δ-scores of non-demented TARCC participants, and in δ’s serum protein biomarkers 12. Although the reported results were adjusted for δ, the appearance of δ-specific biomarkers among the most discriminating set of ethnicity related biomarkers may suggest cross-ethnic differences in general intelligence.
Results of this study are also significant in context of “Hispanic paradox”, an epidemiological paradox observed in southwestern states of US including Texas 40. Socioeconomically, Hispanics were disadvantaged and resembled African-Americans but their health status was comparable to NHWs. Precise biological basis to this phenomenon is unclear and several theories have been proposed to explain this paradox i.e. Salmon bias effect, selective migration, better social support etc. Differential response to EGFR inhibitors in lung cancer and expression of EGFR ligands in Hispanics is probably an example of biological factors which offset the impact of socio-economic adversities. Our study provides merely a direction for further research focused on ethnicity specific biological differences.
Our analysis has notable limitations. Even though we included a relatively large sample size and more than 100 serum proteins, quality control measures to remove batch effects, and adjustment for demographic and clinical factors, these results require validation in another independent sample. A complete assay of serum proteomics could be more valuable in such analysis. Also, all proteins were measured cross-sectionally and it was not possible to verify if ethnicity specific profile were stable over time. Finally, patient population used in this study included elderly individuals and these results need validation in younger adults 47.
In summary, we provide proof of concept for the accurate prediction of self-reported ethnicity from a set of serum proteins. There exists significant support in literature for differential protein expression by ethnic group. These may inform disparities in disease prevalence, severity or treatment response and may be of value to the development of “precision medicine”.
Methods
Full design of study is outlined in supplementary fig 1. Data were derived from Texas Alzheimer’s Research & Care Consortium (TARCC) project, an ongoing longitudinal cohort study aimed at examining the role of clinical and biological markers in the development and progression of Alzheimer’s disease (AD). Each participant in this study provided written informed consent prior to enrollment, evaluation and biomarker blood draw. Detailed description of the TARCC study design can be found elsewhere 48,49.
The demographic and clinical characteristics of patients are summarized in supplementary Table 1. For this analysis, 530 cognitively healthy elderly individuals (Hispanics n= 312, NHWs n= 218) who participated as control subjects in TARCC study were considered. Each participant underwent a standardized evaluation at respective TARCC sites on baseline and then annually. At baseline, all subjects underwent clinical interview, neuopsychological assessment and biomarker blood draw. A consensus group for the respective site determined their cognitive status as cognitively healthy individuals. Study participants with serious medical illness were not considered for the study. Inclusion and exclusion criteria have been discussed previously in details 48.
Proteomic profiling
Non-fasting samples were collected with 10mL serum-separating (tiger-top) vacutainers tubes at the baseline visit. Samples were allowed to clot at room temperature for 30 minutes in a vertical position before being centrifuged at 1300 × g for 10 minutes. Next, 1mL aliquots were transferred into polypropylene cryovial tubes labeled with freezerworks barcodes and placed into −80° C freezers for storage until use.
Serum samples were shipped to Myriad RBM, a CLIA-certified biomarker testing Laboratory based in Austin, TX, USA. The samples were assayed on the luminex-based HumanMAP 1.0 platform. Over 100 proteins were quantified utilizing fluorescent microspheres with protein-specific antibodies. Information regarding the least detectable dose (LDD), inter-run coefficient of variation, dynamic range, overall spiked standard recovery, and cross-reactivity with other HumanMAP analytes can be obtained online (https://rbm.myriad.com/products-services/humanmap-services/humanmap/).
Statistical analysis
Data pre-treatment
Analytes with >15% missing data or value below the lower detection limit (LDL), were excluded from analysis. Out of 124, 107 analytes were included for further analysis and missing values were imputed with k-nearest neighbor (KNN) method with 10 neighbors per analyte based on Euclidean distance. KNN is an imputation method that accounts for the local similarity of the data by identifying similar analytes with similar peak intensity profiles via a distance 50. Next, data were log(10)-transformed and normalized to reduce systemic variance among the data and improve the performance of downstream analysis. Correlation and colinearity was tested to ensure the variables are independent. Pearson correlation was high for some variables like different interleukins which is not completely unexpected, however the Variance Inflation Factors (VIF) were less than 10 for all variables which is accepted threshold for multi-collinearity.
Following potential confounders were considered: age, sex, study site, batch number, APOE status, educational achievement and history of hypertension, diabetes, hyperlipidemia and obesity. Each analyte variable was regressed onto covariates to adjust for the confounding effects on analytes. The standardized residuals (mean of zero and standard deviation of one) from the linear regression were then used in further analysis to classify and predict ethnicity.
Classification methods
Dataset was divided into a training set (75%) and an independent test dataset (25%), and we trained a Random Forest (RF) in the training set of Hispanic and NHW subjects using their proteomic profiles. We evaluated its performance using cross-validation on test-set and scored predictive power in a receiver operating characteristic (ROC) analysis.
We selected Random Forest (RF) model 51, an increasingly popular method in machine learning that characterizes structure in high dimensional data while making no distributional assumptions about the response variable or predictors. RF model (using randomForest, an R package) uses an ensemble of classification trees was employed to develop a Hispanic versus NHW classifier from the training set. RF builds a forest with user defined number of regression trees (ntree) and number of predictors sampled for splitting at each node (mtry); we used 500 classification trees and mtry=Vn. Trees are built by recursive binary partitioning of observations into subsets and each tree is asymptomatically unbiased. A portion of the observations is excluded from the tree building process (i.e. out-of-bag or OOB data) for each individual tree to be used as an independent estimate of the prediction accuracy. At each node in a tree, the variables used to develop that tree are tested for the split in the data that achieves greatest reduction in error. RF used this ensemble of trees to make predictions of class based on majority votes to a given sets of parameters. During this process, “variable importance (VI)” is assessed that measures the relative predictive influence of individual variables on the classification model. VI based on permutation accuracy importance measure is most advanced measure because of its ability to evaluate the variable importance by the mean decrease in accuracy using the internal out-of-bag (OOB) estimates while the forests are constructed 51–53.
Variable selection
Although RF provides a measure of variable importance, it does not automatically choose the optimal number of variables that can yield the good classification accuracy with low misclassification error rate. To identify a set of limited number of variables, we employed two strategies. First, we used the top-20 variables based on VI measure and second, we employed a backward elimination of variables using OOB errors 54 This was performed using R package varSelRF and variable elimination from RF was carried out by successively eliminating the least important variables (with importance as returned from random forest) and minimizing OOB error at the same time. At each iteration, a fraction of variables (default value = 0.2) were excluded from those in the previous forest that lowered computational cost and is coherent with the idea of an “aggressive variable selection” approach. In this process, model resolution gets better as the number of variables considered becomes smaller. After fitting all forests, the selected set of features is the one whose OOB error rate is within u = 1 standard error of the minimum error rate of all forests. This is in agreement with the “1 s.e. rule” commonly used in classification trees literature 55. The objective of these strategies was to select optimal subset of variables while keeping an error rate close to that from the original model. These subsets of variables were then used to classify ethnicity groups in training data and tested with independent test set.
Classification performance
At each step, the performance of the RF was assessed by OOB estimate of error rate 52,56 and a separate accuracy assessment on an independent data set. Confusion matrix was subsequently constructed to compare true class with the class assigned by the classifier and to calculate overall accuracy. Area under the curve (AUC) of the receiver operating characteristic (ROC) curve estimated the predictive performance of models 57,58.
Functional analysis of protein-set
Most significant proteins variables derived from variable selection method were assessed for their functional correlates with the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database (http://string-db.org) 59. STRING allows exploration of how these proteins are inter-related to form protein-protein interaction networks by applying all active interaction sources (experiments, databases and text mining). It can also provide biological functions and specific cellular pathways involving those proteins.
Acknowledgemnts
Author contributions
RJB designed experiments, analyzed data and wrote the paper. RMP pre-processed data and assisted data analysis. All authors edited the paper and DRR supervised the research.
This study was made possible by the Julia and Van Buren Parr endowment for dementia-related research and the Texas Alzheimer’s Research and Care Consortium (TARCC) funded by the state of Texas through the Texas Council on Alzheimer’s Disease and Related Disorders. The results have been presented as a poster at the University of Texas Health Science Center, San Antonio Research Day, 2016.
Investigators from the Texas Alzheimer’s Research and Care Consortium: Baylor College of Medicine: Valory Pavlik, PhD, Paul Massman PhD, Eveleen Darby MA/MS, Monica Rodriguear MA, Aisha Khaleeq MD; Texas Tech University Health Sciences Center: John C. DeToledo, MD, Henrick Wilms MD, PhD, Kim Johnson PhD, Victoria Perez, Michelle Hernandez; University of North Texas Health Science Center: Thomas Fairchild PhD, Janice Knebl DO, Sid E. O’Bryant PhD, James R. Hall PhD, Leigh Johnson PhD, Robert C. Barber PhD, Douglas Mains DrPH, Lisa Alvarez, Adriana Gamboa; University of Texas at Austin: John Bertelson MD; University of Texas Southwestern Medical Center: Perrie Adams PhD, Munro Cullum PhD, Roger Rosenberg MD, Benjamin Williams MD, PhD, Mary Quiceno MD, Joan Reisch PhD, Linda S. Hynan PhD, Ryan Huebinger PhD, Janet Smith BS, Barb Davis MA, Trung Nguyen MD, PhD; University of Texas Health Science Center – San Antonio: Donald Royall MD, Raymond Palmer PhD, Marsha Polk; Texas A&M University Health Science Center: Farida Sohrabji PhD, Steve Balsis PhD, Rajesh Miranda, PhD; University of North Carolina: Kirk C. Wilhelmsen MD, PhD, Jeffrey L. Tilson PhD, Scott Chasse, PhD.
Competing financial interests
Drs. Royall and Palmer have disclosed the results of these analyses to the University of Texas Health Science Center at San Antonio (UTHSCSA), which has filed patent application 2012.039.US1.HSCS and provisional patents 61/603,226 and 61/671,858 relating to the latent variable δ’s construction and biomarkers.
Dr. Royall reports research funding from the Department of Defense (DoD), Eisai and Novartis. Drs. Royall and Palmer receive additional funding from the NIH. Dr. Bishnoi reports no biomedical financial interests or potential conflicts of interest.
Materials and correspondence
Correspondence should be addressed to RB (rbishnoi{at}augusta.edu). Data are from the Texas Alzheimer’s Research and Care Consortium (TARCC) study. Requests for TARCC data may be sent to www.txalzresearch.org or to the below contact: Robert Barber, PhD, Associate Professor, Executive Director, Institute for Molecular Medicine, UNT Health Science Center, Robert.Barber{at}unthsc.edu.