Abstract
The availability of large-scale biobanks linking rich phenotypes and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive non-random missing data. Machine learning methods are able to predict missing data, but performance is significantly impaired by the block-wise missingness inherent to many biobanks. To address this, we developed Missingness Adapted Group-wise Informed Clustered LASSO (MAGIC-LASSO), which performs hierarchical clustering of variables based on missingness followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and for balance between training and target sets, with final models built using stepwise inclusion of features ranked by completeness. We applied the method in the UK Biobank (n>500k) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. The phenotypic correlation between measured and predicted total score was 0.67, while genetic correlations between independent subjects were >0.86, demonstrating that the method has significant accuracy and utility.
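The first step described above, grouping variables by their missingness, can be illustrated with a minimal sketch. This is not the authors' implementation: the distance measure (fraction of subjects for whom exactly one of two variables is missing), the average-linkage rule, the fixed number of clusters, and the toy variable names are all assumptions made for illustration. The Group LASSO step is not shown.

```python
from itertools import combinations


def missingness_distance(a, b):
    """Fraction of subjects for whom exactly one of the two variables is missing."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_missingness(data, n_clusters):
    """Average-linkage agglomerative clustering of variables by missingness pattern.

    data: dict mapping variable name -> list of values, with None marking missing.
    Returns a sorted list of clusters, each a sorted list of variable names.
    """
    # Boolean missingness indicator for every variable.
    miss = {name: [v is None for v in values] for name, values in data.items()}

    def avg_dist(pair):
        # Mean pairwise distance between the members of two clusters.
        c1, c2 = pair
        total = sum(missingness_distance(miss[a], miss[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    # Start with one cluster per variable and merge the closest pair each round.
    clusters = [[name] for name in miss]
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=avg_dist)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(sorted(c1 + c2))
    return sorted(clusters)


# Hypothetical toy data with block-wise missingness: questionnaire items are
# missing for the first three subjects, lab measures for the last three.
toy = {
    "q1": [None, None, None, 1, 2, 3],
    "q2": [None, None, None, 0, 1, 1],
    "lab1": [5.1, 4.8, 5.0, None, None, None],
    "lab2": [1.2, 1.1, 1.3, None, None, None],
}
clusters = cluster_by_missingness(toy, n_clusters=2)
```

In this toy case the two questionnaire items fall into one cluster and the two lab measures into the other, so a model fitted within a cluster uses variables that are observed on largely the same subjects. For data at biobank scale, an optimized routine such as `scipy.cluster.hierarchy.linkage` would be a more practical choice than this quadratic pure-Python loop.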
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Robert M. Kirkpatrick, Ph.D., Assistant Professor, Virginia Commonwealth University, Virginia Institute for Psychiatric and Behavioral Genetics, Department of Psychiatry, Robert.Kirkpatrick{at}vcuhealth.org
Roseann E. Peterson, Ph.D., Assistant Professor, Virginia Commonwealth University, Virginia Institute for Psychiatric and Behavioral Genetics, Department of Psychiatry, Peterson.Roseann{at}gmail.com
Bradley Todd Webb, Ph.D., Omics Research Scientist, Principal Investigator, RTI International, GenOmics, Bioinformatics, and Translational Research Center, Biostatistics and Epidemiology Division, bwebb{at}rti.org