RT Journal Article
SR Electronic
T1 Semi-Supervised Learning of the Electronic Health Record with Denoising Autoencoders for Phenotype Stratification
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 039800
DO 10.1101/039800
A1 Brett K. Beaulieu-Jones
A1 Casey S. Greene
YR 2016
UL http://biorxiv.org/content/early/2016/02/18/039800.abstract
AB Patient interactions with health care providers result in entries to electronic health records (EHRs). EHRs were built for clinical and billing purposes but contain many data points about an individual. Mining these records provides opportunities to extract electronic phenotypes that can be paired with genetic data to identify genes underlying common human diseases. This task remains challenging: high-quality phenotyping is costly and requires physician review; many fields in the records are sparsely filled; and our definitions of diseases continue to improve over time. Here we develop and evaluate a semi-supervised learning method for EHR phenotype extraction that uses denoising autoencoders for phenotype stratification. By combining denoising autoencoders with random forests, we find classification improvements across simulation models, particularly in cases where only a small number of patients have high-quality phenotypes. This situation is commonly encountered in research with EHRs. Denoising autoencoders also perform dimensionality reduction, allowing visualization and clustering for the discovery of new disease subtypes. This method represents a promising approach to clarify disease subtypes and improve genotype-phenotype association studies that leverage EHRs.