TY  - JOUR
T1  - Evaluating disease similarity using latent Dirichlet allocation
JF  - bioRxiv
DO  - 10.1101/030593
SP  - 030593
AU  - James Frick
AU  - Rajarshi Guha
AU  - Tyler Peryea
AU  - Noel T. Southall
Y1  - 2015/01/01
UR  - http://biorxiv.org/content/early/2015/11/03/030593.abstract
N2  - Measures of similarity between diseases have been used for applications from discovering drug-target interactions to identifying disease-gene relationships. It is challenging to quantitatively compare diseases because much of what we know about them is captured in free text descriptions. Here we present an application of Latent Dirichlet Allocation as a way to measure similarity between diseases using textual descriptions. We learn latent topic representations of text from Online Mendelian Inheritance in Man records and use them to compute similarity. We assess the performance of this approach by comparing our results to manually curated relationships from the Disease Ontology. Despite being unsupervised, our model recovers a record’s curated Disease Ontology relations with a mean Receiver Operating Characteristic Area Under the Curve of 0.80. With low dimensional models, topics tend to represent higher level information about affected organ systems, while higher dimensional models capture more granular genetic and phenotypic information. We examine topic representations of diseases for mapping concepts between ontologies and for tagging existing text with concepts. We conclude topic modeling on disease text leads to a robust approach to computing similarity that does not depend on keywords or ontology.
ER  - 