Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi

Christopher B. Cole; Sejal Patel; Leon French; Jo Knight

doi:10.1101/073460

Abstract

Recent growth in both the scale and the scope of large, publicly available ontologies has spurred the development of computational methodologies that can leverage structured information to answer important questions. However, ontological labels, or “terms”, have thus far proved difficult to use in practice: text mining, a crucial part of electronically parsing and understanding the biomedical literature, has historically had difficulty identifying terms in free text. In this article, we present goldi, an open source R package for identifying terms of variable length in free-form text. It is available at https://github.com/Chris1221/goldi or through CRAN. The algorithm identifies which words of each term, or synonyms of those words, are present in the text, and compares the number of words found to an acceptance function to decide whether the term is present. We present the theoretical rationale behind the algorithm, as well as practical advice for its usage applied to Gene Ontology term identification and quantification. We additionally detail the available options and describe their respective computational efficiencies.
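
The matching idea described above can be illustrated with a minimal sketch. This is not goldi's code or API (goldi is an R package); the function names, the tokenizer, the synonym table, and in particular the acceptance rule (all words required for terms of one or two words, all but one otherwise) are assumptions chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def acceptance_threshold(n_words):
    """Hypothetical acceptance function: number of term words that must be found.

    Assumed rule for illustration only: every word for terms of one or two
    words, all but one word for longer terms.
    """
    return n_words if n_words <= 2 else n_words - 1


def match_terms(text, terms, synonyms=None):
    """Return the terms whose count of present words meets the threshold.

    `synonyms` maps a term word to alternative words that count as a match.
    """
    synonyms = synonyms or {}
    doc_words = tokenize(text)
    matched = []
    for term in terms:
        term_words = tokenize(term)
        if not term_words:
            continue  # skip empty terms rather than accept them trivially
        present = sum(
            1
            for word in term_words
            if word in doc_words
            or any(s in doc_words for s in synonyms.get(word, ()))
        )
        if present >= acceptance_threshold(len(term_words)):
            matched.append(term)
    return matched


if __name__ == "__main__":
    text = "We observed increased cell death following regulation of apoptosis in neurons."
    terms = [
        "regulation of apoptotic process",
        "cell death",
        "mitochondrial membrane organization",
    ]
    synonyms = {"apoptotic": ["apoptosis"]}  # made-up synonym table
    print(match_terms(text, terms, synonyms))
```

In this toy input, "regulation of apoptotic process" is accepted because three of its four words are found (one through a synonym), which meets the assumed threshold for a four-word term; the third term is rejected because none of its words appear.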

Footnotes

*CCole019{at}uOttawa.ca

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.