dialogi: Utilising NLP with chemical and disease similarities to drive the identification of Drug-Induced Liver Injury literature

Nicholas M Katritsis; Anika Liu; Gehad Youssef; Sanjay Rathee; Méabh MacMahon; Woochang Hwang; Lilly Wollman; Namshik Han

doi:10.1101/2022.03.11.483929

ABSTRACT

Drug-Induced Liver Injury (DILI), despite its low occurrence rate, can cause severe side effects or even lead to death. It is therefore one of the leading reasons for terminating the development of new drugs and for restricting the use of those already in circulation. Moreover, its multifactorial nature, combined with a clinical presentation that often mimics other liver diseases, complicates the identification of DILI-related literature, which remains the main medium for sourcing results from clinical practice and experimental studies. In this work, a contribution to the ‘Literature AI for DILI Challenge’ of the Critical Assessment of Massive Data Analysis (CAMDA) 2021, we present an automated pipeline for distinguishing between DILI-positive and DILI-negative papers. We used Natural Language Processing (NLP) to filter out the uninformative parts of a text and to identify and extract mentions of chemicals and diseases. We combined that information with small-molecule and disease embeddings, which are capable of capturing chemical and disease similarities, to improve classification performance. The small-molecule embeddings are sourced directly from the Chemical Checker (CC). For the disease embeddings, we collected data that encode different aspects of disease similarity from the National Library of Medicine’s (NLM) Medical Subject Headings (MeSH) thesaurus and the Comparative Toxicogenomics Database (CTD). Following a procedure similar to the one used in the CC, vector representations for diseases were learnt and evaluated. Two Neural Network (NN) classifiers were developed: one that only accepts texts as input (baseline model) and an augmented classifier that also utilises chemical and disease embeddings (extended model). We trained, validated, and tested the models through a Nested Cross-Validation (NCV) scheme with 10 outer and 5 inner folds. Under this scheme, the baseline and extended models performed virtually identically, with macro F₁-scores of 95.04 ± 0.61% and 94.80 ± 0.41%, respectively.
On an external, withheld dataset representing imbalanced data, the extended model achieved an F₁-score of 91.14 ± 1.62%, outperforming its baseline counterpart, which scored 88.30 ± 2.44%. We make further comparisons between the classifiers and discuss future improvements and directions, including utilising chemical and disease embeddings for visualisation and exploratory analysis of the DILI-positive literature.
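
To make the evaluation scheme concrete, the following is a minimal sketch of how a nested cross-validation split with 10 outer and 5 inner folds, and a macro F₁-score, could be computed. It is not the authors' implementation: the function names `k_fold_indices`, `nested_cv_splits`, and `macro_f1` are hypothetical, and the folds here are contiguous and unshuffled, whereas the actual pipeline may shuffle or stratify the data.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Assumption: no shuffling or stratification, for simplicity.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n_samples, outer_k=10, inner_k=5):
    """Yield (outer_fold, train_idx, val_idx, test_idx) for each outer/inner pair.

    The outer fold is held out for testing; the remaining samples are split
    again into inner folds used for training and validation.
    """
    for outer, test_idx in enumerate(k_fold_indices(n_samples, outer_k)):
        held_out = set(test_idx)
        rest = [i for i in range(n_samples) if i not in held_out]
        for inner in k_fold_indices(len(rest), inner_k):
            val_idx = [rest[j] for j in inner]
            val_set = set(val_idx)
            train_idx = [i for i in rest if i not in val_set]
            yield outer, train_idx, val_idx, test_idx


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)
```

With the default arguments, `nested_cv_splits` yields 10 × 5 = 50 train/validation/test partitions. Because the macro average weights each class equally, it penalises poor performance on the minority class, which is relevant for imbalanced data such as the external dataset described above.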

Competing Interest Statement

A.L. is funded by GlaxoSmithKline (GSK). S.R. is funded by JW Pharmaceutical. M.M. is an employee of LifeArc. W.H. and N.H. are funded by LifeArc. N.H. is a co-founder of KURE.ai and CardiaTec Biosciences, and an advisor at Biorelate, Promatix, Standigm, and VeraVerse.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.