Applying Lexical Link Analysis to Discover Insights from Public Information on COVID-19

Ying Zhao; Charles C. Zhou

doi:10.1101/2020.05.06.079798

Abstract

SARS-CoV-2, a deadly novel virus, has caused a worldwide pandemic and drastic loss of human lives and economic activity. An open data set called the COVID-19 Open Research Dataset (CORD-19) contains a large collection of full-text scientific literature on SARS-CoV-2. The Next Strain database contains SARS-CoV-2 viral genomes collected since 12/3/2019. We applied a unique information mining method named lexical link analysis (LLA) to answer the call to action and help the science community answer high-priority scientific questions related to SARS-CoV-2. We first text-mined CORD-19. We then data-mined the Next Strain database. Finally, we linked the two databases. The linked databases and information can be used to discover insights and help the research community address high-priority questions related to SARS-CoV-2’s genetics, tests, and prevention.

Significance Statement

In this paper, we show how to apply a unique information mining method, lexical link analysis (LLA), to link unstructured (CORD-19) and structured (Next Strain) data sets to relevant publications and to integrate text and data mining into a single platform. The resulting insights can be visualized and validated to answer high-priority questions about the genetics, incubation, treatment, symptoms, and prevention of COVID-19.