Abstract
The rapid growth of the scientific literature calls for efficient data mining methods that improve the extraction of useful information from text. In this manuscript, we use a data and text mining method to identify fusion proteins and their protein-protein interactions in published biomedical text. The extracted fusion proteins and their protein-protein interactions serve as a training set for a Naïve Bayes classifier, which is then used for the final identification on a testing dataset of 1817 fusions. Our method comprises a literature corpus; text and annotation mappers; keywords, rule bases, negative tokens, and a pattern extractor; a synonym tagger, normalization, and a regular expression mapper; and a Naïve Bayes classifier. We classified 1817 unique fusion proteins and their corresponding 2908 protein-protein interactions across 18 cancer types. The method can therefore be used to screen the literature for mentions of unique fusions, which can then be used in downstream analysis. It is available at http://protfus.md.biu.ac.il/.
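To make the pipeline described above more concrete, the following is a minimal sketch of three of its ingredients: a regular expression mapper, a negative-token filter, and a Naïve Bayes classifier with Laplace smoothing. It is an illustration under stated assumptions, not the implementation behind the published tool. The pattern `FUSION_RE`, the `NEGATIVE_TOKENS` list, the class labels, and the toy training sentences are all hypothetical, and the pattern is deliberately crude (it would also match non-fusion terms such as "RT-PCR").

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pattern: two gene-like symbols joined by "-", "/", or "::".
FUSION_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,7})(?:-|/|::)([A-Z][A-Z0-9]{1,7})\b")
# Illustrative negative tokens; the actual list is not given in the text.
NEGATIVE_TOKENS = {"no", "not", "absence", "negative", "without"}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def find_fusions(sentence):
    """Return (gene_a, gene_b) pairs, or [] if a negative token is present."""
    if NEGATIVE_TOKENS & set(tokenize(sentence)):
        return []
    return FUSION_RE.findall(sentence)


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(sentence):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data (hypothetical sentences and labels).
train = [
    ("BCR-ABL1 fusion protein interacts with GRB2", "interaction"),
    ("EML4-ALK fusion binds to HSP90", "interaction"),
    ("Patients were treated for twelve weeks", "other"),
    ("The cohort included forty patients", "other"),
]
clf = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
```

In this sketch, `find_fusions` plays the role of the pattern extractor and `clf.predict` labels a candidate sentence. A real system would additionally need synonym tagging and gene-name normalization, which are omitted here.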