TY - JOUR T1 - Improved Algorithms for Finding Edit Distance Based Motifs JF - bioRxiv DO - 10.1101/020131 SP - 020131 AU - Soumitra Pal AU - Sanguthevar Rajasekaran Y1 - 2015/01/01 UR - http://biorxiv.org/content/early/2015/05/31/020131.abstract N2 - Motif search is an important step in extracting meaningful patterns from biological data. Since the general problem of motif search is intractable, there is a pressing need to develop efficient exact and approximation algorithms to solve this problem. We design novel algorithms for solving the Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input strings with at most d substitutions, insertions and deletions. These algorithms have been evaluated on several challenging instances. Our algorithm solves a moderately hard instance (11, 3) in a couple of minutes and the next difficult instance (14, 3) in a couple of hours whereas the best previously known algorithm, EMS1, solves (11, 3) in a few hours and does not solve (13, 4) even after 3 days. This significant improvement is due to a novel and provably efficient neighborhood generation technique introduced in this paper. This efficient approach can be used in other edit distance based applications in Bioinformatics, such as k-spectrum based sequence error correction algorithms. We also use a trie based data structure to efficiently store the candidate motifs in the neighbourhood and to output the motifs in a sorted order. ER -