Abstract
Multiple sequence alignment is a fundamental problem in bioinformatics, and a large number of tools exist to align sequences under a prescribed (biologically inspired) objective function. Practitioners often align sequences using a tool’s default parameters, yet a different parameter setting may yield a much higher-quality alignment for the specific set of input sequences. One highly accurate method for choosing a parameter vector for a specific input is Parameter Advising, which selects from a set of alignments produced using a carefully constructed collection of parameter configurations. Ideally, the candidate alignments would be ranked by their true accuracy, but in practice a reference alignment from which to calculate this measure is not available, so the accuracy of each alignment must be estimated instead. The accuracy estimator Facet (short for Feature-based accuracy estimator) computes a single estimate of accuracy as a linear combination of efficiently computable feature functions. We introduce Facet-NN and Facet-LR, which both use the same underlying feature functions as Facet (as these were shown to be accurate) but, because they are built on highly efficient machine learning protocols, can take advantage of a much larger training corpus. This evolution not only allows us to train on much larger datasets but also produces an estimator that is more strongly correlated with true accuracy. When used in Parameter Advising, Facet-NN and Facet-LR show an increase of 6% over using only the default parameter vector, which is a 2% increase over using Facet for the same task.
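The advising step described above can be illustrated with a minimal sketch: score each candidate alignment with a linear combination of feature functions and keep the highest-scoring one. Everything in it is an assumption made for illustration. The two toy features, the weights, and the names `estimate_accuracy` and `advise` are hypothetical and are not the feature functions or coefficients used by Facet. A learned estimator over the same features could presumably be swapped in for `estimate_accuracy`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Alignment = List[str]  # equal-length rows; '-' marks a gap
Feature = Callable[[Alignment], float]


def non_gap_fraction(aln: Alignment) -> float:
    """Toy feature (hypothetical): fraction of cells that are not gaps."""
    cells = sum(len(row) for row in aln)
    gaps = sum(row.count("-") for row in aln)
    return (cells - gaps) / cells


def conserved_column_fraction(aln: Alignment) -> float:
    """Toy feature (hypothetical): fraction of gap-free columns with one residue."""
    columns = list(zip(*aln))
    conserved = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return conserved / len(columns)


def estimate_accuracy(
    aln: Alignment,
    features: Sequence[Feature],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Linear combination of feature values: bias + sum_i w_i * f_i(aln)."""
    return bias + sum(w * f(aln) for f, w in zip(features, weights))


def advise(
    candidates: Dict[str, Alignment],
    features: Sequence[Feature],
    weights: Sequence[float],
) -> Tuple[str, float]:
    """Return the candidate name with the highest estimated accuracy."""
    scores = {
        name: estimate_accuracy(aln, features, weights)
        for name, aln in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


FEATURES = [non_gap_fraction, conserved_column_fraction]
WEIGHTS = [0.4, 0.6]  # hypothetical coefficients, not fitted to any data

# Hypothetical candidate alignments of the same three sequences,
# as if produced under two different parameter configurations.
CANDIDATES = {
    "default": ["AC-GT", "A-CGT", "ACG-T"],
    "alternative": ["ACGT", "ACGT", "ACGT"],
}

best_name, best_score = advise(CANDIDATES, FEATURES, WEIGHTS)
print(best_name, round(best_score, 2))
```

In this toy setting the gap-free candidate receives the higher estimate and is the one the advisor returns. No reference alignment is consulted at any point, which is the property that makes an estimator usable for advising.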
Competing Interest Statement
The authors have declared no competing interest.