Exploiting Large Datasets Improves Accuracy Estimation for Multiple Sequence Alignment

Luis Cedillo; Hector Richart Ruiz; Dan DeBlasio

doi:10.1101/2022.05.22.493004

Abstract

Multiple sequence alignment is a fundamental problem in bioinformatics, and a large number of tools exist to align sequences under a prescribed (biologically inspired) objective function. Practitioners often run these tools with their default parameters, even though a different parameter setting may produce a much higher-quality alignment for a specific set of input sequences. One highly accurate method for choosing a parameter vector for a specific input is Parameter Advising, which selects from a set of alignments produced using a carefully constructed collection of parameter configurations. Ideally, the candidate alignments would be ranked by their true accuracy, but in practice no reference alignment is available from which to calculate it, so the accuracy of each alignment must be estimated instead. The accuracy estimator Facet (short for Feature-based accuracy estimator) computes a single estimate of accuracy as a linear combination of efficiently computable feature functions. We introduce Facet-NN and Facet-LR, which use the same underlying feature functions as Facet (as they were shown to be accurate) but, because they are built on top of highly efficient machine learning protocols, can take advantage of a much larger training corpus. This change not only allows us to train on much larger datasets, it also produces an estimator that is more strongly correlated with true accuracy. When used in Parameter Advising, Facet-NN and Facet-LR show an accuracy increase of 6% over using only the default parameter vector, which is a 2% increase over using Facet for the same task.
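
The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two feature functions, the weights, and the names `linear_estimate` and `advise` are hypothetical stand-ins chosen only to show the idea of scoring each candidate alignment with a linear combination of features and keeping the highest-scoring one. Facet's actual feature functions and learned coefficients are not given in this abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# An alignment is a list of equal-length rows; '-' marks a gap.
Alignment = List[str]
Feature = Callable[[Alignment], float]


def gap_free_fraction(aln: Alignment) -> float:
    """Toy feature (assumption): fraction of columns containing no gap."""
    columns = list(zip(*aln))
    return sum("-" not in col for col in columns) / len(columns)


def conserved_fraction(aln: Alignment) -> float:
    """Toy feature (assumption): fraction of gap-free columns with one residue."""
    columns = list(zip(*aln))
    conserved = sum("-" not in col and len(set(col)) == 1 for col in columns)
    return conserved / len(columns)


def linear_estimate(aln: Alignment,
                    features: Sequence[Feature],
                    weights: Sequence[float],
                    bias: float = 0.0) -> float:
    """Estimate accuracy as a linear combination of feature values."""
    return bias + sum(w * f(aln) for f, w in zip(features, weights))


def advise(candidates: Dict[str, Alignment],
           estimator: Callable[[Alignment], float]) -> Tuple[str, float]:
    """Return the name and estimate of the highest-scoring candidate."""
    best = max(candidates, key=lambda name: estimator(candidates[name]))
    return best, estimator(candidates[best])


# Hypothetical candidates, one per parameter configuration.
candidates = {
    "params_A": ["ACGT", "ACGT", "AC-T"],
    "params_B": ["ACGT", "A-GT", "-CTT"],
}

# Hypothetical weights; a real estimator would learn these from training data.
features = [gap_free_fraction, conserved_fraction]
weights = [0.5, 0.5]

choice, score = advise(
    candidates,
    lambda aln: linear_estimate(aln, features, weights),
)
print(choice, score)
```

In this sketch the estimator is passed to `advise` as a plain callable, so a learned model (for example a regression or neural network over the same feature values) could be swapped in without changing the selection logic.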

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

https://github.com/deblasiolab/Facet-NN

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.