PT - JOURNAL ARTICLE
AU - Maxwell W. Libbrecht
AU - Jeffrey A. Bilmes
AU - William Stafford Noble
TI - Eliminating redundancy among protein sequences using submodular optimization
AID - 10.1101/051201
DP - 2016 Jan 01
TA - bioRxiv
PG - 051201
4099 - http://biorxiv.org/content/early/2016/05/02/051201.short
4100 - http://biorxiv.org/content/early/2016/05/02/051201.full
AB - Motivation: Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success in many fields but is not yet widely used in biology. We apply submodular optimization to the problem of removing redundancy in protein sequence data sets. This is a common step in many bioinformatics and structural biology workflows, including creation of non-redundant training sets for sequence and structural models as well as selection of “operational taxonomic units” from metagenomics data. Results: We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods. In particular, we compare to a widely used heuristic algorithm implemented in software tools such as CD-HIT, as well as to a variety of standard clustering methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is theoretically optimal under some assumptions, and it is flexible and intuitive because it applies generic methods to optimize one of a variety of objective functions.
This application serves as a model for how submodular optimization can be applied to other discrete problems in biology. Availability: Source code is available at https://github.com/mlibbrecht/submodular_sequence_repset. Contact: william-noble{at}uw.edu