TY  - JOUR
T1  - Breaking bud: probing the scalability limits of phylogenetic network inference methods
JF  - bioRxiv
DO  - 10.1101/056572
SP  - 056572
AU  - Hussein A Hejase
AU  - Kevin J Liu
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/06/01/056572.abstract
N2  - Background: Branching events in phylogenetic trees reflect strictly bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown. In this study, we quantify the performance of state-of-the-art phylogenetic network inference methods on large-scale datasets using empirical data sampled from natural mouse populations and synthetic data capturing a wide range of evolutionary scenarios. Results: We find that, as in the case of phylogenetic tree inference, the performance of leading network inference methods is negatively impacted by both dimensions of dataset scale. In general, we found that topological accuracy degrades as the number of taxa increases; a similar effect was observed with increased sequence mutation rate. The most accurate methods were probabilistic inference methods which maximize either likelihood under coalescent-based models or pseudo-likelihood approximations to the model likelihood.
Furthermore, probabilistic inference methods with optimization criteria which did not make use of gene tree root and/or branch length information performed best, a result that runs contrary to widely held assumptions in the literature. The improved accuracy obtained with probabilistic inference methods comes at a computational cost in terms of runtime and main memory usage, which quickly become prohibitive as dataset size grows past thirty taxa. Conclusions: We conclude that the state of the art of phylogenetic network inference lags well behind the scope of current phylogenomic studies. New algorithmic development is critically needed to address this methodological gap. Abbreviations: ILS, incomplete lineage sorting; MP, maximum parsimony; MDC, minimize deep coalescence; MLE, maximum likelihood estimation; MPL, maximum pseudo-likelihood; SNaQ, Species Networks applying Quartets; HMM, hidden Markov model; RF distance, Robinson-Foulds distance.
ER  -