Abstract
A major shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting (ILS). Coalescence methods explicitly address this problem, but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescence methods, retroelement insertions have emerged as powerful phylogenomic markers for species tree estimation. We show that two recently proposed methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the species tree under the multispecies coalescent model, with retroelement insertions following a neutral infinite sites model of mutation. The accuracy of these and other methods for inferring species trees with retroelements has not been assessed in simulation studies. We simulate retroelements for four different species trees, including three with short branch lengths in the anomaly zone, and assess the performance of eight different methods for recovering the correct species tree. We also examine whether ASTRAL_BP recovers accurate internal branch lengths for internodes of various lengths (in coalescent units). Our results indicate that two recently proposed ILS-aware methods, ASTRAL_BP and SDPquartets, as well as the newly proposed ASTRID_BP, always recover the correct species tree on data sets with large numbers of retroelements even when there are extremely short species-tree branches in the anomaly zone. Dollo parsimony performed almost as well as these ILS-aware methods. By contrast, unordered parsimony, polymorphism parsimony, and MDC recovered the correct species tree in the case of a pectinate tree with four ingroup taxa in the anomaly zone, but failed to recover the correct tree in more complex anomaly-zone situations with additional lineages impacted by extensive incomplete lineage sorting. 
Camin-Sokal parsimony always reconstructed an incorrect tree in the anomaly zone. ASTRAL_BP accurately estimated branch lengths when internal branches were very short, as in anomaly-zone situations, but branch lengths were upwardly biased by more than 35% when species tree branches were longer. We derive a mathematical correction for these distortions, assuming the expected number of new retroelement insertions per generation is constant across the species tree. We also show that short branches do not need to be corrected even when this assumption does not hold; therefore, the branch length estimates produced by ASTRAL_BP may provide insight into whether an estimated species tree is in the anomaly zone.
Competing Interest Statement
The authors have declared no competing interest.