Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa

Mol Biol Evol. 1997 Apr;14(4):428-41. doi: 10.1093/oxfordjournals.molbev.a025779.

Abstract

The reconstruction of phylogenetic history is predicated on being able to accurately establish hypotheses of character homology, which involves sequence alignment for studies based on molecular sequence data. In an empirical study investigating nucleotide sequence alignment, we inferred phylogenetic trees for 43 species of the Apicomplexa and 3 of Dinozoa based on complete small-subunit rDNA sequences, using six different multiple-alignment procedures: manual alignment based on the secondary structure of the 18S rRNA molecule, and automated similarity-based alignment algorithms using the PileUp, ClustalW, TreeAlign, MALIGN, and SAM computer programs. Trees were constructed using neighbor-joining, weighted-parsimony, and maximum-likelihood methods. All of the multiple sequence alignment procedures yielded the same basic structure for the estimate of the phylogenetic relationship among the taxa, which presumably represents the underlying phylogenetic signal. However, the placement of many of the taxa was sensitive to the alignment procedure used, and the different alignments produced trees that were on average more dissimilar from each other than were the trees produced by the different tree-building methods. The multiple alignments from the different procedures varied greatly in length, but aligned sequence length was not a good predictor of the similarity of the resulting phylogenetic trees. We also systematically varied the gap weights (the relative cost of inserting a new gap into a sequence or extending an already-existing gap) for the ClustalW program, and this produced alignments that were at least as different from each other as those produced by the different alignment algorithms. Furthermore, there was no combination of gap weights that produced the same tree as that from the structure alignment, in spite of the fact that many of the alignments were similar in length to the structure alignment. We also investigated the phylogenetic information content of the helical and nonhelical regions of the rDNA, and conclude that the helical regions are the most informative. We therefore conclude that many of the literature disagreements concerning the phylogeny of the Apicomplexa are probably based on differences in sequence alignment strategies rather than differences in data or tree-building methods.
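
The gap weights described in the abstract are commonly modelled as an affine penalty, in which opening a gap and extending it are charged separately. The following is a minimal sketch of that idea, not the scoring scheme of ClustalW or of the study itself: the function names, the parameter names `gap_open` and `gap_extend`, the default values, and the convention that the opening cost covers the first gap position are all assumptions made for illustration.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one contiguous gap of `length` positions.

    Assumed convention: the opening cost pays for the first position,
    and every further position adds one extension cost.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_cost(aligned_seq, gap_open=10.0, gap_extend=0.5, gap_char="-"):
    """Sum the affine cost over every run of gap characters in one aligned sequence."""
    total = 0.0
    run = 0  # length of the gap run currently being scanned
    for ch in aligned_seq:
        if ch == gap_char:
            run += 1
        else:
            total += affine_gap_cost(run, gap_open, gap_extend)
            run = 0
    # Account for a gap run that reaches the end of the sequence.
    total += affine_gap_cost(run, gap_open, gap_extend)
    return total


# Hypothetical aligned sequences: the same three gap positions,
# placed either as one run or as three separate gaps.
one_long_gap = total_gap_cost("AC---GT")
three_short_gaps = total_gap_cost("A-C-G-T")
```

Under these illustrative weights a single three-position gap costs much less than three one-position gaps, which is why changing the ratio of the two weights can shift where an alignment program places its gaps.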

MeSH terms

  • Algorithms
  • Animals
  • Apicomplexa / genetics*
  • Computer Simulation
  • DNA, Ribosomal / genetics*
  • Likelihood Functions
  • Molecular Sequence Data
  • Phylogeny*
  • RNA, Ribosomal, 18S / genetics*
  • Sequence Alignment

Substances

  • DNA, Ribosomal
  • RNA, Ribosomal, 18S

Associated data

  • GENBANK/L07375
  • GENBANK/L16996
  • GENBANK/L16997
  • GENBANK/L19068
  • GENBANK/L19069
  • GENBANK/L24380
  • GENBANK/L24381
  • GENBANK/L24382
  • GENBANK/L24383
  • GENBANK/L24384
  • GENBANK/L25642
  • GENBANK/M14599
  • GENBANK/M19712
  • GENBANK/M64244
  • GENBANK/M97703
  • GENBANK/U00458
  • GENBANK/U03069
  • GENBANK/U03070
  • GENBANK/U03071
  • GENBANK/U07812
  • GENBANK/X64340
  • GENBANK/X64341
  • GENBANK/X64342
  • GENBANK/X64343
  • GENBANK/X65508
  • GENBANK/X68523
  • GENBANK/X75429
  • GENBANK/X75430
  • GENBANK/X75453
  • GENBANK/X75762