Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

T. M. Porter; M. Hajibabaei

doi:10.1101/2021.01.24.427982

Abstract

Background Pseudogenes are non-functional copies of protein-coding genes that typically follow a different molecular evolutionary path than functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analyses can lead to misleading results. None of the most widely used bioinformatic pipelines for processing marker gene (metabarcode) high-throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for obvious pseudogenes in large COI metabarcode datasets. We do this by: 1) describing gene and pseudogene characteristics from a simulated DNA barcode dataset, 2) showing the impact of two different pseudogene removal methods on mock metabarcode datasets with simulated pseudogenes, and 3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes.
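To make the two screening criteria concrete, the following is a minimal sketch of how open reading frame length and profile HMM bit scores could each be used to flag putative pseudogenes. It is not the pipeline described in this study. The stop codons assume the invertebrate mitochondrial genetic code (NCBI translation table 5), the interquartile-range cutoff is a hypothetical choice because the abstract does not state the thresholds used, and the bit scores are assumed to have been computed beforehand by an external profile HMM search tool. The function names are illustrative only.

```python
from statistics import quantiles

# Assumption: stop codons of the invertebrate mitochondrial genetic code
# (NCBI translation table 5). Other taxa would need a different set.
STOP_CODONS = {"TAA", "TAG"}


def longest_orf_length(seq: str) -> int:
    """Length (nt) of the longest stop-free codon run over the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current open reading frame
            else:
                run += 3
                best = max(best, run)
    return best


def low_outliers(values: dict[str, float], k: float = 1.5) -> set[str]:
    """IDs whose value is below Q1 - k * IQR (hypothetical cutoff, not from the paper)."""
    q1, _, q3 = quantiles(values.values(), n=4)
    cutoff = q1 - k * (q3 - q1)
    return {seq_id for seq_id, value in values.items() if value < cutoff}


def flag_putative_pseudogenes(seqs: dict[str, str],
                              bit_scores: dict[str, float]) -> set[str]:
    """Flag sequences with an unusually short ORF or an unusually low bit score."""
    orf_lengths = {seq_id: float(longest_orf_length(s)) for seq_id, s in seqs.items()}
    return low_outliers(orf_lengths) | low_outliers(bit_scores)
```

In this sketch a sequence is flagged if it fails either criterion. Whether the two criteria should be combined or applied as alternatives is a design choice that the abstract leaves open.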

Results Our simulations showed that it was more difficult to identify pseudogenes from shorter amplicon sequences, such as those typically used in metabarcoding (∼300 bp), than from the full-length DNA barcodes used to construct barcode libraries (∼650 bp). It was also more difficult to identify pseudogenes in datasets containing a high percentage of pseudogene sequences. We show that existing bioinformatic pipelines used to process metabarcode sequences already remove some apparent pseudogenes, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove more.

Conclusions The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets. There is more to learn from COI pseudogenes such as their frequency in DNA barcode and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI pseudogenes to public databases to facilitate future studies.

Competing Interest Statement

The authors have declared no competing interest.

List of abbreviations

BLAST: basic local alignment search tool
BOLD: Barcode of Life Data System
COI: cytochrome c oxidase subunit 1 gene
dN/dS: ratio of non-synonymous to synonymous substitutions
ESV: exact sequence variant
GC content: guanine-cytosine content
HMM: hidden Markov model
ITS: internal transcribed spacer region in the ribosomal RNA operon
K2P: Kimura 2-parameter model of nucleotide substitution
matK: maturase K gene
mtDNA: mitochondrial DNA
nuMT: nuclear encoded mitochondrial sequence
NCBI: National Center for Biotechnology Information
ORF: open reading frame
OTU: operational taxonomic unit
rbcL: ribulose bisphosphate carboxylase large chain gene

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.