Abstract
Codon usage bias (CUB) is pervasive in genomes. Studying its patterns and causes is fundamental for understanding genome evolution. Rapidly emerging large-scale RNA and DNA sequences make studying CUB in many species feasible. Existing software however is limited in incorporating the new data resources. Therefore, I release the software CUA which can compute all popular CUB metrics, including CAI, tAI, Fop, ENC. More importantly, CUA allows users to incorporate user-specific data, such as tRNA abundance and highly expressed genes from considered tissues; this flexibility enables computing CUB metrics for any species with improved accuracy. In sum, CUA eases codon usage studies and establishes a platform for incorporating new metrics in future. CUA is available at http://search.cpan.org/dist/Bio-CUA/ with help documentation and tutorial.
Introduction
One amino acid can be encoded by more than one synonymous codon, and synonymous codons are unevenly used. In particular, some codons are used more often than their synonymous ones in highly expressed genes (Sharp and Li, 1987). To measure the unevenness of codon usage, multiple metrics of codon usage bias (CUB) have been developed, such as Codon Adaptation Index (CAI) (Sharp and Li, 1987), tRNA Adaptation Index (tAI) (dos Reis, et al., 2004), Frequency of optimal codons (Fop) (Ikemura, 1981), and Effective Number of Codons (ENC) (Wright, 1990). With assumptions, all of these metrics except ENC also measure translation efficiency: the larger the metrics, the more efficiently translated the codons are.
CUB can be calculated at two levels: codon and sequence. At the codon level, the metrics measure the relative translation efficiency among codons. For example, codons preferred in highly expressed genes (Sharp and Li, 1987) or matching more cognate tRNAs (Ikemura, 1981) are thought to be translated more efficiently than their synonymous alternatives. At the sequence level, the metrics measure the overall translation efficiency (or, for ENC, codon usage unevenness only) of a sequence, and for CAI and tAI, they are computed by averaging (usually taking the geometric mean) the codon-level values of the sequence’s constituent codons (dos Reis, et al., 2004; Sharp and Li, 1987). Note that ENC can be computed only at the sequence level.
CUB is observed in prokaryotes and eukaryotes (Chen, et al., 2004; Vicario, et al., 2007), and its patterns and causes are informative about translation regulation and natural selection (Akashi, 1994; Bentele, et al., 2013; Bulmer, 1991; Eyre-Walker, 1991; Goodman, et al., 2013; Hambuch and Parsch, 2005; Plotkin and Kudla, 2011; Pop, et al., 2014; Qian, et al., 2012; Singh, et al., 2005). Previously, CUB was studied in a few species where genomes or gene expression data were available. The growing huge amount of DNA and RNA sequence data make studying CUB in many species feasible. For example, tAI can be computed for any sequenced genomes using tRNA copy numbers to approximate tRNA abundance (dos Reis, et al., 2004). For another example, more accurate CAI can be computed by using the highly expressed genes from the considered tissues as the reference gene set other than using a general gene set (Qian, et al., 2012).
Existing software packages, however, are limited in incorporating user-specific data and thus usable only for certain circumstances. codonW (http://codonw.sourceforge.net/) is great for computing sequence-level ENC, CAI, and Fop for a few species whose codon-level CUB values have been integrated, but inconvenient for other species whose codon-level values are not implemented. codonR (http://people.cryst.bbk.ac.uk/∼fdosr01/tAI/) can compute tAI for Escherichia coli, but for other species one need specify tRNA abundance in a not well-documented format, causing inconvenience. DAMBE (Xia, 2013) is flexible in calculating CAI by accepting user-specific codon tables, but this flexibility can be further improved (see below).
To advance codon usage studies, I have developed a new software package CUA, short for Codon Usage Analyzer. The package implements the four popular CUB metrics and is flexible in incorporating user-specific data.
Implementation and usage
The package provides both PERL modules and ready-to-use programs, and can be easily integrated into large-scale data analysis pipeline. Users can write new programs based on the modules or use the provided programs to accomplish their work. The package should work in any operation system with PERL installed. The package is deposited into Comprehensive Perl Archive Network (CPAN) at http://search.cpan.org/dist/Bio-CUA/, and can be installed by typing ‘cpan Bio::CUA’ in the command line.
Overall, the package comprises three main modules: Bio::CUA::CUB::Builder, Bio::CUA::CUB::Calculator and Bio::CUA::Summarizer. The former two modules compute CUB metrics at the codon and sequence levels, respectively, while Bio::CUA::Summarizer supports the computation by providing common functions of sequence processing. See http://search.cpan.org/dist/Bio-CUA/ for more modules included in the package.
One main advantage of CUA is its flexibility in setting user-specific parameters. For example, it can use tRNA abundance of any species in a simple format for computing tAI. It also accepts both codon tables and sequences for CAI calculation, and allows specifying user-specific highly expressed genes other than using a general reference set. It is also flexible in specifying genetic code tables. In the following paragraphs, I will demonstrate some of its flexibilities by giving an example of calculating CUB metrics for genes in the fruitfly Drosophila melanogaster. The commands and data are listed in supplementary Text S1.
First, I compute the codon-level tAI and CAI using the programs tai_codon.pl and cai_codon.pl, respectively. For tAI, I download tRNA copy numbers from GtRNAdb (http://gtrnadb.ucsc.edu/) (Chan and Lowe, 2009) for D. melanogaster, sum them up for each anticodon (tRNA abundance can be used instead if available), and use it as input to the program tai_codon.pl for computing the codon-level tAI. For CAI, I extract the top 200 highly expressed genes as the reference gene set (as in Qian, et al., 2012) based on mRNA expression levels from the Drosophila S2 cell line (Zhang, et al., 2010), download protein-coding sequences from FlyBase (St Pierre, et al., 2014), and use these two datasets as input to the program cai_codon.pl for computing the codon-level CAI.
Second, I compute the sequence-level CUB metrics using the program cub_seq.pl. I feed the program with the codon-level tAI and CAI calculated above and also specify the ENC option, so that the sequence-level tAI, CAI and ENC are calculated along with other parameters such as counts of amino acids and GC content. Users can also refer to the tutorial at http://search.cpan.org/dist/Bio-CUA/lib/Bio/CUA/Tutorial.pod for computing other CUB metrics.
CAI and ENC variants
In addition to the original CAI (Sharp and Li, 1987), CUA also implements two CAI variants, mCAI and bCAI. They differ from the original one in normalizing relative synonymous codon usage (RSCU) (Sharp and Li, 1987). mCAI and bCAI take into account RSCUs expected from even codon usage and from the background data (e.g., RSCUs of lowly expressed genes), respectively. See Text S2 for details. I compare these CAI variants by correlating them with mRNA expression and translation efficiency (determined by ribosome profiling technique (Ingolia, et al., 2009)) using the data of the S2 cell line (Dunn, et al., 2013; Zhang, et al., 2010). The better metric is expected to correlate more strongly. As shown in Table 1, bCAI correlates with mRNA expression and translation efficiency more strongly than CAI and mCAI do, and the latter two perform somewhat equivalently. Therefore, bCAI performs better when predicting mRNA expression levels or translation efficiency, although only slightly because most comparisons are not statistically significant.
For ENC, CUA implements both the original (Wright, 1990) and nucleotide-composition-corrected ENC (Novembre, 2002), and tries to estimate missing F values using a different way (see Text S2 for details). I compare these ENC variants using the protein coding genes in D. melanogaster and using the nucleotide compositions from each gene’s introns to correct nucleotide compositions. The results show that the method for estimating missing F values has little effect on ENC values because most sequences are long enough to obtain all F values, and that, instead, correcting nucleotide composition has strong effect on ENC values as the corrected and non-corrected ENCs are moderately correlated (Fig. S3). Surprisingly, we find the corrected ENCs generally perform worse than non-corrected ones in terms of correlation with mRNA expression levels and translation efficiency (Table 2), though ENC is designed only to measure codon usage unevenness other than to predict mRNA expression or translation efficiency.
Conclusions and Future Directions
In sum, CUA is flexible in incorporating users’ data and able to compute all popular CUB metrics for any species, which will facilitate CUB studies.
Additionally, I plan to implement more CUB metrics and add a graphic user interface based on users’ feedback. I also expect expanding the package by more contributors because of the great collaborating environment through CPAN.
Competing Interests
The author declares there are no competing interests.
Funding
This work is supported by funds from the David & Lucile Packard Foundation and the University of Rochester.
Acknowledgements
I am very grateful to Dr. Daven Presgraves for supporting this project and also thank Dr. Harry Stern and other Bluehive staffs at University of Rochester for their excellent support for computation.