PT - JOURNAL ARTICLE
AU - Santiago Herrera
AU - Paula H. Reyes-Herrera
AU - Timothy M. Shank
TI - Genome-wide predictability of restriction sites across the eukaryotic tree of life
AID - 10.1101/007781
DP - 2014 Jan 01
TA - bioRxiv
PG - 007781
4099 - http://biorxiv.org/content/early/2014/08/08/007781.short
4100 - http://biorxiv.org/content/early/2014/08/08/007781.full
AB - High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes - generally known as restriction-site associated DNA sequencing (RAD-seq) - is now one of the most commonly used strategies to generate single nucleotide polymorphism data in eukaryotes. The choice of restriction enzyme is critical for the design of any RAD-seq study, as it determines the number of genetic markers that can be obtained for a given species, and ultimately the success of a project. In this study we tested the hypothesis that genome composition, in terms of GC content and mono-, di- and trinucleotide compositions, can be used to predict the number of restriction sites for a given combination of restriction enzyme and genome. We performed systematic in silico genome-wide surveys of restriction sites across the eukaryotic tree of life and compared them with expectations generated from stochastic models based on genome compositions, using the newly developed software pipeline PredRAD (https://github.com/phrh/PredRAD). Our analyses reveal that in most cases the trinucleotide genome composition model is the best predictor, and the GC content and mononucleotide models are the worst predictors, of the expected number of restriction sites in a eukaryotic genome. However, we argue that the predictability of restriction site frequencies in eukaryotic genomes needs to be treated on a case-specific basis, because the phylogenetic position of the taxon of interest and the specific recognition sequence of the selected restriction enzyme are the most decisive factors.
The results from this study, and the software developed, will help guide the design of any study using RAD sequencing and related methods.
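The comparison described in the abstract, observed restriction-site counts against expectations derived from genome composition, can be illustrated with a minimal Python sketch. This is not the PredRAD implementation. The function names are hypothetical, and the sketch assumes a single forward strand, an unambiguous recognition sequence written in uppercase A, C, G and T, and a first-order Markov chain as one plausible reading of a "dinucleotide composition" model.

```python
from collections import Counter


def count_sites(genome: str, site: str) -> int:
    """Observed count: occurrences of `site` in `genome` (overlaps allowed)."""
    count = 0
    start = genome.find(site)
    while start != -1:
        count += 1
        start = genome.find(site, start + 1)
    return count


def expected_sites_gc(genome: str, site: str) -> float:
    """GC-content model: P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2."""
    n = len(genome)
    gc = (genome.count("G") + genome.count("C")) / n
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in site:  # assumes `site` contains only A, C, G, T
        prob *= p[base]
    return prob * (n - len(site) + 1)  # probability times number of windows


def expected_sites_mono(genome: str, site: str) -> float:
    """Mononucleotide model: each base drawn independently at its own frequency."""
    n = len(genome)
    freq = Counter(genome)
    prob = 1.0
    for base in site:
        prob *= freq[base] / n
    return prob * (n - len(site) + 1)


def expected_sites_di(genome: str, site: str) -> float:
    """Dinucleotide model, sketched here as a first-order Markov chain (assumption)."""
    n = len(genome)
    mono = Counter(genome)
    di = Counter(genome[i:i + 2] for i in range(n - 1))
    prob = mono[site[0]] / n  # probability of the first base
    for prev, nxt in zip(site, site[1:]):
        outgoing = sum(di[prev + x] for x in "ACGT")
        if outgoing == 0:
            return 0.0  # `prev` is never followed by a base in this genome
        prob *= di[prev + nxt] / outgoing  # P(nxt | prev)
    return prob * (n - len(site) + 1)
```

On a toy sequence such as `"ACGT" * 25` with the hypothetical recognition sequence `ACGT`, the composition-only models expect far fewer sites than are observed, while the Markov sketch comes close. That is the kind of gap between models the abstract refers to. For palindromic recognition sequences, which are typical of restriction enzymes, the forward-strand count equals the reverse-strand count, so scanning one strand is sufficient.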