Abstract
In non-model organisms, evolutionary questions are frequently addressed using reduced representation sequencing techniques because of their relatively low cost and ease of use, and because they do not require genomic resources such as a reference genome. However, evidence is accumulating that many such techniques may be affected by specific biases, calling into question the accuracy of the obtained genotypes and, as a consequence, their usefulness in evolutionary studies. Here we introduce three strategies to assess genotyping error rates in such data: through comparison with high-quality genotypes obtained with a different technique, from independent replicates of some samples, or from a population sample when assuming Hardy-Weinberg equilibrium. Applying these strategies to data obtained with Restriction site Associated DNA sequencing (RAD-seq), arguably the most popular reduced representation sequencing technique, revealed per-allele genotyping error rates that were much higher than sequencing error rates, particularly at heterozygous sites that were wrongly inferred as homozygous. As we exemplify through the inference of genome-wide and local ancestry of well-characterized hybrids of two widespread and intensively studied Eurasian poplar (Populus) species, such high error rates may easily lead to wrong biological conclusions. By properly accounting for these error rates in downstream analyses, either through the incorporation of genotyping errors directly or by recalibrating genotype likelihoods, we were nevertheless able to use the RAD-seq data to support biologically meaningful and robust inferences of ancestry among Populus hybrids.
Introduction
Despite the impressive advancements in sequencing techniques and the decrease of related costs, whole genome sequencing (WGS) remains prohibitively expensive when working with a large number of samples or species with large genomes. Since many applications do not require information on the whole genome, reduced representation sequencing techniques are valuable alternatives and have become widely used for genome-wide SNP discovery and genotyping, especially in species with poor genomic resources (Narum et al. 2013; Andrews et al. 2016).
A commonly used reduced representation sequencing technique is Restriction site Associated DNA sequencing (RAD-seq; Miller et al. 2007; Baird et al. 2008), which allows the sequencing of massively multiplexed samples at minimal cost by focusing on the sequences adjacent to restriction sites. Since restriction sites are often shared between individuals within a species and often also between closely related species (Cariou et al. 2013), focusing on adjacent sequences ensures that sequenced loci mostly overlap across samples.
Briefly, the first step in the original RAD-seq protocol is the digestion of genomic DNA with a restriction enzyme. The resulting fragments are ligated to an adaptor and a unique barcode for each sample, and multiple individuals are pooled. The fragments are then sheared using a sonicator and those showing the proper size are selected and amplified through polymerase chain reaction (PCR). At this point the library is suitable for sequencing. By focusing the sequencing effort on tagged restriction sites, rather than on all randomly sheared genomic fragments (Rowe et al. 2011; Arnold et al. 2013), the number of markers can be customized through the choice of restriction enzymes. This choice will also influence which features of the genome are sampled, since certain enzymes preferentially cut in exonic regions, while others target intergenic and intronic regions (Arnold et al. 2013; Pootakham et al. 2016).
Several alternative RAD-seq protocols allow for ample customization of this methodology. These include the elimination of the sonication step (Andrews et al. 2016), ddRAD (Peterson et al. 2012), which uses two restriction enzymes rather than one, 2bRAD (Wang et al. 2012), which uses IIb-type restriction enzymes and produces fragments of 36 bp, and ezRAD (Toonen et al. 2013), in which DNA is digested with isoschizomers (a pair of restriction enzymes recognizing the same sequence). However, each method has different advantages and pitfalls, and a specific protocol may be more suitable for some applications than for others (Puritz et al. 2014). Due to this versatility, RAD-seq has been used in diverse applications, including the study of the genomics of adaptation (Andrews et al. 2016), hybridization and speciation (Marques et al. 2016), inbreeding depression (Hoffman et al. 2014), genetic associations (Nadeau et al. 2014), genetic mapping (Chutimanitsakun et al. 2011) and phylogeographic and phylogenomic analyses (Emerson et al. 2010; Leaché et al. 2015).
Despite this widespread use, genotypes called from RAD-seq data have been associated with several biases, many of which are specific to RAD-seq. Several major biases potentially affect alleles differently, which may lead to their unequal representation in sequencing data, and hence to genotyping errors at heterozygous sites (Davey et al. 2013). For instance, polymorphisms occurring in the restriction site may result in one allele not being cut and therefore not sequenced, potentially causing linked sites to be erroneously called homozygous (“allele dropout”). Polymorphisms at neighbouring restriction sites may also result in genotyping biases, for example, if the length of the fragment of one allele falls short of the selected size range (Andrews et al. 2016). Yet size differences among longer fragments were also found to result in unequal sequencing depth at linked sites because sonicators shear shorter fragments less efficiently than longer fragments (Sambrook & Russell 2006). Finally, the PCR step present in most RAD-seq protocols may contribute to genotyping errors through unequal amplification of the two alleles (Casbon et al. 2011) or through so-called PCR duplicates, the sequencing of misleading clonal copies of the same initial molecule (Andrews & Luikart 2014). Since many protocols produce single-end libraries or libraries where both ends are defined by restriction sites (e.g. ddRAD), PCR duplicates cannot be reliably identified bioinformatically unless very many different adapter sequences are used (Schweyen et al. 2014). This is particularly problematic in the case of PCR errors that might be sequenced in many copies, resulting in wrongly called heterozygous genotypes (Andrews et al. 2016).
Some consequences of these biases in downstream analyses are well documented. Gautier et al. (2013), for instance, found that under certain circumstances allele dropout leads to incorrect estimates of genetic diversity. Arnold et al. (2013) demonstrated through both simulations and real data that estimates of summary statistics commonly used to infer diversity and past demography from RAD data are severely affected by missing haplotypes and may show strong deviations from true values. Cariou et al. (2016), finally, illustrated that allele dropout can lead to underestimation of diversity, especially in highly polymorphic species.
In light of these potentially common issues, our aims here were to develop strategies (1) to estimate genotyping error rates in RAD data, and (2) to properly incorporate the resulting genotyping uncertainty in downstream analyses to mitigate the consequences of errors. To this end, we present methods to estimate genotyping errors in RAD-seq data in three different ways: first, by taking advantage of available genotyping data based on a different, more reliable method (e.g. a chip or high-depth sequencing); second, by using independent RAD-seq replicates of individuals; and third, by assuming Hardy-Weinberg proportions among population samples. Using simulations, we show that all these methods are powerful in inferring error rates even if limited samples are available. We then applied these methods to RAD-seq data of the two widespread, genetically and ecologically divergent tree species Populus alba (White poplar) and P. tremula (European aspen) and inferred high genotyping error rates of several percent. By properly accounting for genotyping uncertainty, however, we obtained biologically meaningful estimates of genome-wide and local ancestry.
Materials and Methods
Estimation of genotyping error rates
Let us denote by gil the observed genotype of individual i = 1, …, I at locus l = 1, …, L, where gil = 0, 1, 2 reflects the number of copies of the alternative allele at a bi-allelic locus. Given per-allele genotyping error rates ε0 and ε1 at homozygous and heterozygous sites, respectively, the probabilities P(gil|ε0, ε1) of observing genotype gil are given in Table 1. We next present three strategies to estimate the genotyping error rates ε0 and ε1 from called genotypic data.
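Table 1 is not reproduced in this section. As an illustration only, the following minimal sketch assumes that each of the two alleles of a genotype is miscalled independently, with rate ε0 at truly homozygous sites and ε1 at truly heterozygous sites. The function name `call_probs` and this exact parameterization are assumptions made for the example, not the authors' implementation in Tiger.

```python
def call_probs(true_g, eps0, eps1):
    """Return [P(g=0), P(g=1), P(g=2)] given the true genotype true_g.

    ASSUMPTION (Table 1 is not shown here): the two alleles are miscalled
    independently, with per-allele rate eps0 at truly homozygous sites and
    eps1 at truly heterozygous sites.
    """
    if true_g == 1:
        # Exactly one allele is miscalled, towards one specific homozygote.
        hom = eps1 * (1 - eps1)
        return [hom, 1 - 2 * hom, hom]
    # True homozygote: zero, one or two alleles are miscalled.
    probs = [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2]
    return probs if true_g == 0 else probs[::-1]


for true_g in (0, 1, 2):
    print(true_g, call_probs(true_g, eps0=0.01, eps1=0.1))
```

Under this assumed model, ε1 = 0.5 is the least informative value for heterozygous sites, since a true heterozygote is then called heterozygous only half of the time.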
From a Truth Set
Consider a set of accurate genotypes γil obtained independently for a common set of individuals and loci. Assuming all γil to be correct and genotyping errors to be independent between sites and individuals, the likelihood of the observed genotypes g = {g11, …, gI1, …, gIL} is then given by

$$P(g \mid \gamma, \varepsilon_0, \varepsilon_1) = \prod_{i=1}^{I} \prod_{l=1}^{L} P(g_{il} \mid \gamma_{il}, \varepsilon_0, \varepsilon_1),$$

where P(gil|γil, ε0, ε1) is given in Table 1 and γ = {γ11, …, γI1, …, γIL}. We obtain maximum likelihood estimates of ε0 and ε1 through numerical maximization (see Supplementary Methods).
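The numerical maximization itself is described in the Supplementary Methods and is not shown here. As a rough sketch of the idea, the following example maximizes the truth-set likelihood over a simple grid. It assumes the independent per-allele error model (an assumption standing in for Table 1), and the names `call_prob`, `log_lik` and `grid_mle`, the grid search, and the toy counts are all hypothetical.

```python
import math


def call_prob(g, true_g, eps0, eps1):
    """P(observed g | true genotype), ASSUMING independent per-allele errors."""
    if true_g == 1:
        hom = eps1 * (1 - eps1)
        return [hom, 1 - 2 * hom, hom][g]
    probs = [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2]
    return probs[g] if true_g == 0 else probs[2 - g]


def log_lik(pairs, eps0, eps1):
    """Log-likelihood of (observed, truth) genotype pairs."""
    return sum(math.log(call_prob(g, t, eps0, eps1)) for g, t in pairs)


def grid_mle(pairs, steps=200):
    """Grid search on (0, 0.5); a stand-in for a proper numerical optimizer."""
    grid = [0.5 * (k + 0.5) / steps for k in range(steps)]
    hom = [(g, t) for g, t in pairs if t != 1]
    het = [(g, t) for g, t in pairs if t == 1]
    # The likelihood factorizes: eps0 only enters at truly homozygous sites,
    # eps1 only at truly heterozygous sites, so each is maximized separately.
    eps0 = max(grid, key=lambda e: log_lik(hom, e, 0.25))
    eps1 = max(grid, key=lambda e: log_lik(het, 0.25, e))
    return eps0, eps1


# Toy data as (observed call, truth-set genotype) pairs; counts are made up.
pairs = [(0, 0)] * 95 + [(1, 0)] * 5 + [(1, 1)] * 60 + [(0, 1)] * 25 + [(2, 1)] * 15
eps0_hat, eps1_hat = grid_mle(pairs)
print(eps0_hat, eps1_hat)
```

In this toy data set, 5 of 200 alleles at truly homozygous sites are wrong and 40% of true heterozygotes are called homozygous, so the estimates should land near ε0 = 0.025 and ε1 ≈ 0.28 under the assumed model.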
From Individual Replicates
Consider a set of individuals for which multiple independent sequencing experiments were conducted. Let us denote by $g_{il}^{(j)}$ the inferred genotype of individual i = 1, …, I at locus l = 1, …, L in replicate j = 1, …, ri. The likelihood of the full data g is then given by

$$P(g \mid \varepsilon_0, \varepsilon_1, f) = \prod_{i=1}^{I} \prod_{l=1}^{L} \sum_{\gamma=0}^{2} P(\gamma \mid f_{i\gamma}) \prod_{j=1}^{r_i} P(g_{il}^{(j)} \mid \gamma, \varepsilon_0, \varepsilon_1),$$

where γ denotes the unobserved true genotype, P(γ|fiγ) = fiγ denotes the frequency of genotype γ among all loci of individual i, and the probability of each replicate call given γ, ε0 and ε1 is given in Table 1. We obtain maximum likelihood estimates of the parameters ε0, ε1 and f = {f10, …, f12, …, fI2} with an EM algorithm as detailed in the Supplementary Methods.
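The EM algorithm is detailed in the Supplementary Methods, which are not part of this section. The sketch below illustrates one way such an EM could look for a single individual, assuming the independent per-allele error model in place of Table 1. The function names, starting values, iteration count and simulated data are all hypothetical choices for the example.

```python
import random
from collections import Counter


def call_probs(true_g, eps0, eps1):
    """[P(g=0), P(g=1), P(g=2)] given the truth; ASSUMED per-allele model."""
    if true_g == 1:
        hom = eps1 * (1 - eps1)
        return [hom, 1 - 2 * hom, hom]
    probs = [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2]
    return probs if true_g == 0 else probs[::-1]


def em_replicates(loci, n_iter=500):
    """EM for one individual; loci is a list of replicate-call sequences."""
    patterns = Counter(tuple(calls) for calls in loci)  # collapse identical loci
    n_loci = sum(patterns.values())
    eps0, eps1, f = 0.05, 0.1, [1 / 3] * 3  # arbitrary starting values
    for _ in range(n_iter):
        table = [call_probs(t, eps0, eps1) for t in range(3)]
        n_true = [0.0, 0.0, 0.0]
        wrong_alleles = hom_truth_calls = het_truth_calls = hom_at_het = 0.0
        for calls, n in patterns.items():
            # E-step: posterior weight of each true genotype at this pattern.
            w = []
            for t in range(3):
                p = f[t]
                for g in calls:
                    p *= table[t][g]
                w.append(p)
            total = sum(w)
            w = [n * x / total for x in w]
            for t in range(3):
                n_true[t] += w[t]
            for g in calls:
                wrong_alleles += w[0] * g + w[2] * (2 - g)
                hom_truth_calls += w[0] + w[2]
                het_truth_calls += w[1]
                if g != 1:
                    hom_at_het += w[1]
        # M-step: closed-form updates under the assumed error model.
        f = [x / n_loci for x in n_true]
        eps0 = wrong_alleles / (2 * hom_truth_calls)
        p_hom = min(hom_at_het / het_truth_calls, 0.5)  # equals 2*eps1*(1-eps1)
        eps1 = (1 - (1 - 2 * p_hom) ** 0.5) / 2
        eps0 = min(max(eps0, 1e-6), 0.5)  # keep all probabilities positive
        eps1 = min(max(eps1, 1e-6), 0.5)
    return eps0, eps1, f


# Simulate three replicates at 20,000 loci of one individual (made-up values).
random.seed(1)
TRUE_F, TRUE_EPS0, TRUE_EPS1 = [0.5, 0.3, 0.2], 0.02, 0.2
loci = []
for _ in range(20000):
    t = random.choices(range(3), weights=TRUE_F)[0]
    loci.append(random.choices(range(3), weights=call_probs(t, TRUE_EPS0, TRUE_EPS1), k=3))
eps0_hat, eps1_hat, f_hat = em_replicates(loci)
print(eps0_hat, eps1_hat, f_hat)
```

Collapsing loci into counts of identical replicate patterns keeps each iteration cheap, since at most 27 distinct patterns exist for three replicates.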
From Population Samples
Consider a set of individuals i = 1, …, I sampled from a random mating population such that the distribution of the true genotypes at loci l = 1, …, L is well described by Hardy-Weinberg proportions. While the allele frequencies fl are unknown, let us assume that they follow a Beta distribution with parameters α, β such that fl ~ Beta(α, β), as is expected under neutrality (Wright 1931). Conditional on the allele frequencies f = {f1, …, fL}, the likelihood of the full data is then given by

$$P(g \mid f, \varepsilon_0, \varepsilon_1) = \prod_{l=1}^{L} \prod_{i=1}^{I} \sum_{\gamma=0}^{2} P(\gamma \mid f_l) \, P(g_{il} \mid \gamma, \varepsilon_0, \varepsilon_1),$$

where the sum runs over the unknown true genotype γ, P(γ|fl) are the Hardy-Weinberg proportions and P(gil|γ, ε0, ε1) is given in Table 1. To obtain estimates under this model we resort to an MCMC approach under a Bayesian scheme (see Supplementary Methods) with exponential priors ε0, ε1 ~ Exp(λ) truncated at 0.5 and normal priors log(α), log(β) ~ N(μ, σ²). We used λ = 5, μ = log(0.5) and σ² = 0.25 throughout.
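To make the structure of this likelihood concrete, the following sketch evaluates it for given allele frequencies and error rates. It covers only the likelihood term; the priors and the MCMC sampler are not shown. The error model is again an assumed stand-in for Table 1, and the function names and toy data are hypothetical.

```python
import math


def call_probs(true_g, eps0, eps1):
    """[P(g=0), P(g=1), P(g=2)] given the truth; ASSUMED per-allele model."""
    if true_g == 1:
        hom = eps1 * (1 - eps1)
        return [hom, 1 - 2 * hom, hom]
    probs = [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2]
    return probs if true_g == 0 else probs[::-1]


def marginal_call_probs(freq, eps0, eps1):
    """P(g | f, eps0, eps1): sum over the unknown true genotype."""
    hwe = [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2]
    table = [call_probs(t, eps0, eps1) for t in range(3)]
    return [sum(hwe[t] * table[t][g] for t in range(3)) for g in range(3)]


def hwe_log_lik(genotypes, freqs, eps0, eps1):
    """genotypes[l][i] is the call of individual i at locus l."""
    total = 0.0
    for calls, freq in zip(genotypes, freqs):
        marginal = marginal_call_probs(freq, eps0, eps1)
        total += sum(math.log(marginal[g]) for g in calls)
    return total


# Two loci, four individuals each; values are made up for illustration.
genotypes = [[0, 0, 1, 0], [1, 2, 1, 0]]
print(hwe_log_lik(genotypes, freqs=[0.1, 0.5], eps0=0.01, eps1=0.2))
```

Because errors at heterozygous sites mostly produce homozygous calls, a large ε1 shifts the marginal call probabilities away from Hardy-Weinberg proportions, which is the signal this strategy exploits.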
Error rate classes
All above methods are readily extended to jointly infer error rates for multiple classes, such as bins of sequencing depth or groups of samples if libraries were prepared in multiple experiments. Inferring the error rates of all classes jointly is beneficial in the case of individual replicates or population samples, as information about hierarchical parameters such as individual genotype frequencies fiγ or the parameters α, β of the Beta distribution are shared across classes. Here, this allows us to infer error rates of multiple bins of sequencing depth.
Recalibrating genotype likelihoods
We recalibrate genotype likelihoods by treating obtained genotype calls gil as data and determining the likelihoods P(gil|γil, ε0, ε1) for all γil = 0, 1, 2 according to Table 1 and using parameter estimates ε0 and ε1 obtained for the relevant error rate class. If a truth set is available, we also calculate the empirical likelihoods P(gil|γil = g) across all loci with γil = g of a particular error rate class.
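Both variants of this recalibration can be sketched in a few lines. The example below assumes the independent per-allele error model in place of Table 1 and uses hypothetical function names and toy counts; in practice the calculation would be repeated separately for each error rate class.

```python
from collections import Counter


def call_probs(true_g, eps0, eps1):
    """[P(g=0), P(g=1), P(g=2)] given the truth; ASSUMED per-allele model."""
    if true_g == 1:
        hom = eps1 * (1 - eps1)
        return [hom, 1 - 2 * hom, hom]
    probs = [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2]
    return probs if true_g == 0 else probs[::-1]


def recalibrated_likelihoods(call, eps0, eps1):
    """Likelihood of the observed call for each possible true genotype 0, 1, 2."""
    return [call_probs(t, eps0, eps1)[call] for t in (0, 1, 2)]


def empirical_likelihoods(pairs):
    """Estimate P(call | truth) as relative frequencies in a truth set.

    pairs holds (call, truth) tuples from one error rate class.
    """
    counts = Counter(pairs)
    table = {}
    for t in (0, 1, 2):
        n = sum(counts[(g, t)] for g in (0, 1, 2))
        table[t] = [counts[(g, t)] / n if n else float("nan") for g in (0, 1, 2)]
    return table


# A homozygous-reference call remains fairly compatible with a true heterozygote.
print(recalibrated_likelihoods(0, eps0=0.01, eps1=0.3))

pairs = [(0, 1)] * 3 + [(1, 1)] * 6 + [(2, 1)] * 1 + [(0, 0)] * 10
print(empirical_likelihoods(pairs))
```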
Implementation
We implemented all algorithms developed here in the open-source C++ program Tiger (Tools to Incorporate Genotyping ERrors), available through the git repository at https://bitbucket.org/wegmannlab/tiger.
Simulations
We used simulations to assess the power of the methods introduced above to infer genotyping error rates. All simulations were generated directly under the assumed model using routines we implemented in Tiger. Under the truth set or replicate model, we quantified the power separately for homozygous and heterozygous sites. This was not possible under the Hardy-Weinberg model, for which we drew true allele frequencies from a Beta distribution with parameters α = β = 0.7, implying about 29% heterozygous genotypes.
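The figure of about 29% follows from the expected Hardy-Weinberg heterozygosity under a Beta distribution, E[2f(1 − f)] = 2αβ / ((α + β)(α + β + 1)). The short check below uses a hypothetical helper name.

```python
def expected_heterozygosity(alpha, beta):
    """E[2 f (1 - f)] for allele frequencies f ~ Beta(alpha, beta)."""
    s = alpha + beta
    return 2 * alpha * beta / (s * (s + 1))


print(round(expected_heterozygosity(0.7, 0.7), 4))  # prints 0.2917
```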
Application to Populus species
Study system and plant material
We generated RAD-seq data of 139 individuals of the two widespread tree species Populus alba and P. tremula, and their hybrids (P. x canescens) in two sets. The first set consisted of 136 individuals (Supplementary Table S1) grown with minimal interference in a common garden established at the University of Fribourg (Switzerland) and previously genotyped by Lindtke et al. (2014). All these individuals grew from seeds collected from 15 mother trees in a natural hybrid zone in the Parco Lombardo della Valle del Ticino in Northern Italy where individuals of the two species and their hybrids grow side by side (Lindtke et al. 2012; Christe et al. 2016).
The second set consisted of four individuals for which we generated multiple replicates: a hybrid individual (F039_05) also included among the samples of the first set, a second hybrid individual (I373_A) also from the Ticino hybrid zone but grown in a common garden in Salerno, Italy, a pure P. alba individual (J1) from the Jalón river in the Ebro watershed (Northeast of the Iberian Peninsula), and an assumed F1-hybrid tree (BET) from a population in the Tajo river headwaters (Central Iberian Peninsula). The two Iberian individuals were previously genotyped using microsatellites (Macaya-Sanz et al. 2011).
DNA extraction and RAD sequencing
For all samples, DNA was extracted from 15-20 mg of silica-dried leaf material with the Qiagen DNeasy Plant Mini Kit (Valencia, CA). The concentration of DNA was measured with a Qubit 2.0 Fluorometer using the dsDNA HS assay kit (Invitrogen), and its integrity verified with electrophoresis on 1.5% agarose gels (1X TBE). Concentrations were standardized to 20 ng/μl and individual samples were submitted for library preparation and Restriction site Associated DNA sequencing (RAD-seq) to Floragenex (Eugene, OR). There, all extractions of the individuals of the first set, as well as the replicate extractions of F039_05 and I373_A, were processed (together with additional samples prepared in the same way), in five libraries of 95 individuals each. These libraries were prepared according to Floragenex’ standard commercial protocol: genomic DNA was digested with the restriction endonuclease PstI (chosen according to previous studies on these species – Stölting et al. 2013; Christe et al. 2016) and RAD libraries were prepared with a method similar to the one described in Baird et al. (2008). This protocol included 18 PCR cycles, after which DNA fragments ranging from 300 to 500 bp were retained. All five libraries were sequenced in a single run on an Illumina HiSeq2500 instrument, but on individual lanes.
Following the same protocol, an additional library was generated and sequenced by Floragenex in a separate experiment, consisting of two and three replicate extractions of J1 and BET, respectively, as well as extractions from offspring of a controlled cross between them.
Bioinformatic data processing
We assigned reads to individuals or replicates with fastq-multx (ea-utils; Aronesty 2011), allowing one mismatch in the 15 bp including barcode and restriction site. Read quality was checked with FastQC 0.10.1 (Andrews 2010) and low quality bases and reads were removed with condetri v.2.3 (Smeds & Künstner 2011) using default parameters, except for the options -hq (high quality threshold) and -lfrac (maximum acceptable fraction of bases after quality trimming with quality scores lower than the threshold -lq), for which values of 15 and 0.1 were chosen, respectively.
Good quality reads were aligned against the P. tremula mitochondrial reference sequence (Kersten et al. 2016) and against the nuclear reference genome of P. trichocarpa (Ptrichocarpa_210_v3.0; Tuskan et al. 2006) using Bowtie2 2.3.0 (Langmead & Salzberg 2012) with “end-to-end” and “very sensitive” settings. Reads with mapping quality lower than 20 were discarded using samtools 1.3 (Li et al. 2009) and read group information was added with picard tools 1.139 (http://broadinstitute.github.io/picard). We then used the tools RealignerTargetCreator and IndelRealigner of GATK 3.8 (DePristo et al. 2011) to realign around indels, and recalibrated base quality scores for each individual using the method by Kousathanas et al. (2017) implemented in ATLAS (Link et al. 2017) on mitochondrial sequences. This method does not require a priori information on genotypes and instead learns base qualities from haploid regions while integrating over genotype uncertainty. Finally, we called genotypes with UnifiedGenotyper in GATK 3.8 (DePristo et al. 2011).
To then only retain reliable sites for comparison, we filtered resulting variants using vcftools (Danecek et al. 2011) and custom R scripts: first, we removed sites with an average depth across individuals ≥24 (the 98.7% quantile) to exclude potentially paralogous loci. Second, we only kept variants with at most two segregating alleles. Third, we removed indels and variant sites within 5 bp of an indel to avoid Single Nucleotide Variants (SNVs) originating from misalignments.
Truth Set
Genotypes of the 136 individuals grown in the common gardens were previously obtained (Lindtke et al. 2014) with a genotyping-by-sequencing (GBS) protocol very similar to the ddRAD protocol (Peterson et al. 2012) and using the restriction enzymes EcoRI and MseI. Importantly, Lindtke et al. (2014) generated sequencing data also for the 15 mother trees and used sibships in a Bayesian approach to infer genotypes while accounting for familial relationships.
To compare these high quality genotypes to those obtained from our own RAD-seq experiment, loci covered in both studies had to be identified first. Since Lindtke et al. (2014) used an older P. tremula reference, we extracted from this reference windows of 201 bp around each locus in the GBS data set (100 bp on either side). We then mapped these extracted sequences against the P. trichocarpa reference with Bowtie2 2.3.0 (Langmead & Salzberg 2012) with “end-to-end” and “very sensitive” settings and retained only those sequences that mapped uniquely with quality of 20 or more. We then kept all loci overlapping between the two data sets, but removed four loci for which different alternative alleles were called. To ensure high accuracy of the GBS data, we restricted all comparisons to genotypes called with a posterior probability ≥ 99% by Lindtke et al. (2014).
Estimation of genome-wide and interspecific ancestry
We estimated genome-wide (q) and interspecific (Q12) ancestry for the 136 common garden samples using entropy (Gompert et al. 2014), a program that implements a model similar to the admixture model in structure (Pritchard et al. 2000). In contrast to structure, however, entropy can also make use of uncertain genotypes from low depth sequence data by working directly with genotype likelihoods, rather than genotype calls. Here we ran entropy on the raw genotype likelihoods, as well as on genotype likelihoods recalibrated using empirical likelihoods, excluding sites with >50% missing data in both cases. To stratify the estimates and have sufficient observations to estimate these probabilities reliably, we considered five RAD-seq depth classes: 1-3, 4-7, 8-15, 16-31 and ≥32.
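The five depth classes double in width from one class to the next. As a small illustration of how a genotype's depth could be mapped to its class before looking up the matching empirical likelihoods, consider the following sketch; the function and label names are hypothetical.

```python
import bisect

# Lower bounds of the five RAD-seq depth classes used in the text.
LOWER_BOUNDS = [1, 4, 8, 16, 32]
LABELS = ["1-3", "4-7", "8-15", "16-31", ">=32"]


def depth_class(depth):
    """Return the depth class label, or None for genotypes without reads."""
    if depth < 1:
        return None
    return LABELS[bisect.bisect_right(LOWER_BOUNDS, depth) - 1]


for depth in (0, 3, 4, 15, 31, 32, 100):
    print(depth, depth_class(depth))
```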
Inference of locus-specific ancestry
To infer locus-specific ancestry, we ran RASPberry (Wegmann et al. 2011), which implements a Hidden Markov Model (HMM) to explain haplotypes of admixed individuals as a mosaic of provided reference haplotypes for each species. We obtained suitable reference haplotypes by phasing previously characterized pure P. alba and pure P. tremula individuals (51 each) from the Italian, Austrian and Hungarian hybrid zones (Christe et al. 2016) using FastPhase (Scheet & Stephens 2006), building input files with fcGENE (Roshyara & Scholz 2014). For use in RASPberry, individuals in the reference panels were not allowed to have missing data. We thus restricted the comparison to only the SNVs covered in all parental individuals.
To compute HMM transition probabilities in RASPberry, we used a default recombination rate of 5 cM/Mb as estimated by Tuskan et al. (2006) in P. trichocarpa and the estimates of the genome-wide ancestry q for each sample obtained with entropy from the error corrected data. For most other parameters we used previous estimates for P. alba and P. tremula hybrid zones (Christe et al. 2016), but scaled these as proposed by Wegmann et al. (2011) to reflect the size of the reference panel. These include the ancestral population recombination rates (315 and 900 for P. alba and P. tremula, respectively), mutation rates (0.00185 and 0.00349, respectively) and the miscopying rate (0.01). However, we set the time since admixture to five (rather than one) to reflect the different sampling strategy.
To account for genotyping errors, we estimated a per-allele genotyping error ε under the truth-set model with the constraint ε0 = ε1 = ε, and then added this estimate to the miscopying rate and the two mutation rates. Under the RASPberry copying model, these parameters control the rate at which the sample genotypes differ from the reference haplotype from which the sample is copying. That rate thus depends on the reference panel size, but also on genotyping errors.
We called ancestry segments as any stretch on a chromosome within which the posterior probability for a particular ancestry (homozygous P. alba, heterozygous ancestry or homozygous P. tremula) was > 0.5 at all SNVs, and measured its length from the first to the last SNV.
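This segment-calling rule can be sketched as a single pass over the SNVs of one chromosome. The example assumes that per-SNV posterior probabilities for the three ancestry states are already available as plain lists; the function name, the state encoding (0, 1, 2) and the toy values are hypothetical and do not reflect the RASPberry output format.

```python
def call_segments(positions, posteriors, threshold=0.5):
    """Call ancestry segments along one chromosome.

    positions: sorted SNV coordinates.
    posteriors: one [P(state 0), P(state 1), P(state 2)] list per SNV.
    Returns (state, first_pos, last_pos, length) tuples.
    """
    segments = []
    current = None  # [state, first position, last position]
    for pos, probs in zip(positions, posteriors):
        state = max(range(3), key=lambda s: probs[s])
        if probs[state] <= threshold:
            state = None  # no ancestry is confidently supported at this SNV
        if current is not None and state == current[0]:
            current[2] = pos  # extend the running segment
            continue
        if current is not None:
            segments.append((current[0], current[1], current[2], current[2] - current[1]))
        current = [state, pos, pos] if state is not None else None
    if current is not None:
        segments.append((current[0], current[1], current[2], current[2] - current[1]))
    return segments


positions = [100, 200, 300, 400, 500]
posteriors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.4, 0.4, 0.2],
              [0.1, 0.9, 0.0], [0.2, 0.7, 0.1]]
print(call_segments(positions, posteriors))
```

Measuring length from the first to the last SNV means that a segment supported by a single SNV has length zero, and that gaps between segments are not assigned to any ancestry.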
Results
Power to infer genotyping error rates
Simulations suggest that a few thousand loci are sufficient to accurately estimate genotyping error rates even from just a few samples (Figure 1). Due to the extra information provided, the smallest estimation errors were obtained when using a truth set: >90% of all estimates fell within a range from half to two-fold the true value (Q2, e.g., within [0.005, 0.02] for a true error rate of 0.01) if estimated genotypes were compared at 100 truly homozygous and 200 truly heterozygous genotypes for ε0 and ε1, respectively.
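The Q2 criterion is simply the share of estimates lying between half and twice the true value. A minimal sketch, with a hypothetical function name and made-up estimates:

```python
def fraction_within_q2(estimates, true_value):
    """Fraction of estimates within [true_value / 2, 2 * true_value]."""
    lower, upper = true_value / 2, true_value * 2
    return sum(lower <= e <= upper for e in estimates) / len(estimates)


# Made-up estimates for a true error rate of 0.01.
print(fraction_within_q2([0.004, 0.006, 0.01, 0.015, 0.03], true_value=0.01))
```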
Similar accuracy was achieved under the Hardy-Weinberg model as soon as 5,000 sites were used. However, the accuracy is a function of the fraction of truly heterozygous genotypes in the data set, with the accuracy of ε0 being much higher than that of ε1 if far fewer than 50% of all genotypes are heterozygous, and vice versa. Here, we simulated about 29% heterozygous genotypes and thus expect the accuracy of ε1 to be higher if a larger fraction of genotypes were heterozygous.
The lowest accuracy was observed under the replicates model, especially if error rates were low. Using 10⁴ comparisons, for instance, all estimates were within Q2 for a true value of ε0 = ε1 = 0.1, but only slightly above 70% of all estimates for a true value of ε0 = ε1 = 0.01. This is readily explained by the fact that only very limited information about the true genotype is available: if the two replicates differ in their genotype, it is not clear which one is correct. Consequently, the accuracy of inference is much increased if more than two replicates are available per individual (Supplementary Figure S1). Nonetheless, even small error rates can be estimated relatively accurately, as >90% of all estimates for a true value of ε0 = ε1 = 0.01 fell within Q2 as soon as 5·10⁴ or more comparisons were used. Assuming 20% of all considered genotypes to be heterozygous, around 10⁵ sites are required if two pairs of replicates are used.
High genotyping error rates in RAD-seq
We next used our inference methods to quantify genotyping errors from our RAD-seq experiment of 137 individuals of the two widespread tree species Populus alba and P. tremula, and their hybrids (P. x canescens). On average, our experiment resulted in 831,160.66 (sd 153,433.62) reads per sample that passed quality trimming and mapped against the reference genome of P. trichocarpa with mapping quality ≥20. From those, we called 529,305 Single Nucleotide Variants (SNVs), after removing multi-allelic sites, those with excess depth, indels and variant sites around indels. We estimated per-allele genotyping error rates from these SNVs through a comparison with previously published, high-quality genotypes (truth set), and from multiple replicate libraries sequenced for a subset of our samples (replicates).
Truth set
We estimated per-allele genotyping errors by comparing genotype calls from our RAD-seq experiment to those of a previously published GBS dataset (Lindtke et al. 2014) for 136 individuals present in both studies. In total, 7,426 SNVs overlapped between experiments, at which we could use a total of 16,610 genotype comparisons. Of those, only 69.9% matched, with matching rates increasing with RAD-seq depth (Figure 2A). Strikingly, RAD-seq genotypes were much less often heterozygous than GBS genotypes (Figure 2B), especially at low depth. In line with these observations, we inferred per-allele genotyping error rates > 10% whenever RAD-seq depth was ≤ 35 and when using a model assuming a single error rate (ε0 = ε1), driven by an exceptionally high error rate at truly heterozygous sites ε1 (Figure 2C).
These are surprisingly high error rates, particularly when considering that sequencing error rates of Illumina machines are estimated at < 1% (Nielsen et al. 2011). Importantly, the bias towards homozygous genotype calls is not simply explained by low depth. Indeed, RAD-seq still resulted in fewer than half as many heterozygous calls at depths ≥ 40x, which are usually considered more than sufficient for accurate genotype calling (Nielsen et al. 2011). Instead, our results suggest an inherent bias in the RAD-seq data analyzed here.
However, our estimates rely on the assumption that the GBS data reflect true genotypes. This is based on good evidence: First, Lindtke et al. (2014) additionally sequenced the mother trees of all individuals considered here and estimated posterior genotypes using a hierarchical ancestry model that incorporated familial relationships with mothers and among siblings. These updated estimates correlated with the raw maximum likelihood genotype estimates ignoring familial data at 0.985 and differed from those in < 0.02% of all calls. Second, we restricted this comparison to GBS genotypes with a posterior probability ≥ 99%. Third, the fraction of concordant genotype calls between the GBS and RAD-seq data increased with RAD-seq depth. If the mismatches were driven by errors in the GBS data, no such dependence should be observed.
Replicates
We next benefitted from two sets of replicate libraries to estimate per-allele genotyping error rates. The first set consisted of two replicate libraries of each of two individuals (F039_05 and I373_A) sequenced alongside all other samples. Error rates estimated from those data corroborated the conclusion obtained from the comparison with GBS data (Figure 2D): error rates at truly homozygous sites (ε0) were on the order of 1% or less, and those at truly heterozygous sites (ε1) were equal or close to 50%, which is the largest value possible under our model.
In contrast to the estimates obtained in comparison to GBS genotypes, the error rates at truly heterozygous sites (ε1) dropped to about 20% at high depth (≥ 20x). This difference might in part be driven by errors in the GBS data slightly inflating error rate estimates. However, given the high quality of the GBS data, it appears more likely that the error rates from replicates are underestimated. Polymorphisms in restriction cut sites or unequal PCR amplification rates of alleles, for instance, affect replicates systematically, while the statistical inference must assume independence of errors between replicates.
To verify that high error rates are not specific to the RAD-seq run performed on these hybrids, we carried out a second RAD-seq experiment including two and three replicates of a pure P. alba individual (J1) and a putative F1 P. alba × P. tremula hybrid (BET), respectively. This experiment resulted in 694,030.80 (sd 187,957.33) reads per sample that passed quality trimming and mapped against the reference genome of P. trichocarpa with quality ≥20.
Per-allele error rates estimated from these data were indeed lower, with error rates at truly homozygous sites (ε0) on the order of 0.2% and those at truly heterozygous sites (ε1) starting out at 50% and dropping to about 3% at a depth of 25x (Figure 2D). However, these lower errors still point to a particular issue in calling heterozygous genotypes: of all truly heterozygous sites with depth ≥ 25x in our data, > 5% are expected to be called homozygous. (The lower output of this sequencing experiment does not allow us to make reliable statements for higher depths.)
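The step from a per-allele error rate of about 3% to more than 5% homozygous calls can be checked with a short calculation. It assumes that a true heterozygote is called homozygous whenever exactly one of its two alleles is miscalled, which is an assumption about Table 1 that is consistent with the numbers quoted above; the function name is hypothetical.

```python
def prob_het_called_hom(eps1):
    """P(homozygous call | true heterozygote), ASSUMING independent
    per-allele errors: exactly one of the two alleles is miscalled."""
    return 2 * eps1 * (1 - eps1)


print(round(prob_het_called_hom(0.03), 4))  # prints 0.0582
```

At ε1 = 0.5 the same expression gives 0.5, so half of all true heterozygotes would be called homozygous.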
Estimation of genome-wide and interspecific ancestry
We investigated the impact of genotyping errors in our RAD-seq data on the inference of genome-wide ancestry q as well as interspecific ancestry Q12, which reflects the proportion of loci of heterospecific ancestry. We estimated these ancestry components with entropy from 230,805 SNVs, and compared them to estimates from GBS obtained by Lindtke et al. (2014).
The model implemented in entropy accounts for the genotyping uncertainty reflected in the genotype likelihoods. However, the raw genotype likelihoods obtained from our RAD-seq data are misleading: for sites with considerable depth, the RAD-seq genotype likelihoods often suggest almost certainty for wrong genotypes. Of all genotypes wrongly called as homozygous at depth ≥30x (judged by the comparison to GBS data), 90% had a variant quality of 77 or more. (A variant quality of 77 implies that it is more than 5 billion times less likely to observe the obtained data from a heterozygous than homozygous site).
As a result, ancestry estimates differed considerably between the GBS and RAD-seq data sets (Figure 3). Interestingly, the estimates of the genome-wide ancestry q were much less affected by genotyping errors than the estimates of the interspecific ancestry Q12. This is readily explained, however, by the directionality of the most common error, which is to wrongly infer homozygous genotypes at heterozygous sites. We found these errors to result more frequently in a homozygous reference than homozygous alternative call (65.0%), particularly at low depth (84.5% at depth ≤5x, 49.9% at depth ≥20x). But this did not introduce a bias in q towards one of the species since we were using the reference sequence of the outgroup P. trichocarpa. The estimates of Q12, however, are very sensitive to an underestimation of heterozygosity.
To improve these estimates, we propose to directly account for the elevated genotyping error rates in RAD-seq data by adjusting the genotype likelihoods according to the observed genotyping uncertainty. By treating the genotype calls as data, we can determine the probabilities P(g|γ) of observing a RAD-seq genotype call g given the true genotype γ either by using estimates of the per-allele error rates ε0, ε1, or empirically from the comparison to a truth set. Using the latter approach on our data (individually for each depth) resulted in estimates of q and Q12 that were much closer to those obtained by Lindtke et al. (2014; Figure 3).
Some differences between the point estimates of Q12 remain, likely due to differences in the models and their information-sharing among individuals (in Lindtke et al. (2014), information about maternal plants and sibships was part of the model) and the extent to which information in uncertain genotypes was outweighed by hierarchical prior probabilities related to ancestry. Nonetheless, these results demonstrate the importance of accounting for uncertainty in genotyping data since the estimates of interspecific ancestry with and without correction lead to a very different biological interpretation: With correction, the hybrid individuals appear to be mostly early generation hybrids, suggesting meaningful reproductive isolation between the species. Without correction, the large number of individuals with low Q12 but intermediate values of q suggests considerable gene flow between the species, an interpretation at odds with recent work (Macaya-Sanz et al. 2011; Christe et al. 2016, 2017).
Inference of locus-specific ancestry
We next evaluated the impact of genotyping errors on local ancestry inference. Hybrid zones between P. alba and P. tremula are dominated by pure parental individuals and early hybrids (mostly F1), with only a few adult recombinant hybrids (Lindtke et al. 2014; Christe et al. 2016). We chose ten individuals among our samples spanning that spectrum according to q and Q12 values from Lindtke et al. (2014): a putatively pure P. alba individual (F039_01), a putatively pure P. tremula individual (F030_01), two putative backcrosses to P. alba (F020_04 and F032_08), two putative backcrosses to P. tremula (I345_02 and I345_03), two putative F1 hybrids (I373_03 and F030_05) and two putative hybrids of later generations (Fn – F022_03 and F026_05). We then used RASPberry (Wegmann et al. 2011) to infer local ancestry along chromosomes 1 through 5, restricting our inference to the 6,445 SNVs that did not have missing data in the parental reference haplotypes we took from Christe et al. (2016).
In line with these expected simple ancestry make-ups, we inferred many large ancestry blocks often spanning almost entire chromosomes (Figure 4). Surprisingly, however, we inferred most of these blocks to be of homospecific ancestry, and also inferred many short segments, which is difficult to reconcile with the putative ancestries of our samples (Figure 4). As an example, consider the individual I373_03 in Figure 4 that was classified as an F1 hybrid by Lindtke et al. (2014), but for which we inferred homospecific ancestry blocks for both parental species. Such artifacts could arise from reference panels that are too small to properly reflect the haplotypes found in our hybrid individuals, or from large gaps between neighboring SNVs limiting the power of the HMM implemented in RASPberry, but they most likely result from genotyping errors towards homozygous genotypes.
While RASPberry does not account for genotyping uncertainty via genotype likelihoods, the implemented copying-model allows for “mutations”, or differences between the observed genotype of an admixed individual and the reference haplotypes from which it is copying. We thus repeated the inference by adding an average per-allele genotyping error rate of 13.7% (weighted average across depth ≥5x) to the mutation rate parameters to account for the high genotyping error in our RAD-seq data (Figure 4). Accounting for genotyping errors indeed improved our estimates. For the putative backcrosses, for instance, we called fewer segments homozygous for the “wrong” ancestry (11 versus 34) and these covered a smaller fraction of the first five chromosomes (6.1% versus 11.1% of the parts at which ancestry could be called). Similarly, we called a higher fraction of the first five chromosomes to be of heterozygous ancestry (45.0% versus 37.2% of the parts at which ancestry could be called). While these results corroborate the importance of accounting for the true uncertainty in genotypes in downstream analysis, they also illustrate that a method accounting for a uniform error fails to fully mitigate the bias against heterozygous genotypes present in our RAD-seq data sets.
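The averaging step used above can be sketched as follows. This is an illustration only: it assumes that the weights are the numbers of genotype calls observed at each depth, and the per-depth error rates, call counts and baseline mutation rate are made-up values, not those of our data. The variable names are hypothetical and do not correspond to RASPberry parameters.

```python
def weighted_error_rate(error_by_depth, calls_by_depth, min_depth=5):
    """Average per-allele error rate across depths >= min_depth,
    weighted by the number of genotype calls at each depth."""
    weighted_sum = 0.0
    total_calls = 0
    for depth, error_rate in error_by_depth.items():
        if depth >= min_depth:
            n_calls = calls_by_depth[depth]
            weighted_sum += error_rate * n_calls
            total_calls += n_calls
    if total_calls == 0:
        raise ValueError("no genotype calls at or above the minimum depth")
    return weighted_sum / total_calls

# Hypothetical per-depth error rates and numbers of calls.
error_by_depth = {3: 0.30, 5: 0.20, 10: 0.10}
calls_by_depth = {3: 100, 5: 100, 10: 300}

avg_error = weighted_error_rate(error_by_depth, calls_by_depth)

# The average error rate is then added to a (hypothetical) baseline
# mutation rate of the copying model.
base_mutation_rate = 0.001
effective_mutation_rate = base_mutation_rate + avg_error
print(avg_error, effective_mutation_rate)
```

Because a single rate is applied to all genotypes, this adjustment cannot represent the asymmetry between errors at homozygous and heterozygous sites.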
Discussion
Here, we report high genotyping error rates in two independent RAD-seq data sets. We obtained these estimates by comparing RAD-seq calls of two independent experiments, either to published genotype calls for the same individuals (Lindtke et al. 2014) obtained with a different sequencing method (GBS), or to calls obtained from independent replicates of the same individuals. Both approaches provide evidence for high per-allele genotyping errors of several percent and show that RAD-seq has a strong bias towards calling homozygous genotypes at heterozygous sites that is not overcome with higher sequencing depth.
Only a few studies have reported estimates of genotyping errors of reduced representation techniques to date, but all agree with the high estimates obtained here. Luca et al. (2011), for instance, compared genotypes of human samples obtained with a technique similar to RAD to those available in a public database, and estimated that between 6.3 and 9.7% of heterozygous sites were called as homozygous. Similarly, Mastretta-Yanes et al. (2015) found that, depending on the parameter settings chosen for the de-novo assembly, between 5.9 and 8.8% of alleles were not concordantly called between replicates.
Several factors could explain the high genotyping error and the lack of heterozygous genotypes in RAD-seq data. For example, one allele might not have been sequenced or sequenced only at very low depth because of differences in fragment length that can lead to amplification bias, less efficient shearing, or loss in size selection. This is a likely explanation, since Davey and colleagues (2013) showed that there is a high correlation between read depth and fragment length. Similarly, differential efficiency of PCR among alleles could have masked one allele (i.e., PCR duplicates), causing it to be represented in a very low number of reads (Schweyen et al. 2014). Finally, the well-known issue of allele dropout due to polymorphisms in the restriction site and the “loss” of one allele at heterozygous sites may have contributed to the inaccurate, low observed heterozygosity (Davey et al. 2013; Puritz et al. 2014; Andrews et al. 2016). However, this problem is a less likely explanation as it should not affect RAD-seq at a higher rate than the double-digest GBS protocol we used for the error estimation.
Several bioinformatic solutions have been suggested to mitigate the apparent biases in RAD-seq. Both Arnold et al. (2013) and Gautier et al. (2013) recommend comparing read depth across sites to identify loci likely exhibiting allele dropout. In our case, however, depth varied substantially across sites, because of PCR duplicates or stochastic events, rendering such an approach difficult. Davey et al. (2013) also noted that alleles present in two copies at homozygous sites have higher depth than alleles present in a single copy at heterozygous loci, but that the read depths of the two sets of alleles overlap, which prevents the accurate detection of loci with allele dropout from depth alone. To improve upon this, Cooke et al. (2016) developed a method to infer the likelihood of observing allele dropout at a site on the basis of the coverage of each sample, and suggested ignoring sites where this likelihood is high. Finally, it has also been suggested to discard any locus with a missing genotype, since this might indicate a polymorphism in the restriction site. In many studies with moderate depth, including ours, the amount of missing data prevents the adoption of such drastic solutions. In summary, all these filtering suggestions result in a massive reduction in usable loci, and hence further accentuate the already limited genome-wide coverage of reduced library techniques such as RAD.
As a model-based alternative, we propose here to properly account for the high genotyping errors in downstream analysis. A first such attempt was recently proposed by Cariou et al. (2016), who developed an Approximate Bayesian Computation (ABC) method to estimate genetic diversity while accounting for allele dropout, but found this method not to be accurate under elevated levels of diversity. A more general solution, we believe, is to make use of the large number of recently developed tools that do not require genotype calls but rather work directly from genotype likelihoods to account for uncertainty in the data (Fumagalli et al. 2014; Korneliussen et al. 2014; Kousathanas et al. 2017; Jørsboe et al. 2017). Using such tools minimizes the necessity to filter data stringently and is readily applied to low-depth data (Nielsen et al. 2011).
However, for such methods to work properly, the genotype likelihoods need to accurately reflect the uncertainty in genotypes. While all modern genotype callers also calculate genotype likelihoods, these do not reflect biases specific to individual sequencing protocols such as RAD-seq, as we illustrate here, and must thus be recalibrated. Here we propose two recalibration strategies: If accurate genotype calls are available for a subset of the individuals and markers, empirical genotype likelihoods can be obtained by comparing those to calls from a reduced representation sequencing experiment. Alternatively, sample replicates may be used to infer per-allele genotyping error rates, from which recalibrated genotype likelihoods are readily calculated. Tools for both of these strategies are available through the software Tiger, which also accounts for sequencing depth as an additional covariate. While we found sequencing depth to be a particularly important predictor, the model is also readily extended to additional covariates such as the raw genotype likelihood or genotype call, which might provide additional information about genotyping error rates.
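How recalibrated likelihoods may follow from per-allele error rates can be illustrated with a deliberately simple sketch. It assumes biallelic diploid genotypes coded 0, 1, 2, that each of the two alleles is miscalled independently, and that one per-allele error rate applies at truly homozygous genotypes and another at truly heterozygous genotypes. This is an assumption made for illustration, not necessarily the model implemented in Tiger, and all names and values are hypothetical.

```python
import numpy as np

def call_probs_from_allele_errors(eps_hom, eps_het):
    """Return a 3x3 matrix with P(g | gamma) in row gamma, column g,
    assuming each allele is miscalled independently (a simplification)."""
    # True homozygote: zero, one or two alleles flipped.
    hom = [(1 - eps_hom) ** 2, 2 * eps_hom * (1 - eps_hom), eps_hom ** 2]
    # True heterozygote: exactly one allele flipped yields a homozygous call;
    # flipping none or both yields a heterozygous call.
    one_flip = eps_het * (1 - eps_het)
    het = [one_flip, (1 - eps_het) ** 2 + eps_het ** 2, one_flip]
    return np.array([hom, het, hom[::-1]])

def recalibrated_likelihoods(call, call_probs):
    """Likelihood of each true genotype gamma given the observed call g."""
    return call_probs[:, call]

# Hypothetical error rates: low at homozygous, high at heterozygous sites.
probs = call_probs_from_allele_errors(eps_hom=0.01, eps_het=0.2)
lik = recalibrated_likelihoods(0, probs)
print(lik / lik.sum())  # normalized for readability only
```

With these illustrative values, a homozygous reference call retains a sizeable likelihood for a heterozygous true genotype, reflecting the bias towards homozygous calls.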
However, both strategies might be biased. Genotyping errors in a set of genotype calls considered to be accurate (the truth set), for instance, will result in an overestimation of genotyping error rates. Consistent biases in genotype calls affecting replicates similarly, on the other hand, might be difficult to infer and result in an underestimation of genotyping error rates. Despite these caveats, recalibrated genotype likelihoods likely reflect genotype uncertainty much more accurately than uncalibrated ones, particularly for protocols with error rates as high as those we report here for RAD-seq. Indeed, we found here that genotype recalibration was essential to avoid drawing inaccurate conclusions and to recover biologically meaningful results about the ancestry of Populus hybrids.
We also note that if no tools accepting genotype likelihoods as input information are available for specific applications, this should not discourage users from incorporating genotyping uncertainty in the analyses. We have shown here that local ancestries were more reliably estimated by RASPberry (Wegmann et al. 2011), a tool requiring genotype calls, when adding the estimated genotyping error rate to the parameters of the model. But we note that given the particular lack of heterozygous genotypes in the RAD-seq data analyzed here, a model using a single per-genotype error rate as implemented in RASPberry was not sufficient to overcome all biases.
In conclusion, and in line with others (Mastretta-Yanes et al. 2015; Cooke et al. 2016; Cariou et al. 2016), we strongly suggest carefully assessing genotyping error rates in reduced representation sequencing experiments and properly accounting for them in downstream analyses, for instance using the tools we present here through Tiger. For this purpose, we recommend either sequencing a subset of individuals and markers at much higher quality, or including sufficient replicates, from which genotyping error rates can be inferred. Knowledge of these error rates then makes it possible to properly account for genotyping errors in downstream analyses, rather than losing a large amount of information due to stringent filtering. However, with ever-dropping costs for sequencing and library preparation, low-depth whole-genome sequencing may become a valuable alternative in many applications. Indeed, simulations have shown that low-depth data spanning a larger fraction of the genome yield accurate and precise estimates of population genetics parameters (e.g. Buerkle & Gompert 2013; Kousathanas et al. 2017; Rustagi et al. 2017). The problem of high error rates in RAD-seq or other reduced representation libraries is thus likely transient and we expect that the field will quickly adopt new sequencing technologies that circumvent it entirely.
Data Accessibility
The code of Tiger is available through a git repository at https://bitbucket.org/wegmannlab/tiger. The RAD-seq data are available on the Sequence Read Archive through bioprojects PRJNA528699 and PRJNA528706. The called genotypes used to estimate genotyping errors are available at Zenodo (DOI 10.5281/zenodo.2604109 and 10.5281/zenodo.2604124).
Author Contributions
LB and DW conceived the study; CL and DW provided funding; LB collected genetic data; LB, CAB, VL and DW performed the analyses; CAB, CL and DW supervised the study; LB and DW wrote the manuscript with input and revisions from all co-authors.
Acknowledgements
We thank Kai N. Stölting, Camille Christe and Margot Paris for helpful insights and discussions, Thelma Barbará for help in the laboratory, David Frey for collecting seeds in the hybrid zone, and Santiago González Martínez for providing the Spanish samples. This work was supported by grant 31003A_149306 of the Swiss National Foundation to CL. All calculations were done on Vital-IT and the Bioinformatic Core Facilities of the Universities of Bern and Fribourg, and computers at the University of Wyoming.