Abstract
Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in complex quantitative traits. Data cleaning is more difficult in multiparent populations (experimental crosses derived from more than two founder strains), as individual SNP markers have incomplete information. We describe our process for cleaning SNP microarray genotype data from Diversity Outbred (DO) mice, illustrated with genotype data from the MegaMUGA array on a set of 291 DO mice. We first consider the proportion of missing genotypes in each mouse, as an indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to check mouse sex. We use the proportion of matching SNP genotypes in pairs of mice to detect sample duplicates. We use a hidden Markov model to calculate genotype probabilities across the genome for each mouse, which we use to estimate the number of crossovers in each mouse, and to identify potential genotyping errors. The estimated genotyping error rate, by mouse sample and by SNP marker, is a useful diagnostic. We also consider the SNP genotype frequencies by mouse and by marker, with markers grouped according to the minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele intensities from the microarray, colored by both observed and predicted genotypes, can be revealing about the underlying cause. But the detection of low-quality samples is more important than the detection of low-quality markers.
Introduction
Data cleaning is a critical first step in analyses to identify the quantitative trait loci (QTL) that contribute to a complex quantitative trait. Despite its importance, little has been written about this part of the process. Lincoln and Lander (1992) discussed the detection of genotyping errors in genetic map construction. Broman and Sen (2009, Ch. 3) discussed data cleaning more broadly, in biparental crosses. Morgan (2015), in presenting the R package argyle for quality control for SNP genotyping arrays, emphasized consideration of the amount of missing genotypes, frequency of heterozygotes, and the distribution of array intensities.
Multiparent populations (experimental crosses derived from more than two founder strains) have become a popular design for complex trait genetics in experimental organisms. Examples include heterogeneous stock (Mott et al. 2000; Mott and Flint 2002), MAGIC lines (Cavanagh et al. 2008; Kover et al. 2009), the Collaborative Cross (Churchill et al. 2004), and Diversity Outbred mice (Churchill et al. 2012; Svenson et al. 2012). Data cleaning is more difficult in multiparent populations, as individual SNP markers have incomplete information.
A key principle in data cleaning is to think about what might have gone wrong, and how it might be revealed in the data. Visualize the data in many ways, and if you see something weird, try to figure out what is going on.
We illustrate our process for cleaning genotype data using MegaMUGA SNP array data (Morgan et al. 2016) on 291 Diversity Outbred (DO) mice.
Material and Methods
Mice and genotypes
We consider genotype data for 291 Diversity Outbred (DO) mice, including 99 mice from generation 8 and 192 mice from generation 11. These are a subset of the mice considered in Gatti etal. (2017).
The mice were genotyped using the MegaMUGA SNP array (Morgan et al. 2016), which includes 77,808 markers. The genotyping was performed at Neogen (Lincoln, NE). Genotype calls using pairs of nucleotides A, C, G, and T were converted to genotypes AA, AB, BB, with A denoting the allele that was most frequent among the eight founder strains (assigned arbitrarily when the two alleles were equally frequent).
Statistical methods
Analyses were conducted in R (R Core Team 2018) and with R/qtl2 (https://kbroman.org/qtl2). Most of the statistical analyses involves visualization of summary statistics.
To reconstruct the genomes of the DO mice, we use a hidden Markov model (see Broman and Sen 2009, App. D). The transition probabilities for DO mice were taken from Broman (2012b), which uses the results of Broman (2012a). We used the genetic map from Cox et al. (2009), assumed a genotyping error rate of 0.2%, and used the Carter-Falconer map function (Carter and Falconer 1951). The HMM provides a probability for each possible 36-state diplotype at each marker for each mouse. (We use the term diplotype to refer to a pair of founder haplotypes.)
To estimate the number of crossovers in each individual, we inferred the 36-state diplotype at each locus as the state with maximum marginal probability, provided that it was > 0.5. (In cases where all diplotypes had probability < 0.5, the diplotype state was treated as missing.) We then calculated the minimum number of crossovers consistent with these inferred diplotypes.
To identify potential genotyping errors, we calculated the genotyping error LOD scores of Lincoln and Lander (1992), by first converting the 36-state diplotype probabilities to 3-state SNP genotype probabilities, using the SNP genotypes in the eight founder strains. We then used the observed SNP genotype to calculate the error LOD score as in equation 1b of Lincoln and Lander (1992).
To obtain predicted SNP genotypes for each DO mouse, we collapsed the predicted 36-state diplotypes to 3-state SNP genotypes using the SNP genotypes in the founders.
Data and software availability
The raw genotype data are available at FigShare (https://doi.org/10.6084/m9.figshare.7359542.v1). They are also available in the R/qtl2 input format at https://github.com/rqtl/qtl2data.
The SNP genotypes for the founder strains are at FigShare, https://doi.org/10.6084/m9.figshare.5404750.v2. We used annotations for the MegaMUGA genotyping array from https://github.com/kbroman/MUGAarrays.
R/qtl2 is available at https://kbroman.org/qtl2 and at GitHub, https://github.com/rqtl/qtl2. The custom R scripts used for our analyses and to create the figures are at GitHub (https://github.com/kbroman/Paper_MPPdiag).
Results
We consider data from a population of 291 Diversity Outbred (DO) mice (Gatti et al. 2017), including 99 mice from generation 8 and 192 mice from generation 11. There are 150 females and 141 males. The mice were genotyped with the MegaMUGA array, which includes 77,808 markers, but we focus on the 69,339 markers that are polymorphic among the eight founder strains.
Sample diagnostics
Missing data
Our first step in genotype data diagnostics is to look at the proportion of missing genotypes in each mouse, as this is a key indicator of sample quality. Samples with a high proportion of missing genotypes are likely low quality.
The percent missing genotypes by mouse are displayed in Figure 1. The bulk of mice are missing very little data, but there are 24 mice with >2% missing genotypes, including nine mice with ≥20% missing. The mice with high rates of missing genotypes will show up as outliers for many of the diagnostics below.
Verify sex
To verify the sexes of the mice, we could look at the proportion of heterozygous genotype calls on the X chromosome. However, with array-based genotypes, it is most informative to look at the probe intensities for SNPs on the X and Y chromosomes.
In Figure 2, we display a scatterplot of the average intensity for the 30 Y chromosome markers vs the average intensity for 2,058 X chromosome markers. Each point is a mouse, with the males in purple and the females in green. We omitted eighteen markers on the X chromosome that did not show a clear sex difference in allele intensities.
There is a distinct cluster of male mice in the upper-left (low X chromosome intensity and high Y chromosome intensity) and a cluster of female mice in the lower-right (high X chromosome intensity and low Y chromosome intensity). However, there is one male mouse, M377, who appears within the female cluster in the lower right, and so is likely really female. Also, there is one female mouse, F386, in the lower-left, with reduced X chromosome intensity. This is likely an XO female. Other outliers in Figure 2 are mice with high rates of missing genotypes, including the nine mice with ≥20% missing genotypes, which are labeled in orange.
For sex inference, the SNP probe intensities give better separation of the two sexes than heterozygosity on the X chromosome (Figure S1) and they enable us to distinguish between males and XO females. We could also, potentially, identify XXY males, who would have high average intensities on both sex chromosomes.
Sample duplicates
We look for potential sample duplicates by calculating the proportion of matching SNP genotypes for each pair of samples. The nine samples with ≥20% missing genotypes look quite different from others, and also show some chance similarities (Supplementary Figure S2A). If we omit those samples (Supplementary Figure S2B), we find that the bulk of pairs share around 50% of genotypes (shifted slightly below 50%; the median is 46.6%). A small proportion of pairs (179 pairs, or 0.4%) share a bit more, at around 67.1%.
Two pairs have almost perfectly matching SNP genotypes. Mice M283 and M292 have matching genotypes at all except one of the 69,025 markers at which they were both genotyped. Mice M377 and F409 have matching genotypes at all except 36 of the 68,291 markers at which they were both genotyped. Note that the second pair includes M377 which was seen in Figure 2 to have mislabelled sex.
These two pairs are clear duplicates. We are looking for a separation between the normal sharing between mice and the duplicates, and (having excluded samples with ≥20% missing genotypes), these are the only pairs with >76% matching genotypes. These unintended sample duplicates provide an estimate of the genotyping error rate, which looks to be well under 1/10,000.
We will omit one mouse from each pair, for subsequent analyses. For the M377/F409 pair, it seems clear that the sample corresponds to F409, since it is a female. But for M283/M292, we can not tell which is the correct label. For later QTL analyses we would likely wish to omit both samples, but here we will just omit the second one, M292.
If the data included genome-scale phenotypes with strong genetic signals, such as gene expression data, we would at this point look further for possible sample mix-ups (Westra et al. 2011; Broman et al. 2015), but we will not do so here.
SNP probe intensities
The distribution of probe intensities on the genotyping array can be a useful indicator of problem samples. In Figure 3A, we display density estimates of the array intensities for each of the 289 samples, after a log10(x + 1) transformation. We highlight the arrays corresponding to samples with appreciable missing genotypes. There are nine samples with ≥20% missing data, shown in orange; their array intensities are shifted to the left and have a long right tail. There are three samples with ~9% missing data (in pink); they show a much broader distribution of array intensities. There are eight samples with 2.5–5% missing genotypes (in blue); many of these have a spike of intensities near 0.
In Figure 3B, we display a scatterplot of the 1st and 99th percentiles of the log array intensities, in an attempt to summarize the pattern seen in the densities in Figure 3A. Most of the samples with high rates of missing genotypes are outliers. We label one sample (F326) which has ≥20% missing genotypes but has an array intensity distribution that is not as extreme as the other samples with ≥20% missing genotypes.
SNP genotype frequencies
The SNP genotype frequency distribution can be a useful diagnostic. For example, sample contamination can lead to excess heterozygotes. In DO mice, we split the SNP markers into four groups, based on the minor allele frequency (MAF) in the eight founder strains. To do so, we consider only the 68,357 with complete founder genotypes.
To display the genotype frequencies, we use a ternary plot, which makes use of the fact that for any point within an equilateral triangle, the sum of the distances to the three sides is constant. And so the trinomial SNP genotype frequencies for an individual may be represented by a point in the triangle, with the distance to the lower edge being the frequency of heterozygotes, and the distances to the left and right edges being the frequencies of the two homozygotes.
As seen in Figure 4, the DO mice form tight clusters with very similar SNP genotype frequencies, except for a few samples that show high heterozygosity, which are among the samples with high rates of missing genotype data. Another potential outlier is F313, which shows somewhat reduced heterozygosity in SNPs with MAF = 3/8 or 1/2.
Counts of crossovers
Another important diagnostic is to estimate the number of crossovers, genome-wide, in each mouse. Problem samples may exhibit excessive crossovers. In some cases, a sample may exhibit fewer crossovers than expected.
To estimate crossovers, we first reconstruct the diplotypes (that is, the pair of haplotypes) across the genome for each of the DO mice. We use a hidden Markov model (HMM) to calculate the probability of each of the 36 possible diplotypes (8 homozygotes plus 28 heterozygotes) at each genotyped SNP, given the multipoint SNP genotype data, with allowance for genotyping errors (Supplementary Figure S3). Figure S3A shows the inferred diplotypes across the genome for a single DO mouse, and Figure S3B shows the detailed diplotype probabilities for that mouse, along one chromosome.
There are a number of different methods for estimating the number of crossovers in a DO genome. We are using the simplest: at each genotyped SNP, pick the most probable diplotype (provided that it has probability at least 50%) and then calculate the minimal number of crossovers that are consistent with that set of diplotypes.
The estimated numbers of crossovers for each DO mouse is shown in Figure 5, with points colored according to their generation. The mice in generation 8 have an average of 304 crossovers, while those in generation 11 have an average of 357 crossovers.
We have excluded the nine mice with ≥ 20% missing genotypes. They all show > 500 crossovers, and the mice with > 50% missing genotypes show > 2000 crossovers. This is further evidence that these samples should be omitted.
There are no other apparent outliers, in the numbers of crossovers. But note the importance of taking account of generation number.
Genotyping error rates
The diplotype reconstructions offer the opportunity to identify likely SNP genotyping errors. To identify likely errors, we calculate the genotyping error LOD scores described by Lincoln and Lander (1992). For each SNP marker in each DO mouse, we use the founder strains’ SNP genotypes to collapse our 36-state diplotype probabilities to three-state SNP genotype probabilities. Considered in conjunction with the observed SNP genotype data, we calculate a LOD score statistic that indicates the likelihood that an individual SNP genotype is in error.
We generally focus on error LOD scores >2, which is a reasonably conservative threshold on potential errors. The estimated error rate in each mouse, with this criterion, is shown in Figure 6. The nine mice with ≥ 20% missing genotypes all have estimated genotyping error rates > 1%. Three mice have error rates in the range 0.5–1.0%, and these all showed ~9% missing genotypes. The next-highest estimated error rate is 0.4% for mouse M398, which had shown about 5% missing genotypes. The error rates for the other mice are extremely small. The median rate is just 7.8 in 10,000.
In summary, the nine mice with ≥ 20% missing genotypes also showed excessive crossovers and high rates of apparent genotyping errors. We will omit these from further analyses. Three mice with 5–10% missing genotypes showed slightly elevated genotyping error rates but no excess crossovers. We would not recommend omitting them.
Marker diagnostics
We now turn to the markers, to identify poorly behaved ones. As with the samples, we can look at the percent missing data, estimated rates of genotyping errors, and the genotype frequencies. Ultimately, we want to look at scatterplots of the allele intensities for the SNPs, which are most informative of problems, but with 77,808 markers, we cannot inspect all of them, and so we use the other measures to help narrow our search. Throughout these analyses, we will focus on the 280 mice with < 20% missing genotypes.
Missing data and genotyping errors
We start by studying the amount of missing data at each marker, as well as estimated genotyping error rates (based on genotyping error LOD scores > 2). A scatterplot of estimated error rate vs. percent missing data is shown in Figure 7.
The vast majority of markers have virtually no missing data: Of the 69,339 informative markers, 41,931 have no missing data, and 64,003 are missing < 2%. However, as seen in Figure 7, there are a number of markers with appreciable missing data: 1,216 are missing > 10%.
Similarly, the vast majority of markers have virtually no genotypes with error LOD score > 2, including 66,623 markers with no apparent errors. But 624 markers have estimated error rates > 2%, including 190 markers with estimated error rates > 10%. The thirteen markers with error rates > 50% are highlighted in Figure 7.
While it may not be apparent in Figure 7, there is a reasonably strong relationship between missing data and error rate: Markers with larger amounts of missing data tend to have higher genotyping error rates. For example, the mean error rate for markers with < 2% missing data is 3 per 10,000, while the mean error rate for markers with ≥ 20% missing data is 5%, about 190 times higher.
SNP genotype frequencies
The genotype frequencies at the markers are displayed in Figure 8, with markers split according to their minor allele frequency (MAF) among the 8 founder strains. The majority of markers conform reasonably well to our expectation. The most striking departure is that there are 22 markers with MAF=1/8 in the founders but where the minor allele in the DO mice is > 40%. The majority of these markers (19/22) are from a region on chromosome 2, 70 – 105 Mbp, where the WSB allele exhibits meiotic drive (Didion et al. 2016).
There are also a number of SNPs with MAF = 3/8 or 4/8 in the founder strains that have reduced frequency of heterozygotes. For example, there are 46 SNPs with MAF = 4/8 in the founders but heterozygosity < 0.25. Most of these SNPs (37) have ≥ 10% missing data or ≥ 10% genotyping errors, or both. Of the remaining nine markers, all but one are on chromosome 2, in the region with high WSB allele frequency.
SNP allele intensities
The most important diagnostic for SNP quality is a scatterplot of the allele intensities. This is particularly informative when colored by both the observed genotype and by the predicted genotype giving the multipoint SNP information. The SNP allele intensities for a set of four SNPs are displayed in Figure 9. Each point is a single DO mouse; in the left panels, the points are colored by the observed genotype, with yellow and blue corresponding to the two homozygotes, green the heterozygote, and gray being missing. In the right panels, the points are colored by the predicted genotypes given the multipoint SNP information. Figures 9A and 9B correspond to a well-behaved SNP. The three genotype groups form tight, well-separated clusters, and the observed and predicted genotypes match. (Additional examples of well-behaved markers are shown in Supplementary Figure S4.)
Figures 9C and 9D correspond to a SNP where there is an additional cluster of genotypes, and where the genotype calling algorithm assigned it to the wrong genotype. At this particular SNP, there is a cluster of genotypes that were called homozygotes for the major allele but that are really heterozygotes. This is an example of a variable intensity oligonucleotide (VINO; Didion et al. 2012), where there is an additional SNP within the microarray probe that leads to reduced intensity of one allele. For this particular SNP, the founder strains 129Sv/ImJ and PWK/PhJ appear to have null alleles for the array probe (data not shown). Additional examples of this type of SNP are shown in Supplementary Figure S5.
Figures 9E and 9F correspond to a SNP where the genotype calling algorithm made a clear mistake. There are three well-defined clusters of genotypes, but one of the homozygotes got called as a heterozygote: the cluster is green in Figure 9E but yellow (and gray, for missing) in Figure 9F. Figures 9G and 9H correspond to another example of a poorly called SNP, where two of the genotype clusters are not well separated, and the genotype calling algorithm made a mistake in identifying the clusters. Supplementary Figures S6 and S7 show additional examples of mistakes in the genotype calling, either because the genotype clusters are arranged horizontally or are not well separated. Figure S8 contains additional examples of particularly ugly SNPs.
Effects of data cleaning
Identifying problem samples (mix-ups, or low quality samples) is arguably the most important aspect of the data cleaning effort. With high density SNP data, the presence of a small number of poorly behaved SNPs should have little influence on the results, as the hidden Markov model to reconstruct the DO genomes allows for the presence of genotyping errors and so should smooth over the problem markers. But it is worth checking: if we omit the most poorly behaved markers, how much will the diplotype probabilities change?
Of the 69,339 informative markers, we omitted 325 markers with estimated genotyping error rate > 5%, using a threshold of 2 on the genotyping error LOD score to define a presumed error. We then re-calculated the 36-state diplotype probabilities for all mice.
For each mouse at each marker, we take the sum of the absolute differences between the diplotype probabilities, before and after data cleaning, as a measure of change. This takes values between 0 and 2, with 2 indicating that a complete different set of diplotypes have positive probabilities, after the data cleaning.
There are few changes in the diplotype probabilities. Of the 69,339 markers × 280 individuals, there are just 6,048 sites where the sum of the absolute differences was >1, and just 1,798 where it was >1.5. To illustrate the types of changes that are seen, Supplementary Figure S9 shows the diplotype probabilities, before and after omitting 325 bad markers, for three individuals on chromosome 9. In each case, an apparent recombinant segment gets removed.
Discussion
We have described our approach to cleaning genotype data from Diversity Outbred (DO) mice. We first look at the amount of missing data by individual, and then verify individuals’ sex and look for sample duplicates. We then look for bad samples (identified by the amount of missing data, the SNP genotype frequencies, the estimated number of crossovers, and the estimated proportion of genotyping errors) and for bad markers (again looking at the amount of missing data, genotyping frequencies, and the estimated proportion of genotyping errors).
In the data we used as an illustration, with 291 DO mice from generations 8 and 11, there were a set of nine samples with ≥ 20% missing genotypes. These also showed excessive crossovers and high genotyping error rates and should be omitted from any analyses. There were also two apparent sample duplicates (one being a male/female pair), and one apparent XO female. Finally, there were four samples with higher-than-normal amounts of missing data and genotyping errors, but these samples looked okay otherwise and probably do not need to be omitted.
The vast majority of markers appeared well behaved, but we also found a number of markers with a high proportion of apparent genotyping errors. Omitting these markers had small and relatively isolated effects on the diplotype probabilities. Our approach for identifying problem markers relied on our diplotype reconstructions, and so we may be missing some badly behaved SNPs, and the SNPs we miss may be the ones with the greatest influence on results. But if the overall SNP density is high, and the proportion of badly behaved SNPs is low, our approach should provide reasonable results.
Verifying sex and identifying sample duplicates are two steps towards identifying sample mix-ups. If there are phenotypes with strong genetic effects, such as coat color, they may be useful to identify further mix-ups. Particularly useful in this regard are genome-scale phenotype data such as gene expression data, whether by microarrays or RNA-seq, which can perfectly identify individuals (Westra et al. 2011). With gene expression data on multiple tissues, identified mix-ups can potentially be corrected (Broman et al. 2015).
We have not discussed the problem of cleaning phenotype data, but this is also important. We would focus on data visualizations, including histograms, plots of traits by time of measurement and/or by mouse identifiers, and scatterplots of traits against one another. These plots may reveal typographical errors in the data or inconsistencies in measurement units. They may also reveal important batch effects or other covariates that should be taken account in analyses, and may indicate the need for phenotype transformations (such as logarithm or square-root).
We used R (R Core Team 2018) and R/qtl2 (https://kbroman.org/qtl2) throughout this work. Another important R package for genotype diagnostics is argyle (Morgan 2015), which provides a variety of diagnostics for SNP genotyping arrays.
We focused on relatively simple data visualizations to diagnose potential problems, and that is the central tool for data cleaning. Make lots of graphs, focusing on graphs that will reveal anticipated problems but also following up on anything that looks strange: Is it a problem with the data, a problem with the sample, or a quirk of biology? Is it ignorable or fixable? What effect might it have on later results? We find interactive data visualizations, such as with R/qtlcharts (Broman 2015), useful in these efforts, particularly for identifying outlier samples in scatterplots.
While we have focused on DO mice, our approach could be applied more generally, to other multiparent populations. The key summary statistics are the proportion of missing genotypes, the average microarray probe intensities for SNPs on the X and Y chromosomes, the proportion of heterozygous SNPs, the estimated number of crossovers, and the estimated genotyping error rate.
Acknowledgments
This work was supported in part by National Institutes of Health grant R01GM070683 (to K.W.B. and G.A.C.).