Abstract
Short tandem repeat (STR) mutations may be responsible for more than half of the mutations in eukaryotic coding DNA, yet STR variation is rarely examined as a contributor to complex traits. We assess the scope of this contribution across a collection of 96 strains of Arabidopsis thaliana by massively parallel STR genotyping. 95% of examined STRs are polymorphic, and the median STR has six alleles. Modest STR expansions are found in most strains, some of which have evident functional effects. For instance, three of six intronic STR expansions are associated with intron retention. We infer selective constraint on STRs, and find the strongest signatures of purifying selection on coding STRs. Lastly, we detect dozens of novel STR-phenotype associations that could not be detected with SNPs, and validate some experimentally. Our results demonstrate that STRs comprise a large unascertained reservoir of functionally relevant genomic variation.
Introduction
Mutation rates vary within a genome by several orders of magnitude1, from ~10−8-10−9 for substitutions to 10−3-10−4 on average for short tandem repeats (STRs)2–4. STR mutations occur through addition or subtraction of repeat units. Given the prevalence of STR loci in eukaryotic genomes, we would expect more de novo STR mutations than single nucleotide substitutions in the human genome per generation4, even in coding regions5. Thus, while the overall mutation rate is under strong control from natural selection, some loci experience many more mutations than others6. The existence of such highly mutable loci contradicts the commonly-used infinite sites model7, which assumes that no locus mutates more than once in a population; it further poses complications for quantitative genetic models that assume contributions from many independent loci 8.
In spite of the large effect that STRs can have on complex traits and diseases in model organisms and humans9–11, their variation is largely not considered in genotype-phenotype association studies due to technical obstacles. However, STR genotyping methods of sufficient accuracy, throughput, and cost-effectiveness to ascertain STR alleles at high throughput have recently become available12–14. Studies leveraging these methods suggest considerable contributions of STRs to heritable phenotypic variation12,15.
From an evolutionary perspective, the high mutation rate and clear phenotypic effects of STRs are speculated to provide readily accessible evolutionary paths for rapid adaptation16–18. On larger time scales, the presence of STRs with high mutation rates in coding regions appears to be maintained by selection19–21. These observations independently argue for important functional roles of STRs.
In the present study, we apply massively parallel STR genotyping to a diverse panel of well-characterized A. thaliana strains. We use these data to generate and test hypotheses about the functional effects of STR variation, combining observations of gene disruption by STR expansion, inferences about STR conservation, and phenotypic association analyses with follow-up experiments. Based on our results, we argue that STRs must be included in any comprehensive account of phenotypically relevant genomic variation.
Results
STR genotyping reveals complex allele frequency spectra
We targeted 2,050 STR loci for genotyping with molecular inversion probes (MIPs)12 across a core collection of 96 A. thaliana strains (Methods). These loci were all less than 200bp in size and had nucleotide purity of at least 90%, encompassing nearly all gene-associated STRs and ~40% of intergenic STRs (Fig. 1a). We used comparisons with the Col-0 reference genome, PCR analysis of selected STRs, and dideoxy sequencing to estimate that MIP STR genotype calls were ~95% accurate, and inaccurate calls were generally only one to two units away from the correct copy number (Supplementary Note, Supplementary Fig. 1-4, Supplementary Table 1). 95% of STRs were polymorphic. Most STRs were highly multiallelic (Fig. 1b; mean=6.4 alleles, median=6 alleles), and this variation was mostly unascertained by the 1001 Genomes resource for A. thaliana (Supplementary Fig. 3a). Coding STRs were only slightly less polymorphic (mean=4.5 alleles, median=4 alleles), though whether this difference is due to selection or to mutation rate variation is unclear. 45% of STRs had a major allele with frequency < 0.5, confusing the familiar concepts of major and minor alleles, which have provided a common framework for detecting genotype associations (Fig. 1c). Specifically, the Col-0 reference strain carries the major STR allele at only 48% of STR loci. Moreover, rarefaction analysis implied that more STR alleles at these loci are expected with further sampling of A. thaliana strains (Supplementary Fig. 4c).
Principal component analysis of STR variation revealed genetic structure corresponding to Eurasian geography (Fig. 1d, Supplementary Fig. 5), consistent with previous observations that population structure is associated with geography in A. thaliana22,23. This result, which agrees with previous observations from a much larger set of genome-wide single nucleotide polymorphism markers, shows that a comparatively small panel of STRs suffices to capture detailed population structure in A. thaliana. Overall, we find that STRs are extremely allele-rich, and their variation reflects expected population structure.
Novel STR expansions are associated with splice disruptions
We next examined the frequency and functional consequences of STR expansions in A. thaliana. STR expansions, extreme high-copy-number variants of comparatively short STRs, are widely recognized as contributing to human disease24 and other phenotypes25. While large (>150 bp) expansions are difficult to infer, we detected modest STR expansions with a simple heuristic comparing the longest allele observed at each locus to the median allele (Fig. 2a). We found expansions in 64 of 96 A. thaliana strains, each carrying at least one expanded STR allele from one of 28 expansion-prone STRs (9 coding, 6 intronic, 8 UTR, 5 intergenic). Most expansions were found in multiple strains (Fig. 2b), although expansion frequency was likely underestimated due to a higher rate of missing data at these loci. STR expansions causing human disease can be as small as 25 copies24, whereas we detected expansions up to ~50 copies.
We assayed the effects of STR expansions on expression of associated genes. The most dramatic expansions (with large relative copy number increase) affected an intronic STR in the NTM1 gene (Fig. 2c; five other expansions also resided in introns) and a STR in the 3' UTR of the MEE36 gene (Fig. 2d). These genes, respectively, have roles in cell proliferation and embryonic development. Intronic STR mutations can disrupt splicing, causing altered gene function26 and human disease27. Therefore, we assayed effects on splicing of all six expanded intronic STRs. In three cases, the expanded allele was associated with partial or full retention of its intron (Supplementary Fig. 6, Supplementary Note), including the major NTM1 splice form in the Mr-0 strain (Fig. 2e). We confirmed intron retentions by dideoxy sequencing of cDNA (Supplementary Fig. 7a). Retention of the NTM1 intron is predicted to lead to a nonsense mutation truncating most of the NTM1 protein (Supplementary Fig. 7b). For the other two retentions, more complex and STR allele-specific mRNA species were formed (Supplementary Fig. 6, Supplemental Text). The MEE36 STR expansion alleles were associated with dramatically reduced MEE36 transcript levels (Fig. 2f), possibly due to the STR expansion altering transcript processing28. These examples emphasize the potential for previously unascertained STR variation to modify gene function. Moreover, considering STR allele frequency distributions, as we did here by focusing on outliers, enables predictions about STR functional effects.
Signatures of functional constraint on STR variation
Using our observed STR allele frequency distributions, we next attempted to infer selective processes acting on STRs. Previous models for evaluating functional constraint on STRs are few29, though there is consensus that selection affects STR variation17,19,29,30. Naively, we would expect that coding STRs should show increased constraint (lower variation). Consistent with this expectation, we observed that most invariant STRs are coding (53 of 84 invariant STRs genotyped across at least 70 strains; odds ratio = 4.5, p = 5*10−11, Fisher's Exact Test). However, methods of inferring selection by allele counting are confounded by population structure and by mutation rate, which varies widely across STRs in this (Fig. 3a) and other studies3,31.
Therefore, to account for mutation rate and population structure, we used support vector regression (SVR) to predict STR variability across these 96 strains, using well-established correlates of STR variability (e.g. STR unit number and STR purity; Online Methods)3,32,33. Selection was defined as deviation from expected variation of a neutral STR among strains. We trained SVRs on the set of intergenic STRs, which should experience minimal selection relative to STRs associated with genes (Supplementary Fig. 8-10). We used bootstrap aggregation of SVR models to compute a putative constraint score for each STR by comparing its observed variability to the expected distribution from bootstrapped SVR models (Fig. 3b, Supplementary Note, Supplementary File 1).
According to constraint scores, 132 STRs were less variable than expected, suggesting purifying selection on these loci (Fig. 3c). Among these, 63 STRs were coding (OR = 2.4, p = 3.7*10−6, Fisher's Exact Test); this enrichment of coding STRs among constrained STRs agrees with our naïve analysis of invariant STRs above. Examples of constrained coding STRs included STRs encoding homologous polylysines in three different histone H2B proteins. Generally, coding STRs showing purifying selection encoded roughly half as many polyserines and twice as many acidic homopolymers as expected from the A. thaliana proteome (Supplementary Table 3)34. Although many more coding STRs are probably functionally constrained, our power to detect such constraints is limited by the size of the dataset. We also observed high conservation of some STRs in non-coding regions, though this is less interpretable given the ambiguous relationship between sequence conservation and regulatory function in A. thaliana35. The most constrained intronic STR, in the BIN4 gene, which is required for endoreduplication and normal development36, shows a restricted allele frequency spectrum compared to similar STRs (Fig. 3d).
Hypervariable coding STRs (showing more alleles than expected) were too few for statistical arguments, but nonetheless showed several notable patterns (Supplementary Note, Supplementary Fig. 11). For example, 3/24 hypervariable coding STRs encoded polyserines in F-box proteins, suggesting that STR variation may serve as a mechanism of diversification in this protein family, which shows dramatically increased family size and sequence divergence in some plant lineages37,38. One non-coding hypervariable STR was associated with the Chromomethylase 2 (CMT2) gene, which is under selection in A. thaliana39. Specifically, CMT2 nonsense mutations in some populations are associated with temperature seasonality. We considered whether the extreme CMT2 STR alleles might be associated with these nonsense mutations. Instead, these extreme alleles exclusively occurred in strains with full-length CMT2 (Fig. 3e). Strains with the common CMT2 nonsense mutation form a tight clade in the CMT2 sequence tree, whereas the CMT2 STR length fluctuates rapidly throughout the tree and appears to converge on longer alleles independently in different clades (Fig. 3f). These convergent changes are consistent with a model in which the CMT2 STR is a target of selection.
We further assessed whether STR conservation can be attributed to cis-regulatory function. The abundance of STRs in eukaryotic promoters40, and their associations with gene expression (in cis)1541, have suggested that STRs affect transcription, possibly by altering nucleosome positioning41. We examined whether STRs near transcription start sites (TSSs) showed signatures of functional constraint (Fig. 3g). We found little evidence for reduced STR variation near TSSs, suggesting that cis-regulatory effects do not generally constrain STR variation. Moreover, we found no relationship between putative constraint and whether STRs reside in accessible chromatin sites marking regulatory DNA (Fig. 3h).
STRs yield numerous novel genotype-phenotype associations
We next addressed whether STR genotypes contribute new information for explaining phenotypic variation. It is often assumed that STR phenotype associations should be captured through linkage to common single nucleotide polymorphisms (SNPs)42. This assumption is inconsistent with available data. Human STR data42,43 and simulation studies44 indicate decreased linkage disequilibrium (LD) between STRs and SNPs, making it uncertain whether SNP markers can meaningfully tag STR genotypes. Indeed, we found little evidence of LD between STR and SNPs in A. thaliana (Fig. 4a), Moreover, the observed LD around STR loci declined with increasing STR allele number (Supplementary Fig. 12), consistent with an expected higher mutation rate at multiallelic loci7,44. This result suggests that STR-phenotype associations need to be directly tested rather than relying on linkage to SNPs. A. thaliana offers extensive high-quality phenotype data for our inbred strains, which have been previously used for SNP-based association studies45. We tested each polymorphic STR for associations across the 96 strains with each of 105 published phenotypes. For the subset of 32 strains for which RNA-seq data were available46, we also tested for associations between STR genotypes and expression of genes within 1MB (i.e., eQTLs; Supplementary Note). Although our power to detect eQTLs was limited by sample size, we detected 12 associations. The strongest association was between a STR residing in long noncoding RNA gene AT4G07030 and expression of the nearby stress-responsive gene AtCPLI (Supplementary Fig 13, Supplementary Table 4).
We next focused on organismal phenotypes. Certain STRs showed associations with multiple phenotypes, and flowering time phenotypes were particularly correlated with one another (Fig. 4c,d). Similar to these patterns, SNPs have also shown associations with multiple phenotypes, and correlated flowering time phenotypes are among the strongest associations in the same strains45. As in previous association studies using STRs15, some inflation was apparent in test p-values compared to expectations, though the same tests using permuted STR genotypes showed negligible inflation (Fig. 4b, Supplementary Fig 14). Negligible inflation with permuted genotypes has been used previously to exclude confounding from population structure15, which we will also presume here (Supplementary Note). We found 137 associations between 64 STRs and 25 phenotypes at stringent genome-wide significance levels (Methods; Supplementary Table 5). Given the low LD observed between STRs and other variants, STRs are likely to be the causal variants, rather than merely tagging them. Our analysis found plausible candidate genes, such as COL9, which acts in flowering time pathways and contains a flowering time-associated STR, and RABA4B, which acts in the salicylic acid defense response47 and contains a STR associated with lesion formation.
We evaluated whether these associations might have been found using SNP-based analyses, and whether STR effects are large enough to be meaningful. We found that STR effects on phenotype are largely not accounted for by nearby SNP variation. Considering the strongest association for each STR, only 18 of the 64 STRs were near potentially confounding SNP variants, and most associations (14/18) were robust to adjustment for nearby SNP genotypes45 (Supplementary Table 6, Supplementary Note). One notable exception was an STR closely linked to a well-known deletion of the RPS5 gene in a hypervariable region of chromosome 148 that causes resistance to bacterial infection49. RPS5 status is under balancing selection in A. thaliana48. In this case, the association and the linkage are apparently strong enough (the STR is ~4KB upstream of the deletion) that this STR tags RPS5s effect on infection. The observation of a STR tagging a hypervariable region leads us to speculate that STR variation holds information about genomic regions with complex mutational histories.
To assess the STR contribution to the variance of a specific trait, we performed a naïve variance decomposition of the long-day flowering phenotype into SNP and STR components, as represented by the loci showing associations with this trait. Our results suggested that STRs potentially contribute as much or more variance than SNPs to this phenotype (Supplementary Note, Supplementary Table 7). Estimated effect sizes for STR variants on this phenotype were similar to those of large-effect SNP variants45 (Supplementary Fig 15).
Finally, we used mutant analysis to evaluate the two strongest flowering time associations, a coding STR in AGL65 and an intronic STR in the uncharacterized gene AT4G01390; neither locus had been associated with flowering time phenotypes. We found that disruptions of both STR-associated genes had modest early flowering phenotypes (by ~2 days and ~1 rosette leaf, p < 0.05 for each in linear mixed models; Supplementary Fig 16), supporting the robustness of our STR-phenotype associations. Taken together, our study suggests that STRs contribute substantially to phenotypic variation.
Discussion
Our results imply that STRs contribute substantially to trait heritability in A. thaliana. Far from being “junk DNA”, STRs are apparently constrained by functional requirements, STR variation can disrupt gene function, and is associated with phenotypic variation. Considering that STR variation is represented only poorly through linkage to nearby SNPs, STRs likely contribute to the missing heritability of complex traits.
STRs are particularly relevant to the phenotypic variance due to de novo mutations (Vm). Estimates of Vm from model organisms are on the order of 1%, but may vary substantially from trait to trait50. STRs are good candidates for a substantial proportion of this variance, given their high mutation rate, residence in functional regions, and their functional constraint observed here. In previous work51, we showed apparent copy number conservation of a STR in spite of a high mutation rate. In this case, deviation from the conserved copy number produced aberrant phenotypes. Our observation that constrained STRs are common suggests that STRs are a likely source of deleterious de novo mutations which are removed by selection.
The extent to which STRs affect phenotype is only partially captured in this study. Specifically, we assayed two STRs shown to cause phenotypic variation in prior transgenic A. thaliana studies25,52, but these STR loci did not show strong signatures of phenotypic association or of selection. This lack of ascertainment suggests that many more functionally important STRs exist in A. thaliana than we can detect with the analyses presented here. For example, the polyQ-encoding STR in the ELF3 gene causes dramatic variation in developmental phenotypes52, yet we find no statistical associations between this locus and phenotype across our 96 strains. In this case the lack of phenotype association is expected; ELF3 STR alleles interact epistatically with several loci53 and associations are thus difficult to detect. Indeed, we have argued that STRs are more likely than less mutable classes of genomic variation to exhibit epistasis9. In consequence, we expect that the associations described in the present study are an underestimate of STR effects on phenotype. Moreover, our data are constrained by MIP technology, which limits the size and composition of STR alleles that we can ascertain (Figure 1a, Supplementary Figures 1-2).
Considering next a mechanistic perspective, the association we observe between intronic STR expansions and splice disruptions may be an important mechanism by which STRs contribute to phenotypic variation. In humans, unascertained diversity of splice forms contributes substantially to disease54, and this diversity is larger than commonly appreciated55. We demonstrate that this mechanism is common at least for expansions; future work should evaluate how tolerant introns are to different magnitudes of STR variation, as these effects on protein function may prove to be both large and of relatively high frequency. The phenotypic contributions of loci with high mutation rate remain underappreciated, specifically in cases where such loci are difficult to ascertain with high-throughput sequencing. The results presented here argue that STRs are likely to play a substantial role in phenotypic variation and heritability. Accounting for the heterogeneity of different classes of genomic variation, and specifically variation in mutation rate, will advance our understanding of the genotype-phenotype map and the trajectory of molecular evolution.
Methods
Online Methods
Probe design
We used TRF56 (parameters: matching weight 2, mismatching penalty 5, indel penalty 5, match probability 0.8, indel probability 0.1, score ≥ 40 and maximum period 10) to identify STRs in the TAIR8 build of the Arabidopsis thaliana genome, identifying 7826 putative STR loci under 200bp (Supplementary File 2). We restricted further analysis to the 2409 loci with repeat purity >=89%. We chose 2307 STRs from among these, prioritizing STRs in coding regions, introns, or untranslated regions (UTRs), higher STR unit purity, and expected variability (VARscore32). We designed57 molecular inversion probes (MIPs) targeting these STR loci in 180bp capture regions with 8bp degenerate tags in the common MIP backbone. For this purpose, we converted STR coordinates to the TAIR10 build and used the TAIR10 build as a reference genome. We used single nucleotide variants (SNVs) in 10 diverse Arabidopsis thaliana strains58 to avoid polymorphic sites in designing MIP targeting arms. We filtered out MIPs predicted to behave poorly (MIPGEN logistic score < 0.7), discarded MIPs targeting duplicate regions, and substituted MIPs designed around SNVs as appropriate. We attempted to re-design filtered MIPs with 200bp capture regions using otherwise identical criteria. This yielded a final set of 2050 STR-targeting MIP probes (Supplementary File 3).
MIP and library preparation
These 2050 probes were ordered from Integrated DNA Technologies as desalted DNAs at the 0.2 picomole scale and resuspended in Tris-EDTA pH 8.0 (TE) to a concentration of 2 pM and stored at 4° C. We pooled and diluted probes to a final stock concentration of 1 nM. We phosphorylated probes as described previously59. We performed DNA preparation from whole aerial tissue of adult A. thaliana plants. We prepared MIP libraries essentially as described previously12,59 using 100 ng A. thaliana genomic DNA for each of 96 A. thaliana strains.
Sequencing
We sequenced pooled capture libraries essentially as previously described12 on NextSeq and MiSeq instruments collecting a 250bp forward read sequencing the ligation arm and captured target sequence, an 8bp index read for library demultiplexing, and a 50bp reverse read sequencing the extension arm and degenerate tag for single-molecule deconvolution. In each run, 10% of the sequenced library pool consisted of high-complexity whole-genome library to increase sequence complexity. For statistics and further details of data acquired for each library see Supplementary Tables 8 and 9.
STR annotation
We annotated STRs according to Araport1160, classifying all STRs as coding, intronic, intergenic, or UTR-localized, and indicating whether each STR overlapped with transposable element sequence. To identify regulatory DNA, we used the union of seven distinct DNaseI-seq experiments61 covering pooled or isolated tissue types. For additional details, see the Supplementary Note.
Sequence analysis
Sequences were demultiplexed and output into FASTQ format using BCL2FASTQ v2.17 (Illumina, San Diego). We performed genotype calling essentially as described previously12, with certain modifications (Supplementary Note, Supplementary Table 1). Note that our A. thaliana strains are inbred, and more stringent filters and data processing would be necessary to account for heterozygosity. For information about comparison with the Bur-0 genome, see the Supplementary Note. Updated scripts implementing the MIPSTR analysis pipeline used in this study are available at https://osf.io/mv2at/?view_only=d51e180ac6324d2c92028b2bad1aef67.
Statistical analysis and data processing
We performed all statistical analysis and data exploration using R v3.2.162. For plant experiments, we fit mixed-effects models using Gaussian (flowering) or binomial generalized linear models using experiment and position as random effects and genotype as a fixed effect.
STR expansion inference
We inferred STR expansions where the maximum copy number of an STR is at least three times larger than the median copy number of that STR. Various alleles of STR expansions were inspected manually in BAM files. Selected cases were dideoxy-sequenced and analyzed as described.
Plant material and growth conditions
Plants were grown on Sunshine soil #4 under long days (16h light: 8h dark) at 22°C under cool-white fluorescent light. T-DNA insertion mutants were obtained from ABRC63 (Supplementary Table 11). For flowering time experiments, plants were grown in 36-pot or 72-pot flats; days to flowering (DTF) and rosette leaf number at flowering (RLN) were recorded when inflorescences were 1cm high. Results are combined across at least three experiments.
Gene expression and splicing analysis
We grew bulk seedlings of indicated strains on soil for 10 days, harvested at Zeitgeber time 12 (ZT12), froze samples immediately in liquid nitrogen, and stored samples at -80°C until further processing. We extracted RNA from plant tissue using the SV RNA Isolation kit (including DNase step; Promega, Madison, WI), and subsequently treated it with a second DNAse treatment using the Turbo DNA-free kit (Ambion, Carlsbad, CA). We performed cDNA synthesis on ~500ng RNA for each sample with oligo-dT adaptors using the RevertAid kit (ThermoFisher, Carlsbad, CA). We performed PCR analysis of cDNA with indicated primers (Supplementary Table 10) and ~25 ng cDNA with the following protocol: denaturation at 95° 5 minutes, then 30 cycles of 95° 30 seconds, 55° 30 seconds, 72° 90 seconds, ending with a final extension step for 5 minutes at 72°. We gel-purified and sequenced electrophoretically distinguishable splice variants associated with STR expansions. Each RT-PCR experiment was performed at least twice with different biological replicates.
Population genetic analyses
For PCA, STRs with missing data across the 96 strains were omitted, leaving 987 STRs with allele calls for every strain. was estimated using the approximation 0.5,7 where is the average frequency of all STR alleles at a locus. We computed multiallelic linkage disequilibrium estimates for SNP-SNP and SNP-STR locus pairs using MCLD64. We downloaded array SNP data for the same lines (TAIR9 coordinates) from http://bergelson.uchicago.edu/wpcontent/uploads/2015/04/call_method_75.tar.gz65. For each locus, both SNP and STR, we computed linkage disequilibrium scores with 150 surrounding loci. To facilitate comparison, we computed lowess estimates of linkage only for those locus pairs in the plotted distance window in each case, and only for locus pairs with r2 > 0.05.
Inference of conservation
STRs typed across 70 A. thaliana strains or fewer were dropped from this analysis, as the estimates of their variability were unlikely to be accurate, leaving 1825 STRs. We measured STR variation as the base-10 logarithm of the standard deviation of STR copy number (Supplementary Note). We used bootstrap aggregation (“bagging”) to describe a distribution of predictions as follows. An ensemble of 1000 support vector regression (SVR, fit using the ksvm() function in the kernlab package66) models was used to predict expected neutral variation of each STR as quantified by each measure (Supplementary Note). We used this distribution of bootstrapped predictions for intergenic STRs to compute putative conservation scores (Z-scores) for each STR. Scores below the 2.5% (Z < -3.46) and above the 97.5% (Z > 3.65) quantiles of intergenic STRs were considered to be putatively constrained and hypervariable respectively.
eQTL inference
We downloaded normalized transcriptome data for A. thaliana strains from NCBI GEO GSE8074446. We used the Matrix eQTL package67 to detect associations, fitting also 10 principal components from SNP genotypes to correct for population structure. Following precedent15, we fitted additive models assuming that STR effects on expression would be a function of STR copy number. We used Kruskal-Wallis rank-sum tests to test the null hypothesis of no association following correction.
Genotype-phenotype associations
We downloaded phenotype data from https://github.com/Gregor-Mendel-Institute/atpolydb/blob/master/miscellaneous_data/phenotype_published_raw.tsv. We followed precedent45 in log-transforming certain phenotypes. In all analyses we treated STRs as factorial variables (to avoid linearity assumptions) in a linear mixed-effect model analysis to fit STR allele effects on phenotype as fixed effects while modeling the identity-by-state kinship matrix between strains (computed from SNP data) as a correlation structure for strain random effects on phenotype. We performed this modeling using the lmekin() function from the coxme R package68. We repeated every analysis using permuted STR genotypes as a negative control to evaluate p-value inflation, and discarded traits showing such inflation. We used p < 10−6 as a genome-wide significance threshold commensurate to the size of the A. thaliana genome and the data at hand. For flowering time phenotypes we used a more stringent p < 10−10 threshold, as these phenotypes showed somewhat shifted p-value distributions (which were nonetheless inconsistent with inflation, according to negative controls). We identified potentially confounding SNP associations using the https://gwas.gmi.oeaw.ac.at/#/study/1/phenotypes resource, using the criterion that a SNP association must have a p <= 10−4 and be within roughly 100KB of the STR to be considered. We fit models including SNPs as fixed effects as before, and performed model selection using AICc69. Additional details about association analyses are in the Supplementary Note.
Data and code availability
MIP sequencing data are available in FASTQ format at the Sequence Read Archive, under project number PRJNA388228. Analysis scripts are provided at https://osf.io/5jm2c/?view_only=324129c85b3448a8bd6086263345c7b0, along with data sufficient to reproduce analyses.
Acknowledgments
We thank Keisha Carlson, Alberto Rivera, and members of the Queitsch lab for technical assistance and important conversations. We thank Evan Boyle, Choli Lee, Matthew Snyder, and Jay Shendure for assistance and advice concerning MIP design and use. We thank the Dunham Lab and the Fields Lab for access to and help with sequencing instruments. We thank UW Genome Sciences Information Technology for high-performance computing resources. This work was supported in part by NIH New Innovator Award DP2OD008371 to CQ.