Abstract
Spontaneous mutations are the source of new genetic variation and are thus central to the evolutionary process. In molecular evolution and quantitative genetics, the nature of genetic variation depends critically on the distribution of fitness effects (DFE) of mutations. Spontaneous mutation accumulation (MA) experiments have been the principal approach for investigating the overall rate of occurrence and cumulative effect of mutations, but have not allowed the effects of individual mutations to be studied directly. Here, we crossed MA lines of the green alga Chlamydomonas reinhardtii with its unmutated ancestral strain to create haploid recombinant lines, each carrying an average of 50% of the accumulated mutations in a variety of combinations. With the aid of the genome sequences of the MA lines, we inferred the genotypes of the mutations, assayed their growth rate as a measure of fitness, and inferred the DFE using a novel Bayesian mixture model that allows the effects of individual mutations to be estimated. We infer that the DFE is highly leptokurtic (L-shaped), and that a high proportion of mutations increase fitness in the laboratory environment. The inferred distribution of effects for deleterious mutations is consistent with a strong role for nearly neutral evolution. Specifically, such a distribution predicts that nucleotide variation and genetic variation for quantitative traits will be insensitive to change in the effective population size.
Introduction
Understanding evolution requires an understanding of the origin of new genetic variation from mutation. This includes knowing the rates of mutation at individual loci and the magnitudes of their effects on fitness and other traits. Of particular interest is the distribution of fitness effects for new mutations (the DFE), describing the relative rates of occurrence of mutations with different selective effect sizes. The DFE informs about the frequencies of small-versus large-effect mutations and the frequencies of advantageous versus deleterious mutations, and is therefore of fundamental importance in population and quantitative genetics. For example, the DFE appears in the nearly neutral model of molecular evolution (Ohta 1977), where deleterious mutations are effectively selected against in large populations, but behave as selectively neutral in small populations. Kimura (1979; 1983) showed that if the DFE is strongly leptokurtic (L-shaped) molecular genetic variation at sites subject to natural selection increases slowly with increasing effective population size (Ne), and molecular evolution is potentially constant between species with different effective size. This is therefore broadly consistent with empirical observations. The DFE is also important for predicting selection response for quantitative traits and the nature of quantitative genetic variation (Robertson 1967). For example, the contribution of mutation to response to selection depends critically on the shape of the DFE, the response occurring more quickly with more leptokurtic distributions (Hill 1982). Analogously with the relationship between nucleotide variation and Ne, genetic variation for fitness (or a trait correlated with fitness) is predicted to increase slowly as a function of Ne if the DFE is leptokurtic (Keightley and Hill 1990), and could thus explain why genetic variation for quantitative traits is apparently relatively invariant between species (Potsma 2014).
In the light of its fundamental importance, there has been much previous work aimed at inferring the DFE. Two different approaches have principally been applied for spontaneous mutations occurring in the whole genome (rather than just a single locus): the analysis of nucleotide polymorphism data from a sample of individuals from a population, and spontaneous mutation accumulation (MA) experiments (reviewed by Eyre-Walker and Keightley 2007). Under the former approach (Eyre-Walker et al 2006; Keightley and Eyre-Walker 2007; Boyko et al. 2008; Tataru et al 2017), the site frequency spectra for putatively neutral and selected sites (typically synonymous and nonsynonymous sites of protein-coding genes, respectively) are compared, and parameters of the DFE for the mutations at the selected sites inferred. The approach makes several assumptions, notably that variation at the selected sites is explained by a balance between an input of new deleterious mutations, natural selection and genetic drift, and that selection is absent from the putatively neutral class of sites. It is only capable of inferring the DFE for mutations that stand an appreciable chance of segregating in the sample of individuals from the population, implying that inferences are only relevant to mutations with effects that are not substantially greater than 1/Ne. This can be an extremely small value if Ne is large. Furthermore it can only be applied to specific functional categories of sites in the genome.
In a spontaneous MA experiment, sublines of the same initial genotype are maintained at small effective population size in the near absence of natural selection for many generations, allowing mutations to accumulate effectively at random. The DFE can be estimated using the among-MA line distribution of phenotypic values for traits related to fitness (such as fecundity or viability) (Keightley 1994; García-Dorado 1997; Shaw et al 2002). The information that can be obtained by this approach is extremely limited, however, principally because the numbers of mutations carried by individual lines are not included in the analysis, so an overall genomic rate parameter has to be estimated, and this is highly confounded with the DFE parameters (Keightley 1998; Halligan and Keightley 2009).
Genome sequencing technology now allows the identification of the nearly complete complement of mutations carried by a set of MA lines, and in combination with phenotypic information this can potentially be used to leverage information on the DFE (Katju and Bergthorsson 2018). Previous analysis of spontaneous MA experiments have, however, only studied the cumulative effects of new mutations, whereas accurate inference requires estimation of the effects of individual mutations. For example, we have shown that there is a negative correlation between the number of new mutations carried by a MA line and fitness, but this gives only limited information on the DFE (Kraemer et al. 2017).
Previously, we carried out a spontaneous MA experiment in the single-celled green alga Chlamydomonas reinhardtii for ~1,000 generations, have measured fitness-related traits in a range of environmental conditions (Morgan et al 2014; Kraemer et al 2016; 2017), and have employed genome sequencing to determine the complement of mutations carried by the lines (Ness et al 2015). Here, we have crossed six of these C. reinhardtii MA lines of the CC-2931 genetic background with a compatible ancestor of the same background genotype, but of the opposite mating type. We thereby generated 1,526 recombinant lines (RLs), each carrying an average of 50% of the mutations of the MA line parent in different combinations. We genotyped the RLs at the locations of the known mutations and assayed their growth rate as a measure for fitness. Across the six lines there are nearly 400 unique mutations, so an analysis where each mutation is treated as a fixed effect is not appropriate. Instead, we developed a MCMC approach with a random effects model in which mutation effects are assumed to be sampled from some distribution or a mixture of distributions. We investigate a number of distributions to infer the distribution of effects for the individual mutations on growth rate. We show that the DFE is highly leptokurtic (L-shaped) and that a high proportion of mutations increase fitness in the laboratory environment.
Results
To directly infer the DFE, we crossed six C. reinhardtii MA lines derived from the CC-2931 strain to an ancestral strain of the same genetic background and the opposite mating type to produce a total of 1,526 recombinant lines (RLs) (Table 1, Table S1). We genotyped 386 of the 476 mutations detected in our previous whole-genome sequencing study (Ness et al. 2015) (Table 1, Table S2). Among the 681 different recovered haplotypes, mutations were present at an average frequency of 49.1% (10.3% - 85.4%), which is close to the expected average of 50% (Figure S1). The number of haplotypes obtained for each MA line and their frequencies were quite variable, however (Table S3). For example, we obtained 214 haplotypes for MA line L03, and no haplotype was found more than four times, whereas we obtained only 67 haplotypes for MA line 14 and one of these haplotypes was found 18 times.
Relationship between number of mutations and growth rate
As a measure of fitness, we assayed the maximum growth rate of each RL, the parental MA lines and the unmutated ancestral strain in liquid culture. To determine whether mutations have an overall directional effect on fitness, we used mixed models to test for a relationship between the number of mutations carried by RLs and their ancestors and fitness (Figure 1). In the case of only one of the six MA lines (L03), including number of mutations leads to a significantly better fit (P = 0.0008; Table 2), and the improvement in fit for an analysis of the combined data set of all six MA line crosses is nonsignificant (P = 0.080). This could either mean that there is insufficient power to detect mutational effects, or that there is a mixture of mutations with positive and negative effects on fitness. The latter explanation is supported, because there is a highly significant between-haplotype component of variation for the trait (P < 2.2×10−16 for the whole data set; P between 4.1×10−13 (L14) and 0.022 (L06) for the individual MA lines). We repeated this analysis fitting number of mutations of specific types (SNP, indel, exonic, intronic, intergenic; Table S4). Including the number of mutations gave a significantly better fit in the cases of MA line L03 for all mutation types except intronic, for MA line L11 for intronic mutations and for the whole data for exonic mutations.
Inference of the DFE by MCMC assuming mutation effects fall into discrete categories
The relationship between mutation number and fitness tells us little about the DFE for individual mutations. We therefore developed an approach that allows the DFE be estimated in a Bayesian mixture model (implemented by Markov Chain Monte Carlo, MCMC). This assumes that the effects of mutations either come from a mixture of point masses or a mixture of gamma distributions. To maximize power, we focussed much of the analysis on a merged data set of all six MA line backcrosses.
We first examined whether there is evidence for an overall directional effect of new mutations on growth rate by running the analysis while assuming a two category model with one non-zero effect category (effect = e1, proportion = q1) and one zero-effect category (i.e., e0 = 0, q0 = 1 - q1). The results (Table 3, Figure S2, S3) suggest that there is an appreciable frequency (~4%) of mutations reducing growth rate by ~3%, whereas the majority of mutations are allocated to the zero-effect category.
We then analysed the combined data set assuming a model with three categories of mutational effects (one zero-effect category and two finite effect categories, e1 and e2). As expected, given that the two-category model supports the presence of negative mutational effects, a category of negative effects is inferred (e1, Table 3; Figure S4, S5). This has a similar posterior mode as the two-category model, but the credible interval is somewhat wider, as expected for a more data rich model. There is also support for a class of positive-effect mutations (e2, Table 3), which has a somewhat lower absolute modal estimate than the negative category. The estimated frequency of positive-effect mutations is only slightly lower than that of the negative-effect mutations. Results from the analysis of data from individual MA line crosses (Table S5) are consistent with the presence of a mixture of negative- and positive-effect mutations.
We then analysed data sets in which phenotypic values were permuted within plate. As expected under the null model, the distributions of estimated values of e1 and e2 centre on zero and the estimates of e1 and e2 from the real data are well outside the distributions obtained from permuted data (Figure S6).
Analysis of a model with four categories of effects (one of which is a zero-effect category) also gives negative and positive posterior modes for two classes of mutational effects e1 and e2. However, it is difficult to determine whether there is an additional mutational class e3 that is different from the zero-effect class or e1 and e2 because of the presence of label switching (Jasra et al 2005) between the three classes of effects and their frequencies (data not shown).
Two-sided gamma DFE model
Although informative about the overall directional effects of mutations, models in which mutations fall into discrete categories are unrealistic, because they assume no variance among the effects of mutations within each category. We therefore analyzed the combined data set for the six MA line crosses under a two-sided gamma distribution of effects, which assumes that the effects of mutations are continuously distributed. We assumed the gamma distribution, because it is a flexible two-parameter distribution (α = scale, β = shape) that can take a wide variety of shapes, ranging from a highly leptokurtic, L-shaped distribution (β → 0) to a point mass (β → ∞). We assumed that positive- and negative-effect mutations can have different absolute means, but their distributions have the same shape parameter. The results from the analysis of the combined data set (Table 4; Figure S7) suggest that the DFE is highly leptokurtic (i.e., β is close to 0.3), and that the means for positive-and negative-effect mutations (= β/α) are very small, reflecting the concentration of mutations with effects close to zero. Consistent with the analysis assuming discrete classes of mutations (Table 3), there is a substantial proportion of positive-effect mutations (i.e., ~80%; Table 4). The estimated DFE for the two-sided gamma distribution is shown along with that for the three category point mass DFE in Figure 2.
The credible intervals for the absolute means of negative- and positive-effect mutations do not overlap (Table 4), suggesting that the model with different means is more parsimonious that a model assuming a two-sided gamma distribution with the same means (Table 4; Figure S8). A model with different shape parameters for negative- and positive-effect mutations gives similar estimates for the mean effects and proportion of positive-effect mutations as the model with a single shape parameter, but stable estimates of the shape parameters could not be obtained, suggesting that this model is over-parameterised. We also analysed a more complex mixture model that, in addition to gamma-distributed negative- and positive-effect mutations, incorporated a class of zero-effect mutations. This allows a discontinuous DFE that has a “spike” at zero. However, the modal estimate for the frequency of the zero-effect class was zero and its upper credible value was 1%, indicating that the two-sided gamma distribution captures the leptokurtic shape of the DFE adequately.
Relationships between estimated mutation effects and mutation types
To investigate whether mutations in certain mutation classes (such as exonic/non-exonic) are more or less likely to be associated with fitness, we calculated the effect of each mutation (as the posterior mean) under the two-sided gamma distribution model and then computed the difference between the average squared effects for mutations in mutually exclusive annotation classes. We examined average squared differences, because the additive variance contributed by a mutation is proportional to its squared effect. The results are negative in the sense that there are no statistically significant relationships for any of the mutation types tested (Table 5).
Discussion
In this paper, we integrate information on the fitness of MA lines, ancestral lines and crosses between MA lines and their ancestors with the complement of mutations carried by each line or cross. By crossing MA lines with their ancestors, each RL is expected to contain a different complement of mutations, which can be determined by genotyping. If there is sufficient replication, it is possible to estimate the individual phenotypic effects of mutations. The total number of mutations genotyped in the six MA lines studied was 386, however, implying that the effects of most mutations must be very small, and estimation of a fixed effect for each mutation is inappropriate. We therefore developed a random effects model, fitted a mixture of distributions using MCMC, and obtained Bayesian estimates of the parameters of the distributions. We investigated models in which each mutation is assigned to one of a number of classes of fitness effects (which includes a class with zero effect) or we assume that mutation effects are drawn from a mixture of gamma distributions. Our approach has similarities to the Bayesian mixture model method BayesR (Moser et al 2015) developed to estimate the distribution of SNP effects in genome-wide association studies. BayesR simultaneously analyses all informative SNPs (we likewise include all mutations) and fits a mixture of distributions of SNP effects, including a zero-effect class. Specifically, BayesR estimates the relative frequencies of the zero-effect class and a mixture of normal distributions of SNP effects with fixed variances. In this respect, BayesR differs from our method, where we estimate discrete categories of effects or gamma distribution parameters as variables in the model, and we also simultaneously estimate the frequencies of mutations in the different effects categories or gamma distributions.
Previous approaches to infer the distribution of fitness effects for spontaneous mutations (the DFE) using data from MA experiments have compared the distributions of estimated trait values for MA lines and unmutated controls. The simplest approach is the Bateman-Mukai method (Bateman 1959; Mukai 1964), which uses the changes of trait mean and genetic variance between MA lines and unmutated controls to estimate a genomic mutation rate parameter (U, the frequency of mutations with an effect on the trait) and the average effect of a mutation (e̅), while assuming that all mutations have the same effect. The information that can be obtained by the Bateman-Mukai method, and other approaches that use the full distribution of MA line phenotypic values (Keightley 1994; García-Dorado 1997; Shaw et al 2002), is extremely limited, however (reviewed by Halligan and Keightley 2009). The limitation arises because the genomic mutation rate and e̅ are confounded with one another under the Bateman-Mukai approach, so the DFE and U are also confounded, and there is little information to distinguish between alternative models for the DFE if the effects of mutations are assumed to vary (Keightley 1998).
For five of the six CC-2931 MA line crosses, there is a negative relationship between growth rate and the number of mutations carried by a RL, although in some cases the relationship is very weak. This result is broadly consistent with the tendency for most C. reinhardtii MA lines to have lower growth rate than their ancestors (Morgan et al 2014; Kraemer et al 2016), and with Kraemer et al (2017), who generally observed negative relationships between fitness measured in competition with a marked strain and the numbers of mutations carried for MA lines of several genetic backgrounds. Kraemer et al (2017) also attempted to estimate a multi-category DFE based on the relationship between mutation number and fitness, but the amount of information available was limited, principally because there were only 10-14 MA lines tested of each genetic background. Here, we have characterized 1,526 RLs and a large number of combinations of genotypes, and therefore expect this design to be more powerful for inferring properties of the DFE than previous approaches that analysed individual MA lines.
We first investigated models in which mutation effects fall into discrete categories, including a zero-effect category. Under a two category model, there is a strong signal of growth rate-reducing mutations (estimated effect ~ −3%), consistent with the overall negative effect of spontaneous mutations we previously observed. The majority of mutations (~96%) are, however, allocated to the zero-effect class. Under a three class model, most mutations are also allocated to the zero-effect class, there is a negative-effect category with similar fitness effect and frequency as in the two class model, and a third category of positive-effect mutations (effect ~+2% on fitness). The frequency of positive-effect mutations is ~6%, but the credible interval is very wide. We then analysed a two-sided gamma distribution model, in which there are different means for the distributions of positive- and negative-effect mutations. Arguably, this is more realistic than the multi-category model, which assumes that mutation effects are invariant within a category. Consistent with the results from the analysis of the model with three discrete categories, there are both negative-and positive-effect mutations, and the proportion of positive-effect mutations is surprisingly high (~80%). The distributions for negative- and positive-effect mutations are highly leptokurtic (i‥e., the estimate of the shape parameter is ~0.3), and the absolute means of the distributions are both <1%, reflecting the concentration of density around zero. It appears that the effects of positive mutations are smaller than negative mutations, and the amount of mutational variance contributed by positive-effect mutations is ~20% that of negative-effect mutations. The fit of the estimated two-sided gamma distribution of effects is compared to the frequency distribution of the estimated effects of the individual mutations in Figure 3. Overall, the fit to the expected distribution is reasonable, although it appears that the inferred distribution may under-fit negative-effect mutations. There is one mutation with a positive effect of +5% (a G→C mutation in the 3'UTR of a gene on chromosome 6 of unknown function) and several mutations with absolute negative or positive effects >1%. The annotations associated with the 10 mutations with the highest absolute effects (i.e., the most extreme 2.5%) are shown in Table S6. There is no significant enrichment of any annotation we tested for these most extreme effects (or for the most extreme 5%; data not shown).
Why are we seeing a high proportion of positive-effect mutations? One possibility is that mutations that increase fitness are common in natural populations, and this is reflected in MA experiments (Shaw et al 2003; Rutter et al 2018). An alternative view is that deleterious mutations predominate in nature, principally because organisms are well adapted to the environments they typically experience (Keightley and Lynch 2003). Consistent with this, functional elements of the genome are typically conserved (Graur et al 2013), and analysis of the frequency of amino acid and synonymous polymorphisms within populations suggests that advantageous amino acid mutations are infrequent (Schneider et al 2011). A second possible explanation is that the algae were assayed in an environment which the species does not encounter in the wild, and some mutations that are deleterious in nature increase growth rate in the laboratory. A third possibility is that natural selection could not be prevented during the MA experiment, and there was either positive selection for mutations increasing growth or negative selection acting on mutations decreasing growth rate. This could take the form of between colony selection, if the fastest growing colonies were picked preferentially. Alternatively, there could be within colony selection, if new advantageous mutations occurring during colony expansion rise in frequency, or new deleterious mutations are removed during colony expansion. The effective population size was ~7 (Morgan et al. 2014), and we infer that few mutations have positive effects >10%, so any substantial selection for positive-effect mutations seems unlikely. On the other hand, deleterious mutations with effects >10% (including lethal or near-lethal mutations) would be under-represented.
The inferred DFE is is highly leptokurtic; many mutations have a very small effect and their is a long tail of large effect mutations. Under the reflected gamma distribution model the shape parameter of the distribution of negative- and positive-effect mutations is ~0.3. This is close to the value assumed by Kimura (1979, 1983) when analysing the nearly neutral model of molecular evolution, and is therefore consistent with the observation that amino acid variation is relatively insensitive to Ne (Ohta 1977; Kimura 1983), since a high proportion of sites in the Chlamydomonas genome are in protein-coding exons. It is also relevant to the narrow range of variation observed at synonymous and noncoding sites (Leffler et al 2012), if such sites become effectively selected in populations of large effective size. Such a leptokurtic distribution also has implications for the response to artificial selection and maintenance of variation for quantitative traits. If mutation effects are drawn from a leptokurtic distribution, then the response from new mutations builds up quicker than under the infinitesimal model, but is more variable, since response depends on the chance appearance and fixation of mutations with large effects (Hill 1982). A weak relationship between genetic variance for fitness or a correlated trait and Ne is also predicted (Keightley and Hill 1990).
To our knowledge, our approach of crossing MA lines to their ancestors and genotyping and phenotyping the crosses has not been previously attempted. It is related to that applied to induced mutations in RNA viruses (Sanjuan et al 2004) and in mismatch repair–deficient E. coli (Robert et al 2018). A limitations of our approach is that some mutations we previously identified by whole-genome sequencing (Ness et al 2015) were not amenable to genotyping. Specifically, some classes of mutations, such as large indel events or transposable element insertions, were not detectable by our short-read sequencing study, or may have occurred in regions that could not be aligned to the reference genome (Ness et al 2015). The approach is limited by the precision of phenotyping for mutations with small effects on growth rate. In general, laboratory-based measurements of mutation effects on fitness have been limited to those stronger than 10−3 (Gallet et al 2012). Mutations with effects of this magnitude or below will be allocated to the zero-effect category under the discrete class model or will have estimated effects close to zero under the two-sided gamma distribution model. Such mutations might be effectively selected in natural populations, however. The approach therefore has the capability of informing about mutations which may be under such strong selection in nature, and as such rarely segregate in natural populations. Other approaches that focus on the frequency distribution of segregating polymorphisms (Eyre-Walker et al 2006; Keightley and Eyre-Walker 2007; Boyko et al. 2008; Tataru et al 2017) inform about weakly selected mutations and therefore complement the present approach.
Materials and Methods
Mutation accumulation lines and the ancestral strain
Production and sequencing of MA lines of six strains of C. reinhardtii has been described previously (Morgan et al. 2014, Ness et al. 2015). Here, we focus on MA lines derived from CC-2931, a strain first sampled in Durham, North Carolina in 1991 that has a typical mutation rate among several strains we investigated (Ness et al. 2015) and decreasing mean fitness with increasing mutation number (Kraemer et al. 2017).
The MA lines and their ancestral strain are of the same mating type (mt-), so we first produced a “compatible ancestor” to which the MA lines could be crossed. This was done by backcrossing CC-2931 to a strain of the opposite mating type (CC-2344, mt+) for 13 generations with the aim of producing a strain identical to CC-2931, except for the region around the mating type locus on chromosome 6. Genome sequencing of the compatible ancestor (using the method of Ness et al. 2015) unexpectedly revealed, however, non-CC-2931 regions not only on chromosome 6 but also on chromosomes 4, 5, and 16 (Figure S9, S10) comprising a total of 7.6% of the genome and leaving 13 pure CC-2931 chromosomes. We dealt with this issue by including markers for these regions as factors in the analyses (see below).
Generation of first generation recombinant lines (RLs)
For each MA line, we set up nine independent matings with the compatible ancestor, and collected 32 recombinant lines (RLs) from each to obtain a total of 288 RLs per MA line. Matings were set up by inoculating cultures for both parents into 200 µl of liquid Bold's medium (Bold 1942), and incubating these under standard growth conditions (23°C, 60% humidity, constant white light illumination) while shaking at 180 rpm for four days. Nitrogen-free conditions are required to trigger mating in C. reinhardtii (Sager & Granick 1954), so we centrifuged the cultures (3500 g, 5 minutes), removed the supernatant, and added 200 µl of nitrogen-free liquid Bold's medium. We then mixed 50 µl each of MA line and compatible ancestor cultures and incubated the matings for approximately 24 hours under standard growth conditions to allow zygotes to form at the surface. The zygote mats were transferred to Petri dishes containing Bold's agar and incubated in the dark for five days to allow zygote maturation. To kill any vegetative cells associated with the zygote mats, the Petri dishes were exposed to chloroform for 45-60 seconds. Subsequently, the Petri dishes were incubated under standard growth conditions until the matured zygotes had germinated. As controls for the chloroform treatment, 30 µl of both of the unmated parents of each of the mating reactions were subjected to the same procedure, and the respective mating reaction was discarded if any growth was observed. After successful germination, 2 ml of liquid Bold's medium was added to the Petri dishes to allow the germinated cells to go into suspension. The suspensions were then diluted and spread onto new Petri dishes containing Bold's agar and incubated under standard growth conditions until individual colonies had grown sufficiently to be picked. Initially 36 individual clones representing individual RLs were picked from each mating and transferred into 200 µl liquid Bold's medium and incubated under standard growth conditions while shaking at 180 rpm for three days. Finally, 32 of the 36 picked RLs were transferred onto Bold's agar in 7 ml bijou containers for long-term storage.
Sample preparation for DNA extraction and genotyping
We used the competitive allele-specific PCR (KASPTM, Kompetitive Allele Specific PCR) technology to genotype the RLs of each MA line, the corresponding MA line, the compatible ancestor and the original unmutated ancestral strain (CC-2931) at the locations of the mutations previously reported for the MA lines (Ness et al. 2015). For allele-specific primer design, DNA regions of 1,000 base pairs (bp) surrounding each mutation were extracted from the C. reinhardtii reference genome (strain CC-503; version 5.3; Merchant et al. 2007). The regions were then corrected to match the consensus sequence of the CC-2931 MA lines.
For DNA extraction, we obtained cell pellets of at least 50 mg as follows. We inoculated the RLs and ancestors into 200 µl of liquid Bold's medium and incubated these under standard conditions with shaking at 180 rpm for four days. The cultures were then transferred to individual wells of 6-well plates filled with 6 ml of Bold's agar and incubated under standard conditions until a thick lawn had grown. Cells were then scraped off, transferred to 2 ml tubes and frozen at −70 °C. DNA samples were extracted from the frozen cell pellets and genotyped by LGC Genomics (http://www.lgcgenomics.com) using the sequences flanking each mutation of interest.
In addition to genotyping the known mutations, we genotyped markers that distinguish the mating types and the non-CC-2931 regions (Figure S9). For the mating type locus we designed markers matching loci specific to the two mating types, the fus1 locus for the mt+ mating type and the mid locus for the mt-mating type. For the non-CC-2931 regions we included markers for sites that differed between the two strains within these regions.
Determination of RL mating types by crossing
In addition to using genetic markers, we determined mating type using crosses. In separate mating reactions, we mated each RL with the ancestral strain and with the compatible ancestor, using a modification of the mating protocol described above, in which we extended the incubation period for the mating reaction under standard growth conditions to approximately 48 hours and then incubated plates in the dark for 5 days. To kill vegetative cells, we then incubated the plates for 5 hours at −20 °C, added 100 µl of a Bold's medium containing twice the amount of nitrogen as Bold's medium, and incubated the plates under standard growth conditions while shaking at 180 rpm until zygotes had germinated. We assigned mating type for each RL based on the combined results of the mating test and the mating type genotyping test. If one test failed, we used the result of the other. If the tests disagreed or both failed, a mating type was deemed not assignable and was recorded as missing data.
Measurement of growth rate
To generate growth curves for the individual RLs and their parents (i.e., the corresponding CC-2931 MA line and the compatible ancestor), we inoculated each of these separately into individual wells of 96 well plates containing 200 µl of liquid Bold's medium. Each plate contained samples from 58 RLs, all derived from the same MA line and their parental lines. We allocated lines randomly among the 60 central wells to avoid plate edge effects (Morgan et al. 2014) and filled the outer wells with 200 µl of medium to maintain humidity and reduce evaporation in the central wells. All plates were initially incubated for four days under standard conditions. On day 4, we transferred 2 µl of each culture to the corresponding well on a new 96-well plate filled with 198 µl liquid Bold's medium to start the growth assay. As an estimate of cell density, we measured absorbance at 650 nm every 12 hours over a period of 96 hours. We repeated this complete procedure twice in order to have two temporally independent replicates for each RL.
Maximum growth rate can be estimated from each growth curve as the slope of the linear regression of the natural log (ln) of absorbance on time during the exponential phase of growth. Unfortunately, the start and duration of exponential growth varied between growth curves, so we were unable to simply estimate growth over the same time window for each growth curve. Instead we used the following procedure. For each growth curve we generated a number of 48 hour time windows which spanned 4 measurements in our growth curves. The first started at 12 hours and ran to 60 hours, the second started at 24 hours and ran to 60 hours and so on until we had all possible windows up until 96 hours. For each window we then carried out a regression of ln absorbance on time. The slope of this regression line gives us an estimate of the rate of increase during this time period, whilst the proportion of the total variation in growth rate explained by the linear regression on time (the R2 value) gives an estimate of how well the linear relationship fits the data. We carried out this procedure for windows of 60 hours (5 time points), 72 hours (6 time points) and 84 hours (7 time points). We then excluded any windows for which the fit of the linear model was not adequate (R2 < 0.75). We then examined the slope estimates from each of the remaining windows and used the highest estimate as our measure of maximum growth rate for that growth curve. Visual inspection of the fitted lines on the time series showed that this procedure was effective in identifying the period of maximum growth for the variety of observed growth trajectories. For a total of 8 RL replicate time series measures, an adequate fit was not achieved for any of the time windows due to extremely unusual growth trajectories, and these were excluded from further analysis (Table S1).
Data processing and preparation
Mutations that were invariant across all samples were considered as artifactual and excluded. We also excluded mutations that were in complete linkage with either the mating type locus or one or more marker from the non-CC-2931 regions (Table S2). In the case of 21 mutations, only one of the two allele-specific primers worked successfully, and consequently no genotype information on the mutation was available for about 50% of the RLs. We corrected such mutations by changing the missing genotype to the non-amplified allele (Table S7). We excluded RLs for which genotypes of more than 10% of mutations were missing and/or for which more than 5% of mutations were assigned as heterozygous. C. reinhardtii is haploid, and multiple heterozygous calls suggest that the RL contains several different genotypes, and potentially cross-contamination. The rationale for setting these thresholds for missing data and heterozygous calls came from plotting the distributions of percentages of missing data and heterozygous calls for the complete data set. Only a small number of RLs have more than 10% of missing mutations and/or have more than 5% of the mutations assigned as heterozygous (Figure S11).
After carrying out the above filtering steps, several missing genotypes remained, so we imputed as many as possible to facilitate analyses. We first assigned missing genotypes for cases where RLs of apparently identical haplotype originated from the same mating reaction. In a second step, we computed the squared measure of linkage disequilibrium between pairs of mutations (r²) (Charlesworth and Charlesworth 2010). We then examined the remaining mutations that have missing genotypes in turn. If r² between a mutation and its neighbouring mutation was above 0.7, we imputed its allelic state using the state of the neighbouring mutation. 1,982 (0.97%) of the total 205,351 data points across all MA lines were initially missing (=number of mutations x number of samples including all replicates of RLs and ancestors). With our imputation approach, we were able to impute 1,766 (89%) of them so that only 216 (0.11%) missing data points remained.
Relationship between number of mutations and growth rate
To examine the relationship between RL growth rate and the number of mutations carried, we fitted a linear mixed model to the combined data set from all 6 MA lines and to the individual MA lines, with growth rate as the response variable and the number of mutations carried as a continuous linear predictor. To control for other sources of variation we also fitted mating type and all markers for the non-CC-2931 regions as fixed factors, and MA line, haplotype and growth assay plate as random factors. The significance of the number of mutations was examined by comparing models with and without this term, using a likelihood ratio test. The analysis was also done for specific mutation types (SNP, indel, exonic, intronic, intergenic). Models were fitted using the lme4 (Bates et al. 2015) package in R (R Core Team 2018). The data along with the R code are provided in the supplementary online materials.
Inference of the distribution of effects of mutations for growth rate
We developed a MCMC approach to infer the distribution of effects of mutations for growth rate, assuming two kinds of models. We modelled a discrete distribution in which each mutation’s effect fell into one of a number (nc) categories, and a two-sided gamma distribution allowing different parameters for the distributions of negative- and positive-effect mutations. To control for the effects of mating type and presence/absence of non-CC-2931 chromosomal regions, we estimated nf two-level fixed effects. Normally distributed plate effects and residual effects and the overall mean were also fitted. When analysing a merged data set of all six MA line crosses, a different mean was fitted for each MA line and any RL with one or more missing genotypes was excluded.
The data for the RLs bred from one MA line are represented in Table 6. Let nb be the number of RL observations and let nm be the total number of mutations in the MA line. Mutations are encoded in a nb x nm matrix, M, whose elements (0 or 1) indicate the presence or absence of a mutation in a RL. The fixed effects associated with each observation are encoded in a matrix, F, of dimension nb x nf with elements 0 or 1. Plate numbers (np levels) and phenotypic values associated with each observation are vectors r and y, respectively, both of dimension nb.
Each MCMC iteration, the state of one of the model’s variables (which are elements of vectors or scalars defined in Table 7-9) is proposed, then accepted or rejected based on change in log likelihood. Posterior distributions of the model variables provide Bayesian parameter estimates.
Multi-category model
Under this model, one category of mutations has no effect on fitness, and the remaining categories have non-zero effects. The mutation category vector (m) specifies the category in which each mutation currently resides, the value zero signifying that a mutation is in the zero effect category (Table 7). Elements of m are proposed by randomly picking an integer in the range 0 ‥ 1 – nc, which is different from the current value. State variables for the effects and frequencies associated with each category are encoded in vectors e and q, respectively. The first element (e0) of e is fixed at zero, since it is the effect of the invariant zero-effect class, and the first element of q is set to , the frequency of the zero-effect class. Proposals for the remaining elements of q are random uniform deviates added to the current value. Changes to the values of all other variables are drawn from normal distributions with mean zero. The variances of the uniform and normal distributions of proposal deviates are adjusted during a burn-in phase so that about 25% of proposals are accepted.
Proposals are accepted or rejected by applying the Metropolis-Hastings algorithm based on the log likelihood of the data and the priors (which are designed to be uninformative), given the parameter values. The overall log likelihood contains contributions from the numbers of mutations in different categories, their frequencies, the random plate effect, and each observation, which are considered independent. Let v be a vector of dimension 0 to 1 – nc containing the numbers of mutations in each of the nc categories in the current proposal, and multinomial(nc, q, v) be the probability of sampling v from a multinomial distribution parameter q. Let normal(y, y̅, Ve) be the normal distribution probability density function for point y with mean y̅ and variance Ve. Let gi be the genotypic value of RL i, which is the sum of the effects of the mutations it carries. This is calculated from the set of mutations carried by the RL (specified in M), the categories into which these mutations fall (specified in m) and the effect associated with each category (specified in e): where the square brackets denote vector or matrix indexing, i.e., e[x] = ex. The overall log likelihood of the data is then:
Note that the model with three categories (including a zero-effect category) is equivalent to a model with a mixture of two gamma distributions both with shape parameters → ∞ plus a zero-effect category (see below).
Two-sided gamma distribution model
Under the two-sided gamma distribution model (whose variables are defined in Table 9), the current state of a mutation in the MCMC is defined by two variables. The first is whether the mutation has a negative or positive effect, encoded as 0 or 1 in vector μ. The second variable is the genotypic effect of the mutation, encoded in a matrix E of dimension [0‥1]x[1‥nm]. The current value of the element of μ selects the mutation’s current genotypic effect, i.e., for mutation i the genotypic value is E[μi][i]. The frequencies of negative- and positive-effect mutation are encoded in vector q. The scale and shape of the gamma distributions for negative and positive-effect mutations are encoded in vectors α and β, respectively. Proposals for q0 are random uniform deviates added to the current value, such that 0 < q0 < 1 and q1 = 1 - q0. A proposal for an element of μ is 0 if the current value is 1 and vice versa. Changes to the values of all other variables are drawn from normal distributions with mean zero with adjustment during the burn-in as described above.
Let gamma(x, α, β) be the gamma distribution PDF for point x, given scale and shape parameters α and β, respectively. Let v be a vector (with two elements indexed by 0 and 1) containing the numbers of mutations with negative and positive effects in the current proposal, and binomial(q0, v0) be the probability of sampling v0 negative-effect mutations, given that the frequency of negative-effect mutations is q0. Let gi be the genotypic value of RL i (the sum of the effects of the mutations it carries, as above). This is calculated from the set of mutations carried by the line (specified in M), the types into which these mutations fall (i.e., negative or positive specified in μ) and the effect of each mutation (specified in E): where δj takes the value −1 if μj is 0 and +1 if μj is 1. The overall log likelihood of the data is:
We considered models where the shape parameter of the gamma distributions for negative- and positive-effect mutations were the same or allowed to be different. We also implemented a somewhat more complex mixture model in which there are three categories of mutations: a zero-effect category and gamma-distributed positive and negative-effect mutations, as described above.
Running the MCMC
MCMC runs started with a burn-in of 108 iterations for multi-category models or 109 iterations for two-sided gamma distribution models. Parameter values were then sampled every 10,000 iterations for 9×108 iterations for multi-category models or for 5×109 iterations for two-sided gamma distribution models. From each sampled iteration the vector of state variables was stored for generation of plots of parameter values against iteration number or posterior density plots. The posterior mode was taken as the parameter estimate, and 95% credible intervals computed on the basis of ranked parameter values. Priors for fixed effects, plate effects and the overall mean and variance were uninformative. The prior for mutation frequencies was a uniform distribution bounded by 0 and 1, and were therefore informative. Priors for the mutation effect parameters (under the multiple category model) were uniform in the range +/− 0.5 phenotypic standard deviations. Priors for the mean of the gamma distributions (under the two-sided gamma distribution model) were uniform in the range zero to 0.5 phenotypic standard deviations. Priors for the shape parameters of the gamma distributions were uniform in the range 0.1 to 100.
To check whether signals detected in the data were genuine, phenotypic values for fitness were permuted among backcross lines within plates without replacement. The distribution of estimates for parameters of interest obtained from such permuted data sets were computed. Significant estimates from the original data were expected to lie outside these distributions.
Simulations
To check the method, simulated data sets with either two, three or four categories of mutational effects or a two-sided gamma DFE and 40 mutations in each data set were analysed as described above, while assuming the same model as simulated. In all cases posterior modes are close to the parameters values of the simulations (Figure S12-S15).
Author contributions.
Generated RLs [SAK, KBB, DM]. Carried out RL growth assays [SAK, KBB, DM, TSS, NC, PDK]. Carried out compatible ancestor generation, sequencing and analysis [JL, RWN, KBB]. Carried out RL genotyping and associated analysis and analysis of growth curves [KBB, TSS]. Carried out mating type assays [TSS]. Implemented and carried out MCMC analysis [PDK]. Wrote paper [PDK, KBB, NC]. Conceived study [RWN, NC, PDK].
Supplementary Materials
Supplementary figures
Supplementary tables
Acknowledgements.
We thank Jarrod Hadfield for advice on the Bayesian mixture model and its implementation and Adam Eyre-Walker and Tom Booker for helpful comments on the manuscript. The work described was funded by a grant from the UK Biotechnology and Biological Sciences Research Council (BBSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 694212).