Abstract
The effect of a mutation on the organism often depends on what other mutations are already present in its genome. Geneticists refer to such mutational interactions as epistasis. Pairwise epistatic effects have been recognized for over a century, and their evolutionary implications have received theoretical attention for nearly as long. However, pairwise epistatic interactions themselves can vary with genomic background. This is called higher-order epistasis, and its consequences for evolution are much less well understood. Here, we assess the influence that higher-order epistasis has on the topography of 16 published, biological fitness landscapes. We find that, on average, the influence of epistatic terms on fitness landscape topography declines with order, and suggest that notable exceptions to this trend may deserve experimental scrutiny. We explore whether natural selection may have contributed to this finding, and conclude by highlighting opportunities for further work dissecting the influence that epistasis of all orders has on the efficiency of natural selection.
1 Introduction
One of the more evocative pictures of evolution is that of a population climbing the fitness landscape [41, 48]. This image was originally proposed by Sewall Wright [81] to build intuition into his [80] and R.A. Fisher’s [22] technical treatment of Darwin’s theory of natural selection in finite populations under Mendelian genetics [55]. The topography of the fitness landscape represents the strength and direction of natural selection as local gradients that influence the direction and speed with which populations evolve.
While several distinct framings of the fitness landscape have been suggested [55], here we employ the projection of genotypic fitness over Maynard Smith’s sequence space [40]. Sequence space is a discrete, high-dimensional space in which genotypes differing by exactly one point mutation are spatially adjacent. Thus, proximity on the fitness landscape corresponds to mutational accessibility, and selection will tend to drive populations along the locally steepest mutational trajectory. (See [75] for several processes not readily captured by this construction.)
The most obviously interesting topographic feature of the fitness landscape is the number of maxima, a point already recognized by Wright [81]. Two (or more) maxima can constrain natural selection’s ability to discover highest-fitness solutions, since populations may be required to transit lower-fitness valleys on the landscape en route. (Though see [29, 72] for the population genetics of that process, sometimes called stochastic tunneling [15, 29].)
1.1 Epistasis and fitness landscape topography
Epistasis is the geneticist’s term for interactions among mutational effects on the organism [50]. For example, genetically disabling two genes whose products act in the same linear biochemical pathway can have a much more modest effect than the sum of the effects of disabling either gene in isolation. Conversely, disabling two functionally redundant genes can have a much more substantial effect than expected. (Indeed, such observations have taught us quite a bit about the organization of biochemical pathways, e.g., [2].)
Epistatic interactions between mutations can occur for any organismal trait, including fitness. Importantly, epistasis for fitness has an intimate connection to the topography of the fitness landscape, a fact also already appreciated by Wright [81]. For example, multiple peaks require the presence of mutations that are only conditionally beneficial (called sign epistasis [53, 75]). More generally, an isomorphism exists between fitness landscapes defined by mutations at some L positions in the genome and the suite of epistatic interactions possible among them. This follows because, while any particular mutation can appear on $2^{L-1}$ different genetic backgrounds (assuming two alternative genetic states at each position), each such mutation-by-background pair corresponds to a distinct adjacency in sequence space. Consequently, arbitrary differences in the fitness effect of a mutation across genetic backgrounds (i.e., epistasis of all orders) can generically be represented on the fitness landscape [75].
1.2 Higher order epistasis and natural selection
Widespread epistasis between pairs of mutations has been recognized in nature for over 100 years [50, 74], and the corresponding theory is fairly advanced (e.g., [5, 79]). However, interactions can themselves vary with genetic background, a phenomenon called higher-order epistasis [13, 74]. And while it is now becoming clear that higher-order interactions are commonplace in nature [36, 46, 68, 74], their influence on natural selection is less well understood (though see [59]). Clearly, epistatic interactions of all orders can render natural selection less efficient (e.g., [75, 79]). It is thus reasonable to suppose that higher-order epistasis might be particularly likely to confound selection’s ability to efficiently improve fitness. To test this idea, we relied on the fact that empirically observed fitness landscapes are sampled by virtue of their occupancy by natural populations, which are themselves the products of natural selection. This motivated our specific hypothesis: that epistatic influence on the topography of naturally occurring fitness landscapes should decline with epistatic order. We tested this prediction using 16 published biological fitness landscapes.
2 Methods
2.1 The order of epistatic interactions
Any set of L biallelic loci defines $2^L$ genotypes, and thus $2^L$ potentially independent fitness values. Simultaneously, there are $\binom{L}{k}$ distinct subsets of k mutations that in principle can also independently contribute to a genotype’s fitness. In total, there are thus $\sum_{k=0}^{L} \binom{L}{k} = 2^L$ subsets of mutations (i.e., the power set of L mutations). This counting reflects the isomorphism between any fitness landscape and its corresponding suite of epistatic terms [74].
We designate interactions among any subset of k mutations as kth-order epistasis. Note that here first-order “epistasis” is degenerate in the sense that it represents the fitness effects of each of the L mutations in isolation. And our zeroth-order “epistatic” term is the benchmark, fitness landscape-wide mean, relative to which the effect of each subset of mutations is computed.
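The counting argument can be checked by brute force. The following is a minimal Python sketch, not part of the archived analysis code; the function name `subsets_by_order` is ours and purely illustrative. It encodes each subset of L mutations as an L-bit integer and tallies subsets by the number of mutations they contain.

```python
from math import comb

def subsets_by_order(L):
    """Tally the 2**L subsets of L mutations by their size k (the epistatic order)."""
    counts = [0] * (L + 1)
    for subset in range(2 ** L):
        k = bin(subset).count("1")   # number of mutations present in this subset
        counts[k] += 1
    return counts

L = 4
counts = subsets_by_order(L)
# Each order k contributes C(L, k) subsets, and together they account for all 2**L.
assert counts == [comb(L, k) for k in range(L + 1)]
assert sum(counts) == 2 ** L
print(counts)  # [1, 4, 6, 4, 1]
```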
2.2 The Fourier-Walsh transformation
Following earlier work [26, 45, 65, 71, 74] we employ the Fourier-Walsh transformation (Fig. 1a) to convert between fitness landscapes and their corresponding epistatic terms. This is a linear transformation written
$$\vec{E} = \frac{1}{2^L}\,\Psi\,\vec{W} \tag{1}$$
Here $\vec{W}$ is the vector of all $2^L$ fitness values arranged in the canonical order defined by ascending L-bit binary numbers encoding the corresponding genotype with respect to the presence or absence of each mutation (e.g., [37]). (W is the traditional population genetics symbol for fitness.) Ψ is the Hadamard matrix, the unique, symmetric $2^L \times 2^L$ matrix whose entries are either +1 or -1 and whose rows (and columns) are mutually orthogonal. Finally, $\vec{E}$ is the resulting vector of $2^L$ epistatic terms arranged in the canonical order defined by ascending L-bit binary numbers whose 1’s indicate the corresponding subset of interacting loci. Fig. 1a illustrates this transformation. (See [54] for the relationship between this and other formalisms for computing epistatic terms.)
The orthogonality and symmetry of Ψ mean that $\Psi^T \Psi = \Psi^2 = 2^L I$, where I is the identity matrix. This means that, just as Eqn. 1 converts any landscape into its epistatic terms, so too can any vector of epistatic terms $\vec{E}$ be converted into its corresponding fitness landscape as $\vec{W} = \Psi \vec{E}$. We take advantage of this fact next.
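The forward and inverse transformations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' MATLAB implementation: we assume the Sylvester construction of the Hadamard matrix, with entry (i, j) equal to $(-1)^{\text{popcount}(i \wedge j)}$, and a $1/2^L$ normalization so that the zeroth-order term equals the landscape-wide mean. The function names and the example fitness values are hypothetical.

```python
def hadamard(L):
    """Sylvester-type Hadamard matrix of size 2**L: entry (i, j) is (-1)**popcount(i & j)."""
    n = 2 ** L
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]

def walsh_forward(W):
    """Epistatic terms E = Psi . W / 2**L for a fitness vector W of length 2**L."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * W[j] for j in range(n)) / n for i in range(n)]

def walsh_inverse(E):
    """Fitness vector W = Psi . E recovered from a vector of epistatic terms E."""
    n = len(E)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * E[j] for j in range(n)) for i in range(n)]

# Hypothetical two-locus landscape, genotypes in canonical order 00, 01, 10, 11.
W = [1.0, 1.2, 1.5, 2.5]
E = walsh_forward(W)
# E[0] is the landscape-wide mean, and the round trip recovers the fitness values.
assert abs(E[0] - sum(W) / len(W)) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(walsh_inverse(E), W))
```

Under this assumed sign convention a beneficial mutation yields a negative first-order coefficient. Other conventions flip signs but leave absolute magnitudes, which are what the ranking in §2.3 uses, unchanged.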
2.3 Subsetting approximations of a fitness landscape
Given fitness function $\vec{W}$, we now introduce subsetting approximations $\vec{W}'$. These are constructed so that their epistatic terms $\vec{E}'$ retain $0 \le m \le 2^L$ of the components in $\vec{E}$ (Eqn. 1), with the remaining $2^L - m$ components set to zero. There are thus $2^{2^L}$ subsetting approximations for any fitness function $\vec{W}$ (corresponding to the power set of the $2^L$ epistatic terms in $\vec{E}$). As a consequence of the orthogonality of the Fourier-Walsh transformation, the sum of squares distance between fitness function $\vec{W}$ and subsetting approximation $\vec{W}'$ is minimized for given m if and only if $\vec{E}'$ uses the m largest components in absolute value of $\vec{E}$ (Appendix 1). We denote these best subsetting approximations $\vec{W}'_m$ for $0 \le m \le 2^L$. (Subsetting approximations defined by interaction order rather than absolute magnitude of epistatic terms were recently employed elsewhere [59].)
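The optimality claim can be verified exhaustively on a small landscape. The sketch below is illustrative only: it assumes the Sylvester Hadamard construction and $1/2^L$ normalization, the helper names are ours, and the three-locus fitness values are made up. It compares the magnitude-ranked subset against every possible subset of the same size.

```python
from itertools import combinations

def hadamard(L):
    """Sylvester-type Hadamard matrix of size 2**L."""
    n = 2 ** L
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]

def epistatic_terms(W):
    """E = Psi . W / 2**L (normalization assumed)."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * W[j] for j in range(n)) / n for i in range(n)]

def subsetting_distance(W, kept):
    """Sum-of-squares distance between W and the approximation retaining only `kept` terms."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    E = epistatic_terms(W)
    E_sub = [E[i] if i in kept else 0.0 for i in range(n)]
    W_sub = [sum(psi[i][j] * E_sub[j] for j in range(n)) for i in range(n)]
    return sum((w - a) ** 2 for w, a in zip(W, W_sub))

def best_subset(W, m):
    """Indices of the m epistatic terms that are largest in absolute value."""
    E = epistatic_terms(W)
    return set(sorted(range(len(E)), key=lambda i: abs(E[i]), reverse=True)[:m])

# Hypothetical three-locus landscape (8 genotypes in canonical binary order).
W = [1.0, 1.3, 0.9, 1.8, 1.1, 1.2, 1.6, 2.9]
for m in range(len(W) + 1):
    best = subsetting_distance(W, best_subset(W, m))
    brute = min(subsetting_distance(W, set(c))
                for c in combinations(range(len(W)), m))
    assert abs(best - brute) < 1e-9   # magnitude ranking matches exhaustive search
```

Exhaustive search is feasible only for very small L, since the number of subsets grows as $2^{2^L}$; the magnitude ranking needs just one sort.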
2.4 Quantifying the influence of epistatic terms on empirical fitness landscape topography
To examine the influence of epistasis on fitness landscape topography as a function of epistatic order, we first used Eqn. 1 to compute $\vec{E}$ for each $\vec{W}$ gleaned from the literature (§2.7). For each $1 \le m \le 2^L$, we then iteratively constructed each best subsetting approximation $\vec{W}'_m$. Finally, for each m we recorded the residual variance between $\vec{W}$ and $\vec{W}'_m$ (minimized by this subsetting approximation; §2.3), together with the epistatic order of the mth-largest component of $\vec{E}$. Fig. 1b illustrates this process.
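A compact sketch of this bookkeeping follows. It is not the archived pipeline: it assumes the Sylvester Hadamard construction with $1/2^L$ normalization, reports the residual as a sum-of-squares distance (using the identity from Appendix 1 that this distance equals $2^L$ times the summed squares of the omitted coefficients), and uses hypothetical names throughout.

```python
def hadamard(L):
    """Sylvester-type Hadamard matrix of size 2**L."""
    n = 2 ** L
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]

def epistatic_terms(W):
    """E = Psi . W / 2**L (normalization assumed)."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * W[j] for j in range(n)) / n for i in range(n)]

def ranked_terms(W):
    """Rows (m, epistatic order of the m-th largest term, residual sum of squares after m terms)."""
    n = len(W)
    E = epistatic_terms(W)
    ranking = sorted(range(n), key=lambda i: abs(E[i]), reverse=True)
    remaining = sum(e * e for e in E)        # squared coefficients not yet included
    rows = []
    for m, idx in enumerate(ranking, start=1):
        remaining -= E[idx] ** 2
        order = bin(idx).count("1")          # number of interacting loci in this term
        rows.append((m, order, n * max(remaining, 0.0)))
    return rows

# Hypothetical two-locus landscape, genotypes in canonical order 00, 01, 10, 11.
for m, order, residual in ranked_terms([1.0, 1.2, 1.5, 2.5]):
    print(m, order, round(residual, 6))
```

The second column of the output is the observed sequence of epistatic orders that is later compared with the expectation in §2.5.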
2.5 Statistics
Our hypothesis is that the influence of an epistatic term on the fitness landscape should decline with epistatic order. Put another way, we expected that after sorting the elements of $\vec{E}$ (Eqn. 1) by their absolute magnitudes, the associated epistatic orders should be represented by a vector of $2^L$ integers that reads:
$$\Big(0,\ \underbrace{1, \ldots, 1}_{L},\ \underbrace{2, \ldots, 2}_{\binom{L}{2}},\ \ldots,\ \underbrace{k, \ldots, k}_{\binom{L}{k}},\ \ldots,\ L\Big) \tag{2}$$
Specifically, this vector consists first of one zero followed by L ones, $\binom{L}{2}$ twos and in general $\binom{L}{k}$ k’s for all $1 \le k \le L$.
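For concreteness, the expected vector can be generated directly from the binomial counts. This short Python sketch uses a hypothetical function name, `expected_orders`.

```python
from math import comb

def expected_orders(L):
    """Expected order sequence: one 0, then L ones, C(L, 2) twos, ..., and a single L."""
    return [k for k in range(L + 1) for _ in range(comb(L, k))]

print(expected_orders(3))  # [0, 1, 1, 1, 2, 2, 2, 3]
assert len(expected_orders(6)) == 2 ** 6
```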
We tested this hypothesis for each dataset by first computing Kendall’s τb correlation coefficient [31] between this expectation and the epistatic orders observed among the elements in $\vec{E}$ sorted by absolute magnitude. τb is one (negative one) when the observed epistatic orders are perfectly correlated (anticorrelated) with expectation, and zero when they are uncorrelated. Note that Kendall’s τb statistic is appropriate because it accommodates ties. For studies that also reported experimental variance, we computed the correlation coefficient after discarding the epistatic orders of all j elements in $\vec{E}$ that reduced residual variance by less than experimental variance (Fig. 1b, Table 1) as well as the last j epistatic order values in our expectation (Eqn. 2).
For each dataset, we then used a permutation test to test the null hypothesis that the corresponding correlation coefficient is zero. Specifically, each dataset is characterized by some number of epistatic terms: $2^L$ in cases where no experimental variance estimate is provided, or $2^L - j$ in cases where we were able to identify non-significant epistatic components (see previous paragraph and Table 1). For each of n = 10,000 replicates, we computed the rank correlation coefficient between two random permutations of this number ($2^L$ or $2^L - j$) of the epistatic order values drawn from Eqn. 2 for given L. We then sorted correlation coefficients, and the uncorrected P value reported for each dataset (Table 1) was taken as the fraction of permutations in which a correlation coefficient greater than or equal to the empirical value was observed (Fig. 1c). Thus, ours is a one-tailed test of the hypothesis that no positive correlation is present.
We used the Bonferroni-Holm method [28] to correct for multiple tests. In addition, under the null hypothesis that epistatic orders are uncorrelated with our naïve expectation, the P values observed across datasets should be uniformly distributed. We tested this hypothesis with a G-test after binning counts of empirically observed P values. We assessed statistical significance relative to the $\chi^2$ distribution [61].
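The test described above can be sketched in plain Python. This is our own illustrative reimplementation, not the MATLAB code cited in §2.8: `kendall_tau_b` is a direct O(n²) evaluation of the tie-corrected statistic, `permutation_p_value` follows the permutation scheme as we read it from the text, and the observed order sequence in the example is invented. Truncating the expectation by j elements, where applicable, is left to the caller.

```python
import random
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1          # pair tied in x
            if dy == 0:
                ties_y += 1          # pair tied in y
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def permutation_p_value(observed, expected, replicates=10_000, seed=0):
    """One-tailed P: fraction of replicates whose tau-b between two random
    permutations of `expected` is at least the empirical tau-b."""
    rng = random.Random(seed)
    tau_obs = kendall_tau_b(expected, observed)
    a, b = list(expected), list(expected)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(a)
        rng.shuffle(b)
        if kendall_tau_b(a, b) >= tau_obs:
            hits += 1
    return tau_obs, hits / replicates

# Hypothetical observed order sequence for L = 3, compared with the expected sequence.
expected = [0, 1, 1, 1, 2, 2, 2, 3]
observed = [0, 1, 2, 1, 1, 3, 2, 2]
tau, p = permutation_p_value(observed, expected, replicates=2000)
assert -1.0 <= tau <= 1.0 and 0.0 <= p <= 1.0
```

The quadratic-time statistic is adequate for the small example here; for 64 or more terms and 10,000 replicates, a vectorized or O(n log n) implementation would be preferable.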
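Both steps are simple to express in code. The sketch below is illustrative and makes assumptions the text does not state: a family-wise α of 0.05, a uniform expectation across equal-width P value bins, and invented P values and bin counts. The function names are ours, and the conversion of G to a P value via the $\chi^2$ distribution is omitted.

```python
from math import log

def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down rule: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                    # all larger P values also fail
    return reject

def g_statistic(observed_counts):
    """G = 2 * sum(O * ln(O / E)) against a uniform expectation; empty bins contribute zero."""
    total = sum(observed_counts)
    expected = total / len(observed_counts)
    return 2.0 * sum(o * log(o / expected) for o in observed_counts if o > 0)

# Hypothetical uncorrected P values from four tests.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))  # [True, False, False, False]

# Hypothetical counts of P values in six equal-width bins (five degrees of freedom).
G = g_statistic([10, 3, 1, 1, 0, 1])
assert G > 0.0
```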
2.6 Computing Kendall’s correlation coefficient with aggregated higher-order epistatic terms
We also computed Kendall’s correlation coefficient for the sequence of epistatic orders previously reported (Fig. 2 in [49]), in which third- and higher-order epistatic terms were not distinguished. To do this we coded all observed epistatic terms in this aggregate group as third-order. Those authors found that the residual variance in a model containing just 70 epistatic terms was roughly equal to the experimental variance. Thus, the analog to Eqn. 2 now contains one zero, followed by six ones, $\binom{6}{2} = 15$ twos, and threes in all remaining positions. Statistical significance was again assessed using a permutation test (n = 10,000) using this modified expectation in place of Eqn. 2.
2.7 Empirical datasets
Computing all $2^L$ epistatic terms in a fitness landscape defined over L biallelic loci requires data on the fitness values (or suitable proxy) for each of the corresponding $2^L$ genotypes. We previously designated such datasets combinatorially complete [74], and the datasets analyzed here are shown in Table 1. Several datasets [4, 38, 47, 49] had a few loci with cardinality greater than two. In these cases, we examined one “slice” through the landscape defined by randomly choosing just two alleles at those loci.
Several studies examined multiple phenotypes for a single set of mutations, and follow-up studies sometimes presented additional phenotypes for a previously described set of mutations. Those cases are enumerated in Table 2; for each set of mutations we randomly sampled just one phenotype. Table 2 also lists all combinatorially complete datasets we know that are defined over loci with cardinality greater than two. These were excluded here because the Fourier-Walsh framework does not readily generalize to higher cardinalities.
Following [74], datasets reporting growth rates [4, 11, 14, 16, 23, 25, 77] or drug-resistance phenotypes [9, 37, 42, 43, 49, 73] were log-transformed before analysis. Following [49], negative two was used in place of log-transformed values when growth rate or drug resistance phenotypes of zero were observed. (In all cases, this is roughly one log order smaller than the smallest non-zero log-transformed value.) In cases where only means and experimental variances (but not individual replicate observations) were provided, log transformations were approximated by Taylor expansions. In cases where only means (but not variances) were provided, log transformations were approximated from the means alone.
Following [49], for studies in which experimental variance estimates were provided, we recorded this quantity as a fraction of the total model variance. In one case [9], standard error was reported as standard error over “at least” two replicates; we therefore assumed n = 2 for each observation in that dataset. In one case [32], 95% experimental confidence intervals were reported, so variance estimates were computed under the assumption of normally distributed noise as $s^2 = (\sqrt{n} \cdot \mathrm{CI}_{95}/1.96)^2$.
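The expressions used for these approximations are not reproduced in the text above. For orientation only, a standard second-order Taylor (delta-method) approximation for a measurement x with mean μ and variance σ² reads as follows; that these are the exact expressions used in this analysis is an assumption on our part.

```latex
\mathrm{E}[\ln x] \approx \ln \mu - \frac{\sigma^{2}}{2\mu^{2}},
\qquad
\mathrm{Var}[\ln x] \approx \frac{\sigma^{2}}{\mu^{2}}
```

When no variance is available, the first expression reduces to $\ln \mu$. For base-10 logarithms, the mean is divided by $\ln 10$ and the variance by $(\ln 10)^2$.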
2.8 Data and software archiving
Input data files, together with purpose-built MATLAB code to perform all analyses described, are archived at https://github.com/weinreichlab/JStatPhys2017. Kendall’s τb correlation coefficient was computed using MATLAB code developed elsewhere [10].
3 Results
Epistasis can have profound consequences at many levels of biological organization [51, 57, 66, 79]. Here we were particularly interested in the possibility that higher-order epistasis might limit natural selection’ s ability to increase fitness. We indirectly assessed this hypothesis by examining the influence of epistasis on empirical fitness landscape topography as a function of epistatic order (Table 1). As these landscapes are occupied by extant populations that are themselves the product of natural selection, we reasoned that they may be enriched for properties that allow selection to operate efficiently. (See §1.2.)
This study was originally stimulated by Fig. 2 in Palmer et al., 2015 [49], which examined six mutations in the dihydrofolate reductase (DHFR) gene of E. coli that contribute to increased resistance to an antimicrobial called pyrimethamine. In that analysis, particular second- and third-order interactions were the third- and second-most influential epistatic terms for fitness landscape topography respectively. Indeed, just two of the first ten most influential epistatic terms were first-order, and in aggregate first-order terms explained just ~28% of the variance in fitness across the landscape. These results seem to challenge the hypothesis outlined in the previous paragraph, and we therefore sought to test its generality using published data from other systems.
Fig. 1 illustrates the application of our analytic pipeline (see Methods) to these same data. Our Fig. 1b closely recapitulates Fig. 2a in Palmer et al. 2015 [49]. While the precise sequence of epistatic terms differs slightly (likely because the previous study employed a subtly different framework for computing epistatic terms), higher-order epistatic interactions are again responsible for some of the largest reductions in residual variance. Indeed, as previously observed, just two of the first ten terms are first-order, and in aggregate first-order terms again explain just ~28% of the variance in the data (Table 3a, compare the first two columns with Fig. 2b in [49]). Importantly however, Fig. 1c illustrates that we find a significant, positive correlation between expectation (Eqn. 2) and the observed influence of epistatic terms on landscape topography as a function of their order (τb = 0.1980, P = 0.0377).
We next applied our pipeline to 15 other published, combinatorially complete datasets. Results are summarized in Table 1 and shown graphically in Fig. S1. Out of all 16 datasets examined, 14 exhibit a significantly positive correlation between observation and the expectation, and eight of these remain significant after Bonferroni-Holm correction for multiple tests. Moreover, across datasets Table 1 exhibits a bias toward small P values. Under the null hypothesis (no significant correlation with expectation), we would expect a uniform distribution of P values. Instead, the observed distribution is sharply and significantly skewed toward small values (Fig. 2, G = 143.77, $P_{\mathrm{d.f.}=5} \ll 0.01$).
4 Discussion
Using a novel analytic pipeline (Fig. 1), we have examined 16 published, combinatorially complete datasets. This analysis broadly confirms our intuition that the influence of epistatic terms on empirical fitness landscape topography should decline with order, i.e., with the number of interacting mutations. While considerable heterogeneity in effect exists among datasets (Table 1), eight of these 16 datasets exhibit a Bonferroni-corrected, significantly positive correlation with expectation (Eqn. 2). And across all 16 datasets, we find a sharp bias toward significant P values (Fig. 2). Nor is there any correlation between the size of the dataset and uncorrected P value (not shown), suggesting that low statistical power is unlikely to contribute to the overall picture.
The relative magnitudes of epistatic terms depend on the underlying fitness scale employed [33, 74]. Although we log-transformed growth rate and drug resistance data (see §2.7), we have otherwise overlooked this fact. Recently, approaches for systematically rescaling data to minimize higher-order epistatic effects have been introduced [58] (see also [45, 69]). Applications of such methods would certainly have quantitative consequences for results presented here. However, because these approaches (on average) reduce higher-order epistatic terms, we believe this omission renders our conclusions conservative.
We also acknowledge that we failed to honor experimental uncertainty in the sequence of epistatic orders observed, which would almost certainly weaken the signal reported in Table 1. However, our intuition is that this effect would be modest, and moreover, only applies to the nine (of 16) datasets for which experimental uncertainty estimates are available.
4.1 The combinatorics of higher-order epistasis
This work was originally stimulated by a previous study [49] that examined six mutations in the DHFR gene responsible for increased pyrimethamine resistance in E. coli. Results summarized in Fig. 2 of that study called into question the intuition outlined in §1.2, that higher-order epistasis should only modestly influence naturally occurring fitness landscapes. And the salient features of that figure were recapitulated by our treatment (Fig. 1b, Table 3a).
However, our statistical analysis reveals a significant positive correlation between epistatic influence on fitness topography as a function of epistatic order and that suggested by our evolutionary intuition (Fig. 1c). And applying our analytic pipeline to the aggregated epistatic orders (see §2.6) reported in Fig. 2a of the previous study [49], we again find a significant positive correlation between observed orders and expectation derived from the intuition outlined above (τb = 0.1056, uncorrected P = 0.0107). Thus, in this system the substantial influence of some high-order epistatic terms is not inconsistent with the idea that high-order epistatic terms should in general only modestly contribute to fitness topography.
The resolution to this puzzle resides in the combinatoric number of epistatic terms. As noted above, given L biallelic loci there are $\binom{L}{k}$ epistatic coefficients of order k, and this quantity grows almost exponentially for $k \ll L$. Indeed, after normalizing the summed influence of all epistatic terms of order k by the number of such terms, we observe that the per-term effect declines almost monotonically in this dataset (Table 3a; see also [74]). This is both as expected on the basis of the evolutionary intuition outlined above and consistent with the statistical analysis of the data in Fig. 1c. A similar picture emerges in our analysis of 5-epi-aristolochene production by sesquiterpene synthase mutants [47]. A number of high-order epistatic terms again explain substantial amounts of variance (Fig. S1m), despite a modestly positive correlation coefficient (~0.2) with Eqn. 2 (Table 1). Here again, the resolution reflects the combinatorics of epistatic terms, and the mean per-order effect also declines (Table 3b).
This line of thinking is closely related to the Fourier spectrum of a fitness landscape [45, 63], namely the sum of squared epistatic coefficients as a function of interaction order. (This connection derives formally from Appendix 1, which implies that the squared magnitude of each epistatic coefficient is monotonic in its influence on landscape topography.) The Fourier spectrum is proportional to the binomial coefficient $\binom{L}{k}$ when each genotype’s fitness is identically and independently distributed. This follows from the fact that on such landscapes all epistatic coefficients are also i.i.d., together with the combinatorics outlined in the previous paragraph. But as already anticipated by results in Table 3, Fourier spectra for both the DHFR and sesquiterpene synthase datasets are sharply shifted toward lower-order terms (not shown), as has previously been reported for both sesquiterpene synthase and several other biological datasets [45].
Despite these declining average epistatic effects, we find specific epistatic terms with anomalously large explanatory effects in many of the datasets examined here (Fig. S1). We suggest that these may reflect important mechanistic interactions among those particular mutations in the underlying biology of the system, thus representing potentially fruitful entry points for the molecular biologist [17].
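The per-order normalization can be illustrated with a short Python sketch. It is not the computation behind Table 3: we take the squared coefficient as the measure of a term's influence (motivated by Appendix 1), assume the Sylvester Hadamard construction with $1/2^L$ normalization, and use a made-up, purely additive landscape for which all terms of order two and above must vanish. All names are hypothetical.

```python
from math import comb

def hadamard(L):
    """Sylvester-type Hadamard matrix of size 2**L."""
    n = 2 ** L
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]

def epistatic_terms(W):
    """E = Psi . W / 2**L (normalization assumed)."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * W[j] for j in range(n)) / n for i in range(n)]

def per_order_summary(W):
    """Rows (k, summed squared coefficients of order k, mean squared coefficient per term)."""
    n = len(W)
    L = n.bit_length() - 1
    totals = [0.0] * (L + 1)
    for idx, e in enumerate(epistatic_terms(W)):
        totals[bin(idx).count("1")] += e * e
    return [(k, totals[k], totals[k] / comb(L, k)) for k in range(L + 1)]

# Hypothetical additive three-locus landscape: no epistasis of order two or higher.
W = [1.0 + 0.1 * (g & 1) + 0.2 * ((g >> 1) & 1) + 0.4 * ((g >> 2) & 1)
     for g in range(8)]
summary = per_order_summary(W)
assert all(total < 1e-12 for k, total, mean in summary if k >= 2)
for row in summary:
    print(row)
```

The second column corresponds to the Fourier spectrum by order; the third divides it by the $\binom{L}{k}$ terms of that order to give a per-term effect.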
4.2 An anthropic perspective and the limits to inferring the effects of natural selection
This study began from the intuition that epistatic influence on empirical fitness landscape topography should decline with epistatic order (see §1.2). This notion rests on three ideas. First, epistasis in general constrains natural selection’s efficiency [75, 79], and we further speculated that higher-order epistasis might be particularly influential in this respect (see also [59]). Yet datasets such as those in Tables 1 and 2 represent surveys in the genetically local vicinity of extant biological populations. And since biological populations are the product of natural selection, we supposed that they can only persist and succeed if the properties of the fitness landscape enable the efficient action of natural selection [19]. Hence our intuition reflects a line of reasoning that is analogous to the anthropic principle. The anthropic principle holds that the properties of our universe (e.g., the value of its physical constants) should not be regarded as random draws from the space of all conceivable universes. On the contrary, the fact of our observation of those properties sharply constrains our universe to be drawn from the subset of possibilities that are capable of supporting sentient perception.
Does this interpretation of the present analysis imply that natural selection is responsible for the apparently moderate influence of epistatic effects observed? In principle, we might imagine that the strength of epistasis will vary across the exponentially large reaches of sequence space. And if this were true, we might further expect that natural selection would favor populations that find their way to low-epistasis regions on the landscape in preference to those evolving in high-epistasis regions. This would follow if locally reduced epistatic effects sufficiently improved the efficiency of natural selection to allow such populations to outcompete others evolving in regions of the fitness landscape lacking these favorable features. Such population-level competition is sometimes called lineage selection in population genetics [1].
Recently, some support for these ideas has emerged [15, 69]. This work begins from the premise that the epistatic structure of a locally sampled fitness landscape reflects the way that the constituent mutations were selected by investigators. For example, mutations jointly selected for their ability to confer large fitness gains exhibit more modest epistatic interactions than do mutations whose joint effect is unknown [15]. This finding is at least consistent with lineage selection for reduced epistatic effects.
Critically however, we remain broadly ignorant about the levels of epistasis that exist at random locations in sequence space. Although ever-larger local surveys in sequence space are now becoming possible (e.g., [7, 24, 60]), these are still limited to sparse samples with radii (L) of just tens of point mutations. Moreover, the very fact of low (average) high-order epistatic components observed here seems to imply some autocorrelation in epistatic effects across sequence space. This suggests that much wider surveys will be required to develop a sense of what epistasis looks like “on average.” We thus conclude that for the time being we have only very limited insight into the influence that natural selection has had on the signals detected here.
Indeed, the ability to compare the genetics of what is possible with the genetics of what is observed in nature is generically essential to any demonstration of natural selection [62]. For example, the technological advances described in the previous paragraph are beginning to provide direct experimental access to what is possible at the scale of a handful of mutations, rapidly advancing our ability to detect natural selection acting on first-order mutational effects within individual proteins [7, 21, 62, 67, Wylie et al., in prep]. We look forward to being able to make analogous inferences regarding natural selection’s influence on epistatic effects in natural populations. (Though these sorts of questions can already be explored using toy fitness landscapes, e.g., [18, 19, 35, 78].)
4.3 Epistasis and the efficiency of natural selection
Throughout, we have assumed that high-order epistatic interactions reduce the efficiency of natural selection (§1.2). Our observation that the influence of epistatic terms on naturally occurring fitness landscapes declines with epistatic order represents an indirect test of this idea. However, we lack a detailed theoretical understanding of this connection.
Perhaps the most well-developed results concern the influence of epistasis on the selective accessibility of mutational trajectories to high fitness genotypes. First, sign epistasis means that the sign of the fitness effect of a mutation varies with genetic background [75], and it renders selectively inaccessible at least some mutational trajectories to high fitness (e.g., [73]). But connections between sign epistasis and epistatic order are only now being developed [13]. Second, a subsetting approach similar to ours (§2.3) was recently used to examine the influence of epistatic interactions on selectively accessible mutational trajectories to high fitness genotypes [59] in six of the datasets described here. Those authors found that higher-order terms indeed substantially alter the identity of selectively favored mutational trajectories to high-fitness genotypes, as well as their probabilities of realization. Further and consistent with findings here, that study also noted that the absolute magnitude of epistatic terms had an even larger effect on realized mutational trajectories than did their interaction order.
However, epistasis has long been understood to influence not just the selective accessibility of high fitness genotypes but also the pace at which natural selection both increases the frequency of beneficial mutations (e.g., [20]) and purges deleterious mutations (e.g., [34]). This work is closely related to the role that genetic recombination can play in “unlocking” epistatically interacting mutations (e.g., [5, 44]). To our knowledge the relationship between these effects and higher-order epistasis remains entirely unexplored.
In addition, we have only quantitatively examined the sequence of epistatic orders sorted by explanatory power (Fig. 1c). Thus, a great deal of information present in these data (e.g., the slopes in Figs. 1b and S1) remains to be examined. And of course, the number and size of available combinatorially complete datasets continues to grow, motivating further work in this regard. It seems reasonable to suppose that the development and testing of more nuanced theoretical predictions may be possible using data of the sort examined here.
Finally, we note that the Fourier-Walsh framework employed here depends on the availability of combinatorially complete datasets. But the experimental demands of this approach grow exponentially with the number of mutations examined. This fact sharply limits the scalability of analytic pipelines like ours. Recently, theoretical progress has been made in the analysis of less-than-complete datasets [6, 13], and older work has also explored this idea [27, 64]. Theory that allows inferences using sparse datasets is likely to be a key advance in our ability to explore broad, evolutionarily fascinating questions such as those considered here.
Acknowledgements
We are grateful to Tony Dean, David Hall, Sebastian Matuszewski, and Vaughn Cooper for providing raw data files. We also acknowledge constructive feedback on an earlier draft of this manuscript from Guillaume Achaz, Kristina Crona, Inês Fragata, Joachim Krug, Sebastian Matuszewski and Brandon Ogbunugafor. DMW is supported in part by National Science Foundation Grant DEB-1556300 and National Institutes of Health Grant R01GM095728. RBH is supported in part by the National Science Foundation under Cooperative Agreement No. DBI-0939454.
Appendix 1 The explanatory power of Fourier-Walsh coefficients is monotonic in their absolute magnitude
Assume two fitness functions defined over L biallelic loci are represented as column vectors $\vec{w}$ and $\vec{x}$ with Fourier-Walsh coefficients $\vec{E}_w$ and $\vec{E}_x$. Define the sum of squares distance between $\vec{w}$ and $\vec{x}$ as $D(\vec{w}, \vec{x}) = \sum_i (w_i - x_i)^2$, where $w_i$ and $x_i$ are the ith components of $\vec{w}$ and $\vec{x}$ respectively.
Theorem 1 (Sum of squares distance equivalence): $D(\vec{w}, \vec{x}) = 2^L\, D(\vec{E}_w, \vec{E}_x)$.
Proof: By definition $\vec{w} = \Psi \vec{E}_w$ and $\vec{x} = \Psi \vec{E}_x$, where Ψ is the Hadamard matrix (see §2.2). Therefore
$$D(\vec{w}, \vec{x}) = (\vec{w} - \vec{x})^T (\vec{w} - \vec{x}) = (\vec{E}_w - \vec{E}_x)^T\, \Psi^T \Psi\, (\vec{E}_w - \vec{E}_x).$$
But recall that $\Psi^T \Psi = 2^L I$, where I is the identity matrix. Thus
$$D(\vec{w}, \vec{x}) = 2^L\, (\vec{E}_w - \vec{E}_x)^T (\vec{E}_w - \vec{E}_x) = 2^L\, D(\vec{E}_w, \vec{E}_x).$$
An interesting property of the Hadamard matrix is that $\Psi^T = 2^L \Psi^{-1}$. Without the $2^L$ this equality is the hallmark of a rotational transformation. This means that Fourier-Walsh coefficients are simply the result of a high dimensional axis rotation of the coordinates of function space, together with a uniform contraction. This provides intuition into Theorem 1: rotating the space and contracting it uniformly only changes the distance between two vectors in the space by the constant of contraction.
Theorem 2: The subsetting approximation $\vec{W}'$ that minimizes the sum of squares distance to function $\vec{W}$ is the one whose $\vec{E}'$ uses the m largest components in absolute value in $\vec{E}$.
Proof: By Theorem 1, the sum of squares distance between $\vec{W}$ and $\vec{W}'$ is $D(\vec{W}, \vec{W}') = 2^L\, D(\vec{E}, \vec{E}')$, which means that we can equivalently solve the minimization problem on either side of the equality. And trivially, the right-hand side is minimized when the m nonzero components in $\vec{E}'$ are the m largest components in absolute value in $\vec{E}$. (The squaring of differences in epistatic terms in the definition of D removes the significance of their sign.)
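The distance identity underlying this appendix, $D(\vec{w}, \vec{x}) = 2^L D(\vec{E}_w, \vec{E}_x)$, can be checked numerically. The sketch below is a sanity check rather than a proof. It assumes the Sylvester Hadamard construction and the $1/2^L$ normalization of the forward transform (the factor $2^L$ in the identity depends on that normalization), and it uses random vectors in place of real fitness data. Function names are hypothetical.

```python
import random

def hadamard(L):
    """Sylvester-type Hadamard matrix of size 2**L."""
    n = 2 ** L
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]

def epistatic_terms(W):
    """E = Psi . W / 2**L (normalization assumed)."""
    n = len(W)
    psi = hadamard(n.bit_length() - 1)
    return [sum(psi[i][j] * W[j] for j in range(n)) / n for i in range(n)]

def ss_distance(a, b):
    """Sum of squares distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

rng = random.Random(1)
L = 4
w = [rng.random() for _ in range(2 ** L)]   # random stand-ins for fitness functions
x = [rng.random() for _ in range(2 ** L)]

lhs = ss_distance(w, x)
rhs = 2 ** L * ss_distance(epistatic_terms(w), epistatic_terms(x))
assert abs(lhs - rhs) < 1e-9
```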
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.