TY - JOUR
T1 - Generalization of the Ewens sampling formula to arbitrary fitness landscapes
JF - bioRxiv
DO - 10.1101/065011
SP - 065011
AU - Pavel Khromov
AU - Constantin D. Malliaris
AU - Alexandre V. Morozov
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/07/21/065011.abstract
N2 - In considering evolution of transcribed regions, regulatory modules, and other genomic loci of interest, we are often faced with a situation in which the number of allelic states greatly exceeds the population size. In this limit, the population eventually adopts a steady state characterized by mutation-selection-drift balance. Although new alleles continue to be explored through mutation, the statistics of the population, and in particular the probabilities of seeing specific allelic configurations in samples taken from a population, do not change with time. In the absence of selection, probabilities of allelic configurations are given by the Ewens sampling formula, widely used in population genetics to detect deviations from neutrality. Here we develop an extension of this formula to arbitrary, possibly epistatic, fitness landscapes. Although our approach is general, we focus on the class of landscapes in which alleles are grouped into two, three, or several fitness states. This class of landscapes yields sampling probabilities that are computationally more tractable, and can form a basis for the inference of selection signatures from sequence data. We demonstrate that, for a sizeable range of mutation rates and selection coefficients, the steady-state allelic diversity is not neutral. Therefore, it may be used to infer selection coefficients, as well as other key evolutionary parameters, using high-throughput sequencing of evolving populations to collect data on locus polymorphisms.
We also carry out numerical investigation of various approximations involved in deriving our sampling formulas, such as the infinite allele limit and the “full connectivity” assumption in which each allele can mutate into any other allele. We find that our theory remains sufficiently accurate even if these assumptions are relaxed. Thus, our framework establishes a theoretical foundation for inferring selection signatures from samples of sequences produced by evolution on epistatic fitness landscapes.
ER -