Abstract
Modern SNP genotyping technologies make it possible to measure the relative abundance of different alleles for a given locus and, consequently, to estimate their allele dosage, opening a new road for genetic studies in autopolyploids. Despite advances in genetic linkage analysis in autotetraploids, there is a lack of statistical models to perform linkage analysis in organisms with higher ploidy levels. In this paper, we present a statistical method to estimate recombination fractions and infer linkage phases in full-sib populations of autopolyploid species with even ploidy levels in a sequence of SNP markers using hidden Markov models. Our method uses efficient two-point procedures to reduce the search space for the best linkage phase configuration and reestimates the final parameters using maximum-likelihood estimation of the Markov chain. To evaluate the method and demonstrate its properties, we rely on simulations of autotetraploid, autohexaploid and autooctaploid populations. The results show the reliability of our approach, including situations with complex linkage phase scenarios in hexaploid and octaploid populations.
Author summary
In this paper, we present a complete multilocus solution based on hidden Markov models to estimate recombination fractions and infer the linkage phase configuration in full-sib mapping populations with even ploidy levels under random chromosome segregation. We also present an efficient pairwise loci analysis to be used in cases where the multilocus analysis becomes compute-intensive.
Introduction
Polyploids are organisms with more than two sets of chromosomes. They are very important in agriculture and play a fundamental role in evolutionary processes, such as differentiation of species [1]. The number of sets of chromosomes in an organism is called the ploidy level. These multiple sets of chromosomes in a polyploid can originate from the combination of chromosomes from different but related species, or from the duplication of chromosomes from the same species [2, 3]. In the first scenario, they are called allopolyploids; in the second, autopolyploids. Another way to characterize polyploid organisms is according to their pattern of inheritance. In general, allopolyploids exhibit disomic segregation, since homologous chromosomes have more affinity than homeologous chromosomes and tend to form preferential bivalents within each sub-genome [4]. Autopolyploids, however, exhibit more than two homologous chromosomes per homology group. Thus, during meiosis, they can form either bivalents or multivalents [4, 5]. The expected segregation ratios in autopolyploids vary depending on the type of chromosome configuration that the organism presents during meiosis. If the chromosomes pair randomly, the segregation is called polysomic [6-9]. In addition, the homologous chromosomes may have preferential pairing, which can vary from complete preferential (disomic segregation) to complete random (polysomic segregation). Since the molecular mechanics of polyploid organisms are quite complex, this rigid dichotomy is often broken, and organisms can exhibit intermediate modes of inheritance [4, 10]. Throughout this paper, the term autopolyploid (or autotetraploid, autohexaploid, etc.) will refer to polyploid organisms that exhibit polysomic segregation.
Despite all advances in genetic studies in autotetraploids [11-21], there is still a shortage of statistical methods to address organisms with higher ploidy levels, such as sweet potato [22-24], sugarcane [25, 26], some ornamental flowers and forage crops (reviewed in [27]). In this work, we denote as high-level autopolyploids those autopolyploid organisms with ploidy level greater than four. A fundamental class of statistical methods that lags behind in high-level autopolyploid studies is the construction of genetic maps. Building a reliable genetic map is a crucial step in quantitative trait loci (QTL) analysis, as well as in the assembly of reference genomes and the study of evolutionary processes [28-30]. Although understanding the concept of genetic mapping is rather easy, the construction of such maps in high-level autopolyploids is challenging. Even under bivalent pairing, there is a large number of possible configurations during meiosis, and this number grows rapidly as the ploidy level increases. Denoting m as the ploidy level, it is possible to find up to m different alleles for a locus in one individual. Furthermore, if some of those alleles are not distinguishable, it is necessary to consider the number of copies of each different allelic form, also known as allele dosage. Finally, depending on the marker system used to assess the genotypic information, in the vast majority of cases it is not possible to obtain the complete information about a particular locus.
The construction of a genetic map in a full-sib population can be summarized in five basic steps: i) estimation of pairwise recombination fractions and associated LOD Scores; ii) separation of markers into linkage groups; iii) order markers within each linkage group using an optimization technique; iv) parental phasing, recombination fraction update and likelihood computation and v) if the order is optimal, the map is complete, otherwise, return to step iii. Historically, genetic maps in high-level autopolyploids have been constructed using only alleles present in one homologous chromosome, called single-dose markers [31, 32]. In a full-sib population, these markers segregate in a 1:1 ratio (if they are present only in one parent), or in a 3:1 ratio (if present in both parents). Given this level of simplification, it is possible to use the five-step procedure coupled with a standard software suitable for backcross diploid populations. Nevertheless, it is well accepted that the use of single-dose markers imposes limitations on the construction of adequate genetic maps. These approaches sub-sample the genome [19, 26], which precludes further consideration of multiallelic effects in models for QTL mapping and subsequent studies. Moreover, there is low statistical power to detect linkage when markers are in repulsion phase configurations [31, 33]. Although some authors have addressed this problem by including multiple dose markers when constructing genetic maps and performing QTL mapping [33, 34], the limitations on the genotyping technologies at the time required that the allelic dosage had to be inferred based on expected segregation rates. Because of the high amount of hidden information imposed by marker systems on those studies [31, 33], the estimation of recombination fraction between multi-dose markers was highly impaired.
Quantitative genotyping technologies for the evaluation of single nucleotide polymorphisms (SNPs) have opened the door for further genetic mapping studies in high-level autopolyploids. It is now possible to measure the abundance of specific alleles within a locus in a polyploid genome [19, 26, 36-39]. This technology, combined with the genotypic distribution in the population [37], makes it possible to infer the allelic dosage by using the ratio between the abundances of the two alternative alleles. Once the dosage of the markers is estimated, the construction of linkage maps can be significantly improved by taking this information into account. [19] and [40] presented works that take into consideration the dosage of quantitative SNP data both in linkage studies and in QTL mapping for autotetraploids.
Genetic linkage maps can be constructed based on two-point or multipoint estimates of the recombination fraction. Two-point methods use information from pairs of markers and, even though they are less computationally demanding than multipoint methods, they require more information in the markers to provide reliable results. Multipoint approaches, instead, use information from multiple markers present in a linkage group, increasing the statistical efficiency of the analysis [17, 41, 42, 53]. This feature is particularly important in polyploid linkage analysis, where markers are mostly partially informative. One widely used procedure to obtain multipoint estimates is the hidden Markov model (HMM) [41]. The construction of the genetic map using this method provides the estimates of the recombination fractions between all adjacent markers in a linkage group, as well as the multipoint likelihood, which has been shown to be an excellent criterion to evaluate and compare linkage phase configurations and orders of markers [42]. [17] presented a statistical framework in which HMMs were applied to reconstruct genetic linkage maps, but it was limited to autotetraploids. Recently, [35] constructed an ultra-dense integrated linkage map for hexaploid chrysanthemum using two-point analysis. However, there is a lack of multipoint procedures that can handle cases where less marker information is available in high ploidy levels.
The main challenges we address in this paper are the inference of the haplotypes of the multiple homologous chromosomes and the multipoint estimation of recombination fractions in high-level polyploids. Although [21] proposed a probabilistic multilocus haplotype reconstruction model for autotetraploids considering double reduction, this remains an open question for organisms with higher ploidy levels. Our method relies on an HMM and is developed for species with even ploidy levels under random chromosome segregation (complete polysomic inheritance). We also present a two-point method which is capable of dealing with hundreds of markers even in high ploidy level scenarios. Hence, we are proposing solutions for steps i and iv in high-level autopolyploids. Step ii follows directly from step i using clustering algorithms, as proposed by [50]. Even though step iii is a challenging task in genetic mapping, it can be addressed using pairwise recombination fractions or the resulting likelihood of the Markov model, as has been proposed by several studies [43-49]. To evaluate our method and to show its properties, we rely on simulations of autotetraploid, autohexaploid, and autooctaploid data. The R code to reproduce all simulations and analyses is publicly available.
Methods
In this section, we define the notation used throughout this article and present the probabilistic model for the gamete formation in autopolyploids. Then, we move to the calculation of the transition probabilities for adjacent marker loci (Eq 6) and follow to the initial state (Eq 7) and emission probability distributions (Eqs 8 and 9) which are fundamental in an HMM model. We conclude by explaining the complexity of estimating linkage phases between markers, presenting an efficient two-point algorithm that simplifies the problem in a way that allows the phasing to be inferred using real data.
Notation
Consider one homology linkage group in a mapping population derived from a cross between two autopolyploid individuals P and Q with the same ploidy level (full-sib family). The ploidy level is denoted by $m$, and can be any even number greater than zero. Let the vectors $P_k^i$ and $P_{k+1}^i$, and $Q_k^i$ and $Q_{k+1}^i$, $i = 1, \dots, m$, denote the genotype of two adjacent multiallelic loci $k$ and $k+1$ in P and Q, respectively. The superscript $i$ indicates one of the possible alleles for the loci, and each locus has $m$ different alleles in each parent. For example, for a cross between two autohexaploid individuals, $P_k = (P_k^1, P_k^2, P_k^3, P_k^4, P_k^5, P_k^6)$; similarly, this can be done for $P_{k+1}$, $Q_k$ and $Q_{k+1}$. All alleles denoted by the same superscript number are in the same homologous chromosome (e.g., $P_k^1$ and $P_{k+1}^1$ are in homologous chromosome 1, etc.).
The following assumptions are made to ensure random chromosome segregation [6, 8] and no double reduction [51]: i) there is only formation of bivalents during the meiosis; ii) there is no preferential pairing during the formation of bivalents; iii) all bivalents have the same recombination fraction between loci k and k + 1; iv) bivalents are independent and v) there is separation of sister chromatids during the meiosis II. Consequences of violations of these assumptions will be addressed later using simulations.
Bivalent formation
Bivalent formation occurs during meiosis I (more specifically, at the pachytene stage of prophase). In diploid cells, there is only one possible pairing configuration: the two duplicated homologs of a homology group pair to form one bivalent. However, in autopolyploid cells, given the previous assumptions, the number of possible pairing configurations, i.e., the number of possible bivalent chromosomal pairings for a given homology group during meiosis, is

$$w_m = \frac{m!}{2^{m/2}\,(m/2)!} \tag{1}$$
The orientation of the bivalents does not affect the expected frequencies of each gamete type, and therefore will not be considered. For example, for an autotetraploid individual, there are two bivalents and three possible bivalent configurations. Homologous chromosomes pair as 1 with 2 and 3 with 4; or 1 with 3 and 2 with 4; or 1 with 4 and 2 with 3 [52]. We denote by $\Psi = \{\psi_j\}$, $j = 1, \dots, w_m$, the set of all bivalent configurations for a given ploidy level.
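The count of bivalent configurations can be checked by brute force. The following Python sketch (not the authors' R code) enumerates every pairing of the $m$ homologs and compares the count with the closed form $m!/(2^{m/2}(m/2)!)$, the standard count of perfect matchings, which is assumed here to be the intended expression. The function names are illustrative.

```python
from math import factorial


def bivalent_configurations(chromosomes):
    """Yield every way of pairing the given homologs into bivalents."""
    chromosomes = list(chromosomes)
    if not chromosomes:
        yield ()
        return
    first, rest = chromosomes[0], chromosomes[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in bivalent_configurations(remaining):
            yield ((first, partner),) + tail


def n_bivalent_configurations(m):
    """Closed-form count, assumed to be m! / (2^(m/2) * (m/2)!)."""
    half = m // 2
    return factorial(m) // (2 ** half * factorial(half))


# Autotetraploid: 1-2/3-4, 1-3/2-4 and 1-4/2-3
tetraploid = list(bivalent_configurations(range(1, 5)))
```

For an autotetraploid, the enumeration returns the three configurations listed in the text; the same generator can be used for hexaploids and octoploids, although the number of configurations grows quickly.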
Expected gametic frequency for a given bivalent configuration
We will present the expected gametic frequencies considering parent P. Since parent Q undergoes a similar process, it is possible to combine the expected gametic frequencies to obtain the expected genotypic frequency in the full-sib population. Each of the bivalents obtained for a given configuration $\psi_j$ can result in two types of chromosomes for loci $k$ and $k+1$: parental, which result from bivalents with zero or any other even number of recombinations between $k$ and $k+1$; and recombinant, which result from bivalents with any odd number of recombinations. As presented by [34], the probabilities of all chromosome types for any single bivalent formed by homologs $i$ and $i'$ can always be represented as

$$V = \begin{pmatrix} \dfrac{1-r_k}{2} & \dfrac{r_k}{2} \\[2mm] \dfrac{r_k}{2} & \dfrac{1-r_k}{2} \end{pmatrix}$$

with rows indexed by the allele at locus $k$ ($P_k^{i}$, $P_k^{i'}$) and columns by the allele at locus $k+1$ ($P_{k+1}^{i}$, $P_{k+1}^{i'}$), where $r_k$ is the recombination fraction between $k$ and $k+1$ and $i \neq i'$. For a given configuration $\psi_j$, the expected frequencies of all possible gametes derived from that configuration are

$$V_1 \otimes V_2 \otimes \cdots \otimes V_{m/2}$$

where $\otimes$ denotes the Kronecker product of matrices and the subscripts in $V$ indicate the corresponding bivalent. All elements of this product are of the form

$$\frac{(1-r_k)^{m/2-l}\, r_k^{\,l}}{2^{m/2}}$$

where $l$ denotes the total number of recombinant bivalents between loci $k$ and $k+1$, $l \in \{0, \dots, m/2\}$. From this, we can define the probability of observing any gamete (for two loci) given a bivalent configuration $\psi_j$ as

$$\Pr(p_k, p_{k+1} \mid \psi_j) = \begin{cases} \dfrac{(1-r_k)^{m/2-l}\, r_k^{\,l}}{2^{m/2}} & \text{if } \{p_k, p_{k+1}\} \text{ is consistent with } \psi_j \\[2mm] 0 & \text{otherwise} \end{cases} \tag{2}$$

where the vectors $p_k$ and $p_{k+1}$ denote subsets of the alleles present in $P_k$ and $P_{k+1}$, respectively, and $\{p_k, p_{k+1}\}$ indicates a gamete for loci $k$ and $k+1$ from parent P. Consistent means that the gamete can be produced from bivalent configuration $\psi_j$. Notice that some gametes cannot be obtained from $\psi_j$ once the bivalents are formed.
Since we assume that alleles with the same superscript are in the same homologous chromosome, $l$ can be obtained by a simple examination of the superscripts of the elements contained in $p_k$ and $p_{k+1}$. Consider, for example, $\psi_1 = \{(1, 2), (3, 4), (5, 6)\}$ ($m = 6$, Fig 1). If the observed $p_k$ and $p_{k+1}$ are consistent with $\psi_1$ and differ in the superscripts of two chromosomes, the number of recombinant chromosomes is $l = 2$. Therefore, $\Pr(p_k, p_{k+1} \mid \psi_1) = (1-r_k)\, r_k^2 / 8$. On the other hand, $\Pr(p_k, p_{k+1} \mid \psi_1) = 0$ for any gamete that is impossible to obtain from configuration $\psi_1$, i.e., one that is not consistent with $\psi_1$.
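The consistency check and the count of recombinant bivalents can be sketched in a few lines of Python. The sketch assumes that each bivalent contributes a factor $(1-r_k)/2$ when it transmits a parental chromosome and $r_k/2$ when it transmits a recombinant one, and that a gamete is consistent with a configuration when it carries exactly one homolog of every bivalent at each locus. The gametes used below are hypothetical and are not taken from Fig 1.

```python
def gamete_prob_given_config(pk, pk1, config, r):
    """Pr(p_k, p_{k+1} | psi_j) for a gamete given as homolog indices at two loci."""
    pk, pk1 = set(pk), set(pk1)
    half = len(config)  # number of bivalents, m/2
    n_recombinant = 0
    for bivalent in config:
        at_k = pk & set(bivalent)
        at_k1 = pk1 & set(bivalent)
        # A gamete receives exactly one chromatid from each bivalent.
        if len(at_k) != 1 or len(at_k1) != 1:
            return 0.0  # gamete is not consistent with this configuration
        if at_k != at_k1:
            n_recombinant += 1
    return (1 - r) ** (half - n_recombinant) * r ** n_recombinant / 2 ** half


psi1 = ((1, 2), (3, 4), (5, 6))
# Hypothetical gamete: homologs 1, 3, 5 at locus k and 1, 4, 6 at locus k+1 (l = 2)
p_consistent = gamete_prob_given_config((1, 3, 5), (1, 4, 6), psi1, r=0.1)
# Hypothetical gamete carrying both homologs of bivalent (1, 2): impossible under psi1
p_inconsistent = gamete_prob_given_config((1, 2, 3), (1, 2, 3), psi1, r=0.1)
```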
Gametic frequency unconditional to bivalent configurations
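In reality, $\psi_j$ is unknown; thus, the conditional probability given by Eq (2) must be considered for all possible $\psi_j$. The probability of observing a gamete $\{p_k, p_{k+1}\}$, unconditional to $\psi_j$, can be expressed as

$$\Pr(p_k, p_{k+1}) = \sum_{j=1}^{w_m} \Pr(p_k, p_{k+1} \mid \psi_j)\, \Pr(\psi_j) \tag{3}$$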
It is important to notice that only a subset of Ψ is consistent with the observed gamete, and consequently Pr(pk, pk+1 | ψj) > 0 only for some ψj’s. Fig 2 shows a graphical representation of Eqs 2 and 3 for autohexaploid gametes.
The probability of observing a specific gamete is always the same for each ψj in this consistent subset (Eq 2). Therefore, under random pairing (assumption ii), our task reduces to finding the number of elements in this subset that are consistent with the observed gamete and multiplying Pr(pk, pk+1 |ψj) Pr(ψj) by this number. The result is the probability of observing a gamete unconditional to the bivalent configuration.
For every gamete, $l$ can range from zero to $m/2$ recombinant homologous chromosomes. The observed gamete is the result of homologous chromosomes that migrate to one pole of the cell at anaphase I. Since we are assuming that there is separation of sister chromatids during anaphase II, if $l = 0$ (all chromosomes are of parental type), there is no information about the pairing configuration of the homologous chromosomes that migrate to the opposite pole of the cell. In this situation, each of the $m/2$ chromosomes in the gamete may have paired with any of the $m/2$ chromosomes that migrated to the opposite pole, and the number of possible $\psi_j$ that can produce gametes with $l = 0$ is $(m/2)!$. Therefore, for $l > 0$, there are $(m/2 - l)!$ possible pairing configurations of parental chromosomes. For the remaining $l$ recombinant chromosomes, the number of possible pairing configurations is $l!$. Thus, the total number of possible pairing configurations that can produce a specific gamete is $(m/2 - l)!\, l!$. This is precisely the number of elements in the subset of $\Psi$ consistent with the observed gamete. Given the assumption of no preferential pairing during the formation of bivalents, $\Pr(\psi_j) = 1/w_m$, and the probability of a gamete $\{p_k, p_{k+1}\}$, unconditional to $\psi_j$, can be simplified to

$$\Pr(p_k, p_{k+1}) = \frac{(m/2-l)!\; l!}{w_m} \cdot \frac{(1-r_k)^{m/2-l}\, r_k^{\,l}}{2^{m/2}} = \frac{(1-r_k)^{m/2-l}\, r_k^{\,l}}{\binom{m}{m/2}\binom{m/2}{l}} \tag{4}$$
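The counting argument above can be verified numerically for small ploidy levels. The Python sketch below enumerates all bivalent configurations, counts those consistent with a gamete, and compares the resulting unconditional probability with the closed form $(1-r_k)^{m/2-l} r_k^l / [\binom{m}{m/2}\binom{m/2}{l}]$, which follows from the counts $(m/2-l)!\,l!$ and equally likely configurations. Function names and the example gamete are hypothetical.

```python
from math import comb


def pairings(items):
    """Yield all bivalent configurations (perfect matchings) of the homologs."""
    items = list(items)
    if not items:
        yield []
        return
    first = items[0]
    for i in range(1, len(items)):
        rest = items[1:i] + items[i + 1:]
        for tail in pairings(rest):
            yield [(first, items[i])] + tail


def is_consistent(pk, pk1, config):
    """True if the gamete carries exactly one homolog per bivalent at both loci."""
    return all(len(set(pk) & set(b)) == 1 and len(set(pk1) & set(b)) == 1
               for b in config)


def unconditional_prob(pk, pk1, m, r):
    """Return (number of consistent configurations, Pr(p_k, p_{k+1}))."""
    half = m // 2
    n_rec = len(set(pk1) - set(pk))  # recombinant chromosomes, l
    configs = list(pairings(range(1, m + 1)))
    n_consistent = sum(is_consistent(pk, pk1, c) for c in configs)
    per_config = (1 - r) ** (half - n_rec) * r ** n_rec / 2 ** half
    return n_consistent, n_consistent * per_config / len(configs)


def closed_form(m, n_rec, r):
    half = m // 2
    return (1 - r) ** (half - n_rec) * r ** n_rec / (comb(m, half) * comb(half, n_rec))


n_cfg, prob = unconditional_prob((1, 3, 5), (1, 4, 6), m=6, r=0.1)
```

For this autohexaploid gamete with two recombinant chromosomes, the enumeration finds $1!\,2! = 2$ consistent configurations out of 15.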
Map reconstruction via hidden Markov model
The construction of a genetic map involves the estimation of the genetic distance and order between markers within linkage groups. If the origin of the haplotypes (i.e., linkage phase) for the parents of the mapping population is unknown, it also needs to be estimated. For several years, hidden Markov models have proven to be an excellent avenue for obtaining these estimates [17, 41, 42, 53]. The multipoint likelihood obtained using HMMs can be used as a criterion to compare marker orders and judge which one is best, and also to provide a reliable estimation of recombination fractions and linkage phases. [54] defines an HMM as a generative process composed of three well-defined probability distributions: transition, initial state and emission. In the genetic mapping context, the transition probability distribution is defined as the probability of having a particular genotype at position $k+1$, given the genotype at position $k$. Using Eq (4), the gametic transition probability $\Pr(p_{k+1} \mid p_k)$, i.e., the conditional probability of a gamete genotype at locus $k+1$ given the gamete genotype at locus $k$, is simply

$$\Pr(p_{k+1} \mid p_k) = \frac{\Pr(p_k, p_{k+1})}{\Pr(p_k)}$$
Under random chromosome segregation, both $p_k$ and $p_{k+1}$ can assume different genotypes. Let $\mathcal{G}_k^P$ denote all possible genotypes that $p_k$ can assume for locus $k$. Also, assume that the genotypes in $\mathcal{G}_k^P$ are arranged according to the lexicographical order of their superscripts. For example, in an autotetraploid, $\mathcal{G}_k^P = \{P_k^1P_k^2,\; P_k^1P_k^3,\; P_k^1P_k^4,\; P_k^2P_k^3,\; P_k^2P_k^4,\; P_k^3P_k^4\}$ for locus $k$. After some simplifications (see S1 Appendix), the transition probability, i.e., the conditional probability of a gametic genotype in locus $k+1$ given the gametic genotype in locus $k$, is

$$\Pr(p_{k+1} \mid p_k) = \frac{(1-r_k)^{m/2-l}\, r_k^{\,l}}{\binom{m/2}{l}} \tag{5}$$

where $l \in \{0, \dots, m/2\}$. The initial state and the emission probability distributions will be addressed in the next section (Eqs 7 to 9).
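A minimal Python sketch of the gametic transition matrix for one parent is shown below. It assumes the transition probability $(1-r_k)^{m/2-l} r_k^l / \binom{m/2}{l}$, with $l$ taken as the number of homologs present at locus $k+1$ but not at locus $k$, and it lists the gametic states in lexicographical order of their homolog indices. The function name is illustrative.

```python
from itertools import combinations
from math import comb


def gametic_transition_matrix(m, r):
    """Return (states, matrix) of gametic transitions for one parent."""
    half = m // 2
    # combinations() emits tuples in lexicographic order for sorted input
    states = list(combinations(range(1, m + 1), half))
    matrix = []
    for pk in states:
        row = []
        for pk1 in states:
            n_rec = len(set(pk1) - set(pk))  # number of recombinant bivalents
            row.append((1 - r) ** (half - n_rec) * r ** n_rec / comb(half, n_rec))
        matrix.append(row)
    return states, matrix


states, T = gametic_transition_matrix(m=4, r=0.1)
```

Each row sums to one because $\binom{m/2}{l}^2$ states share the same $l$, so the row total is the binomial expansion of $[(1-r_k) + r_k]^{m/2}$.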
Including information of both parents
Any given individual in a full-sib population is formed by the union of gametes from both parents, P and Q. Each parent can form $\binom{m}{m/2}$ different gametes for locus $k$. Since the formation of gametes in both parents is independent, the genotypic transition probability distribution can be written as

$$\Pr(G_{k+1}^{j'} \mid G_k^{j}) = \Pr(p_{k+1} \mid p_k)\, \Pr(q_{k+1} \mid q_k) = \frac{(1-r_k)^{m-l_P-l_Q}\; r_k^{\,l_P+l_Q}}{\binom{m/2}{l_P}\binom{m/2}{l_Q}} \tag{6}$$

where $G_k^{j} = p_k \cdot q_k$ denotes the genotype of an individual derived from the union of gametes $p_k$ and $q_k$ at locus $k$. The same reasoning applies to $G_{k+1}^{j'}$, $p_{k+1}$ and $q_{k+1}$. $l_P$ and $l_Q$ denote the number of recombinant bivalents between loci $k$ and $k+1$ in parents P and Q, respectively. Let $g_m = \binom{m}{m/2}^2$ denote the number of possible genotypes derived from the cross between individuals P and Q. For simplification and without loss of generality, let $t_k(j, j') = \Pr(G_{k+1}^{j'} \mid G_k^{j})$. For a comprehensive example of the transition probabilities and the indexation used in Eq 6, see Table 8 in S3 Appendix.
Given a ploidy level $m$ and a recombination fraction $r_k$, the only information required to obtain $t_k(j, j')$ in Eq (6) is $l_P$ and $l_Q$. Since the genotypic states at both loci are arranged according to the lexicographical order of their superscripts, it is possible to obtain $(l_P, l_Q)$ for any given pair $(j, j')$ using the algorithm presented in S2 Appendix. Although the number of possible transitions between positions $k$ and $k+1$ is $(g_m)^2$, which can be a very large number even for modest ploidy levels, it is possible to obtain the transition between any specific genotypes $j$ and $j'$ without computing the entirety of the transition space.
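To illustrate how a single entry of the genotypic transition space can be obtained without building the full matrix, the Python sketch below maps a state index to a pair of parental gametes and derives $(l_P, l_Q)$ from them. Several choices here are assumptions made for illustration: indices are 0-based, states are ordered with the gamete of P varying slowest, and $(l_P, l_Q)$ are computed by set difference rather than by the algorithm of S2 Appendix, which is not reproduced here. The formula assumed is the product of the two parental gametic transition probabilities.

```python
from itertools import combinations
from math import comb


def genotypic_transition(j, j_next, m, r):
    """t_k(j, j') for 0-based state indices, without building the full matrix."""
    half = m // 2
    gametes = list(combinations(range(1, m + 1), half))  # lexicographic order
    n = len(gametes)
    # Assumed indexing: state = (gamete of P) * n + (gamete of Q)
    jp, jq = divmod(j, n)
    jp_next, jq_next = divmod(j_next, n)
    l_p = len(set(gametes[jp_next]) - set(gametes[jp]))
    l_q = len(set(gametes[jq_next]) - set(gametes[jq]))
    numerator = (1 - r) ** (m - l_p - l_q) * r ** (l_p + l_q)
    return numerator / (comb(half, l_p) * comb(half, l_q))


# Autotetraploid: 36 genotypic states, staying in state 0 requires no recombination
t_stay = genotypic_transition(0, 0, m=4, r=0.1)
```

Only the list of $\binom{m}{m/2}$ gametes is stored, which stays small compared with the $(g_m)^2$ entries of the full transition space.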
The initial state distribution is the probability of observing a specific genotype. Given the assumption that there is no preferential pairing during the formation of bivalents, a uniform probability distribution can be employed as the initial state probability function

$$\Pr(G_k^{j}) = \frac{1}{g_m}, \quad j = 1, \dots, g_m \tag{7}$$
To this point, both transition and initial state distributions consider different allelic variants for all m homologous chromosomes in both parents. This scenario can only be achieved when using fully informative markers. In reality, autopolyploid species may have the same allelic variant in some homologous chromosomes. Besides, even if all homologous have different allelic forms, modern genotyping platforms are usually capable of detecting polymorphisms at the nucleotide level (SNPs), which are essentially biallelic. Due to this lack of identity between the observed data and the full transition space, we make use of the emission function, which is defined as the probability of observing a molecular phenotype given a genotype .
The detection of the allelic variants in modern genotyping platforms is based on the abundance of different alternative nucleotides. In the autopolyploid setting, this can be translated as the dosage of a SNP at a specific locus. The dosage of a SNP can be estimated using the ratio between the abundance of its two allelic forms. Several methods were proposed to perform this task including [36], [37] and [38]. Here we introduce a biallelic derivation of the emission probability distribution. Although the function presented here use biallelic information, other distributions can be derived for partial informative multiallelic marker systems following the same reasoning.
Let denote the observed dosage of one allelic form in locus k for parents P and Q, respectively. The choice of the allelic form denoted by is arbitrary, as long as the same allelic form is used in . The dosage observed in parent P can be originated from alleles present in of the m homologous chromosomes. Let denote a set of size containing all possible subsets in that originate the observed dosage . The operator #{.} is the cardinality of a set. The same reasoning applies for . For instance, in an autotetraploid, if , the three doses present in locus k can be derived from four distinct subsets . Given two particular subsets and in and , each one of the gm genotypic states in the full transition space can be associated to a dosage. The dosage associated to the j-th state is obtained by counting the number of alleles present in the intersection between the parental allelic set and . Thus, the emission function can be defined as where and ϵ denotes the global genotype error rate. In addition to the punctual estimate of the dosage, the genotyping calling methods cited above also provide the probability distribution of the dosages for a particular marker for all individuals of the biparental population. If this information is available, a more general emission function can be derived. Instead of modeling a global error rate ϵ, we use the prior information provided by the genotyping calling procedure. Let denote the probability distribution vector associated to the dosages 0, …, m at position k for a particular individual in the biparental population. For example, denotes a tetraploid individual with probabilities and of having one, two and three doses, respectively, and zero for the remaining ones. Then, the emission probability function can be written as
In this case, the observation O can be any dosage from 0 to m and the information about the genotypes will be contained in the probability distribution of the dosages πk. Thus, the probability of observing any dosage given a genotype associated to a particular dosage δ(k, j) can be obtained by simply assessing the corresponding value in the probability distribution provided by the genotype calling procedure. Notice that Eq 8 can be reduced to Eq 9 using the appropriate πk. For example, in autotetraploids, when the observed dosage for locus k is one, . Moreover, for missing values, it is possible to use the probability distribution of the genotypic classes under polysomic segregation, as presented by [37].
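As a rough illustration of the dosage-based emission, the Python sketch below computes the dosage associated with a genotypic state by intersecting each transmitted gamete with the set of homologs that carry the allelic variant in the corresponding parent, and then reads the emission probability from the dosage distribution of the individual. All names, the example phase and the example distribution are hypothetical.

```python
def state_dosage(gamete_p, gamete_q, carriers_p, carriers_q):
    """Dosage of a genotypic state: variant-carrying homologs transmitted by P and Q."""
    return len(set(gamete_p) & set(carriers_p)) + len(set(gamete_q) & set(carriers_q))


def emission(pi_k, gamete_p, gamete_q, carriers_p, carriers_q):
    """Emission probability read from the dosage distribution pi_k (doses 0..m)."""
    return pi_k[state_dosage(gamete_p, gamete_q, carriers_p, carriers_q)]


# Hypothetical autotetraploid marker: variant on homologs 1, 2, 3 of P and homolog 1 of Q
carriers_p, carriers_q = (1, 2, 3), (1,)
# Hypothetical genotype-call distribution over dosages 0, 1, 2, 3, 4
pi_k = [0.0, 0.1, 0.8, 0.1, 0.0]
# State formed by gamete (1, 2) from P and gamete (3, 4) from Q has dosage 2
b = emission(pi_k, (1, 2), (3, 4), carriers_p, carriers_q)
```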
Multipoint likelihood and the estimation of recombination fraction
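Suppose there are $z$ markers in a homology group in a known order represented by $M_1, \dots, M_k, \dots, M_z$. Let $r = (r_1, \dots, r_k, \dots, r_{z-1})$ denote the vector of recombination fractions between all marker intervals in this sequence. Also, assume linkage phase configurations in parents P and Q denoted respectively by $\Phi_P$ and $\Phi_Q$. The sequence of observations for the $z$ markers is denoted by $(O_1, \dots, O_k, \dots, O_z)$ and its underlying probability distributions are denoted by $\Pi = (\pi_1, \dots, \pi_k, \dots, \pi_z)$. The likelihood of $M_1, \dots, M_k, \dots, M_z$ can be obtained using Eqs (6), (7) and (9) following the classical forward procedure [54]. Let $\alpha_k(j)$ denote the probability of the partial observation sequence $(O_1, \dots, O_k)$ and genotype $G_k^{j}$, the $j$-th genotypic state at locus $k$, given the sequence of recombination fractions $r$, the linkage phase configurations $\Phi_P$ and $\Phi_Q$ and the probability distributions for the sequence of observations $\Pi$. In what follows, $b_j(O_k)$ denotes the emission probability of observation $O_k$ for state $j$ (Eq 9). The forward procedure follows the steps below:
1. Initialization: $$\alpha_1(j) = \Pr(G_1^{j})\, b_j(O_1), \quad j = 1, \dots, g_m \tag{10}$$
2. Induction: $$\alpha_{k+1}(j') = \left[\sum_{j=1}^{g_m} \alpha_k(j)\, t_k(j, j')\right] b_{j'}(O_{k+1}) \tag{11}$$ where $k = 1, \dots, z-1$ and $j' = 1, \dots, g_m$
3. Termination: $$\Pr(O_1, \dots, O_z \mid r, \Phi_P, \Phi_Q, \Pi) = \sum_{j=1}^{g_m} \alpha_z(j) \tag{12}$$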
Then, the likelihood of the model is defined as

$$\mathcal{L} = \prod_{i=1}^{n} \Pr(O_{1,i}, \dots, O_{z,i} \mid r, \Phi_P, \Phi_Q, \Pi_i) \tag{13}$$

where $n$ is the number of individuals in the full-sib population, $O_{1,i}, \dots, O_{z,i}$ is the sequence of marker observations for individual $i$ and $\Pi_i$ is an $(m+1) \times z$ matrix whose $k$-th column denotes the probability distribution associated with marker $M_k$ for individual $i$. The multipoint maximum likelihood estimate of $r$ can be obtained using the forward-backward procedure coupled with the EM algorithm [54]. For the backward procedure, consider the variable $\beta_k(j)$ as the probability of the partial observation sequence from $k+1$ to $z$, given the genotype $G_k^{j}$, the recombination fraction vector $r$, the linkage phase configurations $\Phi_P$ and $\Phi_Q$ and the probability distributions for the sequence of observations $\Pi$. Writing $b_{j'}(O_{k+1})$ for the emission probability of observation $O_{k+1}$ in state $j'$, the solution to $\beta_k(j)$ was also described by [54] as follows:
1. Initialization: $$\beta_z(j) = 1, \quad j = 1, \dots, g_m \tag{14}$$
2. Induction: $$\beta_k(j) = \sum_{j'=1}^{g_m} t_k(j, j')\, b_{j'}(O_{k+1})\, \beta_{k+1}(j') \tag{15}$$ where $k = z-1, z-2, \dots, 1$ and $j = 1, \dots, g_m$
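To estimate the recombination fraction for all intervals in the marker sequence, we need to define $\xi_k(j, j')$ as the probability of state $G_k^{j}$ at position $k$ and state $G_{k+1}^{j'}$ at position $k+1$, given the sequence of observations $O_1, \dots, O_z$ and their underlying probability distributions $\Pi$, the recombination fraction vector $r$ and the linkage phase configurations $\Phi_P$ and $\Phi_Q$. With $\alpha_k(j)$ and $\beta_k(j)$ the forward and backward variables and $b_{j'}(O_{k+1})$ the emission probability,

$$\xi_k(j, j') = \frac{\alpha_k(j)\, t_k(j, j')\, b_{j'}(O_{k+1})\, \beta_{k+1}(j')}{\sum_{j=1}^{g_m} \sum_{j'=1}^{g_m} \alpha_k(j)\, t_k(j, j')\, b_{j'}(O_{k+1})\, \beta_{k+1}(j')} \tag{16}$$

The recombination fraction $r_k$ can be estimated through an iterative process using

$$r_k^{s+1} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{g_m} \sum_{j'=1}^{g_m} \xi_k^{(i)}(j, j' \mid r^{s})\, \rho(j, j') \tag{17}$$

where $\xi_k^{(i)}(j, j' \mid r^{s})$ is calculated for individual $i$; $\rho(j, j') = (l_P + l_Q)/m$ is the proportion of recombinations between markers $k$ and $k+1$ for individuals with genotypes $G_k^{j}$ and $G_{k+1}^{j'}$; $r^{s}$ is the vector of recombination fractions in iteration $s$; and $r^{s+1}$ is the updated recombination fraction vector [55].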
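The forward-backward recursions and one EM update can be sketched generically in Python. This is a toy illustration under stated assumptions, not the authors' implementation: the transition matrices, emission vectors and the recombination-proportion matrix `rho` are supplied by the caller, and the example uses a hypothetical two-state chain (one parent, $m = 2$, fully informative markers) so that the result can be checked by hand.

```python
def forward(init, trans, emis):
    """alpha[k][j]; trans[k] links loci k and k+1, emis[k][j] is b_j(O_k)."""
    n_states = len(init)
    alpha = [[init[j] * emis[0][j] for j in range(n_states)]]
    for k in range(len(trans)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * trans[k][j][j2] for j in range(n_states)) * emis[k + 1][j2]
            for j2 in range(n_states)
        ])
    return alpha


def backward(trans, emis, n_states):
    """beta[k][j], filled from the last locus back to the first."""
    beta = [[1.0] * n_states]
    for k in reversed(range(len(trans))):
        nxt = beta[0]
        beta.insert(0, [
            sum(trans[k][j][j2] * emis[k + 1][j2] * nxt[j2] for j2 in range(n_states))
            for j in range(n_states)
        ])
    return beta


def em_step(r, individuals, make_trans, init, rho):
    """One EM update of the recombination fraction vector r."""
    trans = [make_trans(rk) for rk in r]
    n_states = len(init)
    expected = [0.0] * len(r)
    for emis in individuals:
        alpha = forward(init, trans, emis)
        beta = backward(trans, emis, n_states)
        for k in range(len(r)):
            xi = [[alpha[k][j] * trans[k][j][j2] * emis[k + 1][j2] * beta[k + 1][j2]
                   for j2 in range(n_states)] for j in range(n_states)]
            total = sum(map(sum, xi))  # likelihood of this individual
            expected[k] += sum(xi[j][j2] * rho[j][j2]
                               for j in range(n_states)
                               for j2 in range(n_states)) / total
    return [e / len(individuals) for e in expected]


# Toy data: 8 non-recombinant and 2 recombinant gametes, two markers
make_trans = lambda rk: [[1 - rk, rk], [rk, 1 - rk]]
rho = [[0.0, 1.0], [1.0, 0.0]]
individuals = [[[1.0, 0.0], [1.0, 0.0]]] * 8 + [[[1.0, 0.0], [0.0, 1.0]]] * 2
r_new = em_step([0.4], individuals, make_trans, init=[0.5, 0.5], rho=rho)
```

With fully informative observations, a single update already returns the observed proportion of recombinants (2 out of 10); with partially informative markers, the step would be iterated until the estimates stabilize.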
Estimation of linkage phase
Let $\{\Phi_P\}$, the Cartesian product over the $z$ markers of the sets of allelic subsets compatible with the observed dosage at each marker, denote the set containing all possible linkage phase configurations in parent P. Also, let $\{\Phi\} = \{\Phi_P\} \times \{\Phi_Q\}$, with elements $\Phi_u$, $u = 1, \dots, \#\{\Phi\}$, denote the set containing all possible linkage phase configurations in both parents. The probability of the linkage phase configurations can be obtained using Bayes' rule

$$\Pr(\Phi_u \mid O, \Pi) = \frac{\Pr(O \mid \Phi_u, \Pi)\, \Pr(\Phi_u)}{\sum_{v} \Pr(O \mid \Phi_v, \Pi)\, \Pr(\Phi_v)} \tag{18}$$

where $O$ is an array containing the observations for $z$ markers in $n$ individuals, and $\Pi$ is the underlying probability distribution for all marker observations. Since the prior probability $\Pr(\Phi_u)$ can be assumed to be uniform, the posterior probability is proportional to the likelihood of the model, which can be used to select the best linkage phase configuration. Depending on the dosage and number of markers, some of these configurations are equivalent and will result in the same likelihood. The search space for the best linkage phase configuration can be unwieldy depending on the ploidy level, dosage and number of markers. Also, the transition space of the HMM gets larger as the ploidy level increases. To circumvent these problems, we propose a very efficient two-point procedure to reduce the search space for linkage phases.
Two-point algorithm for high-level autopolyploids
When the linkage analysis is conducted on only two markers (two-point analysis), the information contained in these markers does not propagate into the rest of the chain. Thus, based on the dosage and linkage phase configuration of the markers involved in the analysis, the gm genotypic states present in the full transition space can be collapsed into a small number of states, and a straightforward likelihood function can be derived. It is worth mentioning that the estimates obtained using the two-point procedure are the same as those obtained using the multipoint algorithm for two markers. However, the computation is much faster.
Consider a biallelic marker in an autopolyploid biparental cross with ploidy m. The number of possible genotypic states in the progeny for a given locus at position k is , where the operator and |.| denotes module. For example, in an autohexaploid biparental cross, if the dosage of the marker at position k in parent P is two and in parent Q is three , the number of possible genotypic classes expected in the progeny is six. Depending on the linkage phase configuration, each of the gm genotypic states in the full transition space corresponds to one of these expected genotypic classes, as presented in the emission function (Eqs 8 and 9). Thus, in the previous example, all the gm states could be collapsed into six different classes. To perform this reduction of dimensionality, let denote one of the possible genotypes based on the dosage of one individual in the progeny of an autopolyploid biparental cross for position k with ploidy m. The joint probability of and , for a given genotypic configuration at positions k and k′ can be written as where and δ(k, j) was defined in Eq 8; the same applies to Tk′. Since in a two-point analysis the probability distribution of the genotypic states in locus k can be assumed to be uniform, i.e., , Eq (19) can be rewritten as a sum of weighted terms from Eq (6) where h(j, j′; lP, lQ) is 1 if (j,j′) corresponds to (lP, lQ) according to the procedure described in S2 Appendix and zero otherwise. Eq 20 can be expressed in matrix form as where is a (m + 1) × (m +1) matrix. Yet, in a two-point analysis with biallelic markers, the linkage phase configuration can be summarized in an ordered pair indicating the number of homologous chromosomes that share allelic variants for loci k and k′ in parents P and Q, respectively. For a given pair , where and denote the set of homologous chromosomes inherited by parent P in positions k and k′, which can be assessed using the superscripts in and . indicates the cardinality of the set. 
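The number of genotypic classes quoted in the autohexaploid example can be reproduced by direct enumeration. The Python sketch below lists the dosages that each parent can transmit in a gamete of $m/2$ homologs and combines them; it is an enumeration written for illustration and does not use the closed-form expression of the text. The function name is hypothetical.

```python
from itertools import combinations


def progeny_dosage_classes(m, dose_p, dose_q):
    """Sorted list of dosages that can be observed in the progeny."""
    half = m // 2

    def gamete_doses(dose):
        # Which homologs carry the variant is irrelevant for the count,
        # so the first `dose` homologs are used.
        carriers = set(range(dose))
        return {len(carriers & set(g)) for g in combinations(range(m), half)}

    return sorted({a + b for a in gamete_doses(dose_p) for b in gamete_doses(dose_q)})


# Autohexaploid cross with dosage two in P and three in Q
classes = progeny_dosage_classes(m=6, dose_p=2, dose_q=3)
```

Parent P can transmit zero, one or two doses and parent Q zero to three, which gives the six progeny classes 0 to 5.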
Notice that and can assume several linkage phase configurations resulting in the same . Let denote a set containing all possible pairs for a given pair . In this set, there are min partitions, each one corresponding to a different . Fig 3 shows an example of for in an autotetraploid homology group. The size of the set is 36, and it can be subdivided into three partitions where and .
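The partition of two-point phase configurations by the number of shared homologs can be enumerated directly. The Python sketch below assumes an autotetraploid parent with dosage two at both loci, a choice that reproduces the 36 pairs and three partitions mentioned above; the dosages and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def phase_partitions(m, dose_k, dose_k2):
    """Count pairs of allelic subsets by the number of homologs they share."""
    homologs = range(1, m + 1)
    counts = Counter()
    for subset_k in combinations(homologs, dose_k):
        for subset_k2 in combinations(homologs, dose_k2):
            counts[len(set(subset_k) & set(subset_k2))] += 1
    return dict(counts)


# Assumed example: autotetraploid parent, dosage two at both loci
partitions = phase_partitions(m=4, dose_k=2, dose_k2=2)
```

Under this assumption, 6 pairs share no homolog, 24 share one and 6 share two, for a total of 36.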
In a two-point context, the likelihood function derived from any of the configurations belonging to the same partition (same ) will be the same. Thus, any of them can be used to obtain the likelihood function for a given . Let denote one of the possible pairs that correspond to . The same reasoning applies to parent Q. Without loss of generality, the two-point likelihood function of biallelic observed molecular phenotypes for markers k and k′ given and is where n is the number of individuals and T denotes transposition of a vector. In Eq (22), rk can be estimated using iterative procedures such as EM or Newton-Raphson. As in Eq (18), it is possible to list all linkage phase configurations and evaluate them based on their likelihood. Here we use the LOD Score (base-10 logarithm of the likelihood ratio) in relation to the highest likelihood. Thus, models with high likelihoods will yield LOD Scores close to zero. We also use the LOD Score to assess the evidence for linkage between the two markers using the ratio between the model under and under the null hypothesis of no linkage H0: r = 0.5, given a linkage phase configuration.
As previously shown, it is possible to enumerate all linkage phase configurations for parent P using the Cartesian product . To reduce this Cartesian space based on two-point analysis, we add a restriction where all pairs in a sequence of configurations must be contained in , where is a subset of all partitions in in which the associated LOD Score is smaller than η. Thus, a reduced subset of linkage phases in parent P based on two-point analysis can be obtained using
It is important to note that it is not necessary to represent the whole Cartesian space {ΦP} to restrict the linkage phase configurations to the condition . This procedure can be done through the sequential addition of markers from M1 to Mz. For each marker Mk′ added to the end of the chain, the ordered pair (k, k′), k′ = 2, …, z and k = k′ – 1, …, 1, is evaluated and only linkage phase configurations that meet the condition are considered.
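The order in which marker pairs are visited during the sequential addition can be sketched as below. This only enumerates the pairs (k, k′) in the order stated above; the function name is hypothetical, and the two-point evaluation itself is not shown.

```python
def pair_evaluation_order(z):
    """Yield the pairs (k, k_new) visited when markers 2..z are appended.

    For each newly added marker k_new, earlier markers are visited from
    the closest (k_new - 1) back to the first one.
    """
    for k_new in range(2, z + 1):
        for k in range(k_new - 1, 0, -1):
            yield (k, k_new)


print(list(pair_evaluation_order(4)))
# [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (1, 4)]
```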
Some of the configurations selected using the previous procedure can be equivalent, since they are products of a permutation of the same set of homologous chromosomes. In order to remove this redundancy, let each one of the selected configurations be represented as a binary matrix of dimensions (m × k′) such as where u ∈ {1, …, U}, U is the number of selected linkage phase configurations, and k′ indicates that Mk′ was the last marker inserted in the chain. The rows of matrix represent the homologous chromosomes for the u-th linkage phase configuration with the insertion of the k′-th marker at the end of the chain; 1 denotes the presence of an allelic variant, and 0 denotes its absence. If a matrix Hk′ could be obtained from a matrix just by permuting the rows (permuting the order of the homologous chromosomes), these two linkage phase configurations yield the same likelihood. Thus, one of the configurations should be excluded from consideration. The same reasoning applies to parent Q. This procedure can be done recursively until all redundancy is eliminated. The reduced linkage phase configuration search space considering both parents is obtained using Ф(η) = ФP(η) × ФQ(η), such that #{Ф(η)} ≪ #{Ф}, combined with the redundancy elimination for homology groups. This sequential procedure results in a set of linkage phase configurations containing markers up to Mk′, which are evaluated using the HMM likelihood. A LOD Score threshold in relation to the most likely configuration is assumed to determine which configurations should be taken into consideration in the next round of marker inclusion (Fig. 4).
Finally, with all markers inserted, the multipoint likelihood of the whole map is used to find the best configuration among the remaining ones, and the recombination fractions are reestimated. To demonstrate the mechanics of the two-point analysis coupled with the multipoint procedure, a simple example is presented in S3 Appendix. All the methods and procedures described here are available in the software MAPPoly, which can be accessed at https://github.com/mmollina/mappoly.
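The redundancy elimination for one parent can be sketched as follows. Two binary matrices are equivalent when one is a row permutation of the other, so sorting the rows gives a canonical form that can serve as a lookup key. This is a minimal illustration with hypothetical function names, not necessarily the procedure implemented in the software.

```python
def canonical_form(config):
    """Row-order-independent key for a binary phase matrix.

    config is a sequence of rows, one per homologous chromosome; each row
    holds 0/1 entries, one per marker already inserted in the chain.
    """
    return tuple(sorted(tuple(row) for row in config))


def drop_row_permutations(configs):
    """Keep the first representative of each row-permutation class."""
    seen = set()
    kept = []
    for config in configs:
        key = canonical_form(config)
        if key not in seen:
            seen.add(key)
            kept.append(config)
    return kept


# Autotetraploid parent (m = 4), two markers: the second matrix is the
# first one with homologs 1 and 2 swapped, so it is redundant.
a = [(1, 0), (0, 1), (0, 0), (0, 0)]
b = [(0, 1), (1, 0), (0, 0), (0, 0)]
c = [(1, 1), (0, 0), (0, 0), (0, 0)]
print(len(drop_row_permutations([a, b, c])))  # 2
```

Sorting the rows avoids comparing every pair of matrices under all m! permutations, since each configuration is reduced to its key once.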
Simulations
Simulation 1 - local performance under random bivalent pairing
The aim of this simulation study was to evaluate the local performance of the algorithm considering three ploidy levels (m = 4, m = 6 and m = 8) under the mapping model assumptions (i.e., random pairing and bivalent formation). To be in accordance with molecular data that have been made available through sequencing technologies, we simulated biallelic markers that can be observed in terms of dosage in parents and progeny. Three different linkage phase scenarios were simulated. In scenario A, for each marker, if the dosage was greater than zero, one of the allelic variants was assigned to the first homologous chromosome in the homology group and the remaining variants of the same type were assigned to the subsequent homologous chromosomes. In scenario B, the allelic variant was randomly assigned to one of the first homologous chromosomes and the remaining variants were assigned to the subsequent homologous chromosomes. In scenario C, the allelic variants were randomly assigned to the m homologous chromosomes. Thus, an increasing difficulty in detecting recombination events is expected from scenario A, where the allelic variants are concentrated in the same homologous chromosomes, to scenario C, where they are randomly distributed. Consequently, the phasing and recombination fraction estimation become more challenging from scenario A to scenario C. In real situations, scenarios A and B could occur locally due to lack of recombination between homologous chromosomes since their polyploid formation, whereas scenario C represents regions with higher recombination rates.
For each combination of ploidy level and linkage phase scenario, we simulated five different parental haplotypes. In total, 45 parental configurations were considered (3 × 3 × 5, S4 Figure). For autotetraploid and autohexaploid configurations, we simulated 1000 full-sib populations. For autooctaploids, this number was reduced to 200 due to the high demand of computer processing required to reconstruct such maps. Each population comprised 200 individuals with one linkage group containing 10 markers positioned at a fixed distance of 1 cM between them. For each combination, the percentage of correctly estimated linkage phase configurations in each parent was recorded. Also, for the cases where the linkage phases were correctly estimated, we calculated the average Euclidean distance between the distances of the estimated and simulated maps using where is the vector of distances for an estimated map, d is the vector of distances for the simulated map, z is the number of markers and T indicates vector transposition. For example, a value of 1 cM indicates that the maps differ by 1 cM on average from each other [42]. We used the sequential two-point procedure to reduce the search space assuming that linkage phase configurations with associated LOD < 3.0 should be investigated using HMM multipoint strategies (η = 3). For the remaining configurations evaluated using HMM, we kept those with LOD < 10.0 to be evaluated in the next round of marker insertion. Notice that, although the likelihood obtained for each map could be used as a criterion to evaluate the order of the markers, this was not considered in this simulation due to the computationally demanding nature of the multiple simulations combined with high ploidy levels, especially m = 8.
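The allele assignment in scenarios A and C can be sketched for a single marker as below. This is an illustration only, with hypothetical function names and an arbitrary seed; scenario B is omitted because its description leaves the exact assignment rule open.

```python
import random


def scenario_a(m, dosage):
    """Scenario A: variants packed on the first `dosage` homologs."""
    return [1] * dosage + [0] * (m - dosage)


def scenario_c(m, dosage, rng):
    """Scenario C: variants placed on `dosage` homologs chosen at random."""
    carriers = set(rng.sample(range(m), dosage))
    return [1 if h in carriers else 0 for h in range(m)]


rng = random.Random(2024)        # arbitrary seed, for reproducibility only
print(scenario_a(6, 2))          # [1, 1, 0, 0, 0, 0]
print(scenario_c(6, 2, rng))     # two 1s at randomly chosen positions
```

Repeating the draw independently for each marker spreads the variants across homologs in scenario C, whereas scenario A keeps them stacked on the same chromosomes.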
Simulation 2 - chromosome-wise performance under preferential pairing and multivalent formation
In this simulation study, we evaluated the performance of the algorithm in dense maps, allowing for multivalent formation and preferential pairing. We used scenario C from the previous study as a template to simulate five tetraploid and five hexaploid parental haplotypic configurations, each one comprising 200 equally spaced markers with a final length of 100.0 cM (S5 Figure). For each parental configuration, we simulated 200 full-sib populations of 200 offspring considering a combination of three levels of preferential pairing (0.00, 0.25 and 0.50) and three levels of cross-like quadrivalent formation proportion (0.00, 0.25 and 0.50). No hexavalents were simulated in this study. For autohexaploids, the multivalent configurations were always composed of a cross-like quadrivalent plus a bivalent. The centromere was positioned at 20.0 cM from the beginning of the chromosome (subtelocentric centromere with arms ratio 1:4) to study the effect of double reduction at the distal end of both chromosome arms. All simulations were conducted using the software PedigreeSim [56]. In addition to the statistics recorded in Simulation 1, we computed the rate of double reduction observed in each marker for all constructed maps using the “founderalleles” file provided by PedigreeSim. We also evaluated two values for the LOD Score threshold associated with the two-point analysis (η = 3 and η = 5). We used a multipoint LOD Score threshold of 10.0. The R scripts to perform the simulations presented here can be accessed at https://go.ncsu.edu/mappoly-support-info.
Simulation results
Simulation 1
Table 1 shows the percentage of data sets where the linkage phase configuration was correctly estimated in both parents P and Q. In scenario A, the method was capable of recovering the correct linkage phase configuration in all situations for all ploidy levels. In scenarios B and C, there was a slight decrease in the ability to correctly estimate the linkage phase configuration, especially for m = 6 and m = 8. Although in these cases the percentage of correctly estimated linkage phases was lower, it remained high, ranging from 88.8% to 100%. This indicates a very good performance in estimating the linkage phase configurations, even when using the two-point procedure to narrow the search space.
Fig 5 shows the distributions of the average Euclidean distances between the estimated and simulated distance vectors for the correctly estimated linkage phase configurations. In all cases, the majority of the recombination fractions were consistently estimated, since the medians of all distributions are very close to 0.5, with no practical problems in terms of map construction. These results show that, apart from a relatively small percentage of entangled linkage phase configurations, the method successfully performed the phasing and managed to estimate the recombination fractions of 10 markers in all situations evaluated.
Simulation 2
The proportion of correctly estimated linkage phase configurations for the dense chromosome-wise map is shown in Table 2. In general, results for tetraploid maps were superior to those for hexaploid maps. It is also possible to observe a better performance for the threshold level η = 5 in comparison to η = 3. Similarly to Simulation 1, maps resulting from configurations with no preferential pairing or quadrivalent formation showed a high proportion of correctly estimated linkage phase configurations. Results ranged from 99% to 100% for tetraploid maps and from 84% to 100% for hexaploid maps. Different levels of quadrivalent formation rate had no substantial influence on estimating the correct linkage phase configurations in tetraploids. Within the preferential pairing level 0.0, the percentage of maps with correct linkage phases varied from 90% to 100%. For hexaploids, there was a decrease in this percentage as the quadrivalent formation rate increased from 0.0 to 0.50, with proportions varying from 100% to 70.5%. Especially for autohexaploids, there was a considerable variation between the five simulated configurations. This occurred because the effect of the quadrivalent formation can be more pronounced depending on the level of information contained in a particular configuration. Also, the use of a higher two-point threshold, η = 5, improved the performance of the phasing algorithm.
Within the preferential pairing level 0.25, results showed a decay in the proportion of correctly estimated linkage phases, which was more pronounced for hexaploid cases with threshold level η = 3, reaching a minimum value of 52.5% for parent Q in configuration 1. Again, the use of a higher two-point threshold level, η = 5, helped to improve this number to 68.5%. For preferential pairing level 0.50, there was a clear distinction between the results in tetraploid and hexaploid cases. In the former, the effect was not as pronounced as it was in the latter, where, in several cases, the proportion of correctly estimated linkage phases was close to zero. As expected, the usage of a higher threshold level of η = 5 helped to improve the number of correctly estimated linkage phase configurations. Interestingly, for both cases with preferential pairing (0.25 and 0.50), the formation of quadrivalents had an overall tendency to improve the algorithm’s performance. This improvement was expected because when a quadrivalent is formed, each chromosome involved can exchange segments with two others, providing more information regarding their phase configuration.
Given a correctly estimated linkage phase, the recombination fractions were consistently estimated for all levels of preferential pairing with no quadrivalent formation. However, they were overestimated in the presence of quadrivalent formation. This effect was mainly observed at the terminal regions of the chromosome, especially in the long arm, where double reduction is more pronounced (Fig. 6). In this case, tetraploid maps were the most affected. This is in agreement with our expectations since in autohexaploid simulations, there was always the formation of a bivalent which was not involved in the double reduction process (although the rates of double reduction were very similar in both ploidy levels, Fig. 6). In addition to the quadrivalent, the bivalent serves as an extra source of information to assess the recombination events.
The average Euclidean distances reflect the overestimation of recombination fractions in cases with quadrivalent formation, showing distributions with higher medians and interquartile ranges in tetraploid cases when compared to hexaploids (S6 Figure). Nevertheless, all the Euclidean distances distributions were located relatively close to zero, with a maximum value of 1.41 cM, indicating that although we observed overestimated recombination fractions towards the terminal ends of the chromosome, they were equally distributed, causing no severe disturbances in the final map. S7 Figure shows an example of the effect of increasing quadrivalent formation rate in autotetraploid and autohexaploid maps. As the markers get further away from the centromere, the recombination fractions become overestimated.
Discussion
Although the concept of linkage mapping is relatively simple, the combinatorial properties and increasingly missing information that arise from the multiple sets of chromosomes make the construction of genetic maps in high-level autopolyploids extremely challenging. In this work, we frame and solve two fundamental steps towards the construction of such maps, namely multipoint recombination fraction estimation and linkage phase estimation. Our method can be applied to biallelic codominant markers and, due to the flexibility of the HMM framework upon which it was derived, it can be extended to any type of molecular marker. The HMM used in this work takes into account the linkage phase configuration of the whole linkage group to estimate the recombination fractions between adjacent markers. An efficient two-point approach was also presented to reduce the search space of linkage phase configurations. As a result, our method provides the likelihood of the model, which can be used as an objective function to compare different map configurations, including linkage phases and marker order. When considering experimental populations, our method is a generalization, for any even ploidy level, of well-established genetic linkage mapping methods. For diploid (m = 2) populations derived from biparental crosses, our method is equivalent to the influential Lander and Green algorithm [41]; considering full-sib phase-unknown crosses, it is equivalent to [57]. For tetraploids (m = 4), the method is equivalent to [17], disregarding double reduction. Thus, it encapsulates the essence of the HMM-based genetic mapping methods in a single one.
To assess the statistical power of our method, we conducted two simulation studies. Simulation 1 comprised three ploidy levels and three linkage phase configuration scenarios with ten markers. We demonstrated that our model was capable of correctly estimating the majority of parental linkage phase configurations and recombination fractions, even for complex linkage phase configurations and high ploidy levels. These well-assembled regions could function as multiallelic codominant markers which propagate their information through the HMM to the rest of the chain, improving the quality of the final map. In Simulation 2, we analyzed a sequence of 200 markers in combinations of different levels of preferential pairing and rates of quadrivalent formation. In this situation, quadrivalent formation rate had a marginal effect on the phasing procedure, whereas preferential pairing reduced its performance, especially for autohexaploids. The usage of a higher two-point threshold (η) improved the linkage phase estimation in all cases. This fact indicates that the haplotype phasing is more accurate when the HMM-based likelihood is used as the objective function to evaluate linkage phases. We also observed that quadrivalent formation yielded overestimated recombination fractions between adjacent markers located further away from the centromere. This behavior was expected since our model disregards double reduction and, consequently, was not able to correctly estimate the number of crossing over events when this phenomenon was present. Although our model is robust enough to cope with low levels of preferential pairing and quadrivalent formation rate, it is possible to include both phenomena in specific points of its derivation. Preferential pairing can be included in Eq 4 by not considering Pr(ψj) as uniformly distributed. Double reduction can be included in the definition of the genotypic states in the full transition space (Eq 5).
These two phenomena add extra layers of complexity to the genetic mapping of polyploid organisms with high ploidy levels and should be addressed in future studies.
The difficulty in correctly estimating entangled linkage phase configurations lies in two major aspects of the experiments studied here: (i) the outbred nature of the experimental crosses and (ii) the incomplete information of the markers based on dosage (i.e., by not being multiallelic). In experimental populations derived from inbred lines, the origin of the haplotypes can be easily inferred from the genetic design. However, obtaining pure inbred lines in high-level autopolyploids has proven to be impractical due to the high number of crosses and generations necessary to achieve homozygous genotypes and to the inbreeding depression which some species undergo [61]. In our method, the linkage phase configuration is obtained by comparing the likelihood of a set of models with different linkage phase configurations (Eq 18). The capability of estimating the correct configuration is directly related to the information contained in the marker data. Some of these limitations can be overcome through the use of HMMs which take into account the information of a whole linkage group.
HMMs provide an excellent avenue to assemble genetic maps in complex scenarios, but they are remarkably computationally demanding and, in some cases, infeasible to use. Apart from parallel computing, which can greatly speed up the estimation process and is ubiquitous nowadays, the usage of two-point approaches is a viable option to reduce the dimension of the original problem efficiently. The dimension reduction is achieved by collapsing genotypic states in the full transition space according to the marker information. However, in several cases, the two-point based method can result in low statistical power, which is related to the amount of information contained in markers in certain combinations of allelic dosage and linkage phase configurations. This lack of information is exacerbated as markers get distant from each other. Fig 7 shows eight possible configurations of pairs of markers in one autohexaploid parent. Considering the other parent non-informative, we computed the Fisher’s information equations based on the likelihood Eq (22) [15, 33, 62]. The equations were plotted as a function of the recombination fraction. The information profiles are related to the number of different haplotypes present in the parental configuration for a given marker dosage. For instance, for two single-dose markers (Fig 7, panel I), when the alleles share the same homologous chromosome (wk = 1), it is always possible to detect if the gamete contains at least one recombinant chromosome. However, when the alleles are in different homologous chromosomes (wk = 0), the detection of recombination events is limited to meiotic configurations containing a bivalent where these chromosomes paired with each other. Additionally, the model proposed here considers both parents in the analysis, leading to more complicated linkage phase configurations and information equations.
The multipoint procedure improves the power to detect genetic linkage since the information on the markers depends not only on the observed molecular phenotype for the locus in question but also on the accumulated information along the Markov chain. Fig 7(I) shows that maps using only single-dose markers are limited to the detection of markers whose allelic variants are on the same homologous chromosome (wk = 1). Thus, the homologous chromosomes are treated as separate entities, instead of belonging to a homology group, and it is not possible to assemble haplotypes in the parents considering all homologous chromosomes (i.e., linkage phase estimation). Due to the lack of appropriate statistical methods, the use of diploid approximations considering single-dose markers has been the method of choice to build genetic maps in high-level autopolyploids. In our experience with the construction of genetic maps in sugarcane [63-66], it is possible to anticipate a great gain of quality in those maps when using the new method proposed in this work. We also expect the same improvement for other high-level autopolyploid species.
The intrinsic lack of information in biallelic markers can be circumvented using multiple markers clustered in linkage disequilibrium (LD) blocks to assemble multiallelic marker data. Two different approaches can be used: the first one relies on the usage of high throughput molecular data and subsequent estimation of pairwise recombination fraction between the markers. In this case, due to the density of the data, closely linked markers are expected, and the Fisher’s information for the two-point maximum likelihood estimator is high (Fig 7). Thus, the determination of linkage phase configurations between markers in small blocks can be successfully achieved by using two-point methods (for a detailed example, see S3 Appendix). Once these LD blocks are well assembled, including the correct linkage phase configuration of both parents, they can be regarded as multiallelic markers. Simulation 1 showed that using two-point procedures coupled with the multipoint analysis is a trustworthy way to assemble haplotypes with closely linked markers. Another approach relies on a priori information about markers belonging to the same genomic region where recombination events can be neglected. This information can be obtained using any reference such as genomic or transcriptomic information. In this case, the recombination fraction can be assumed to be r = 0 for any pair of markers belonging to the LD block and the linkage phase configuration can be obtained using a trivial Markovian process, with transition probabilities tk(j, j′) = 1, ∀ j = j′ and tk(j, j′) = 0 otherwise. Therefore, the biallelic information contained in SNP markers can be combined to assemble haplotypes which will represent alleles allocated in different homologous chromosomes.
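The trivial Markovian process mentioned above for markers inside a block with r = 0 can be sketched as follows. With tk(j, j′) = 1 for j = j′ and 0 otherwise, the transition matrix is the identity, so a probability vector over genotypic states is carried unchanged from one marker to the next. The function names and the example vector are hypothetical.

```python
def block_transition(n_states):
    """Transition matrix for r = 0: identity (state never changes)."""
    return [[1.0 if j == jp else 0.0 for jp in range(n_states)]
            for j in range(n_states)]


def propagate(state_probs, transition):
    """One forward step: new[jp] = sum_j state_probs[j] * t(j, jp)."""
    n = len(state_probs)
    return [sum(state_probs[j] * transition[j][jp] for j in range(n))
            for jp in range(n)]


probs = [0.5, 0.25, 0.25, 0.0]   # hypothetical distribution over 4 states
print(propagate(probs, block_transition(4)) == probs)  # True
```

Since the state cannot change within the block, the emission evidence from every marker in the block accumulates on the same hidden state, which is what allows the block to behave like a single multiallelic marker.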
The multipoint method proposed herein relies on biallelic marker information. However, the emission function (Eq 9) can be modified to incorporate multiallelic observations. When using multiallelic markers, the number of states that should be visited in the Markov model can be significantly reduced, making the HMM procedure much more efficient. Ideally, in a full-sib population, the number of different alleles should be as high as two times the ploidy level (fully informative). In this case, the Markov model would be fully observed, and the task of estimating the recombination fraction reduces to counting the number of recombinant events given a linkage phase configuration. Since our algorithm does not need the entire transition space to work, only a subset of states should be visited, making the calculation much faster when compared to the biallelic case.
It is worthwhile to mention that, in this paper, we do not address step iii mentioned in the Introduction, namely, the ordering of genetic markers. The genetic mapping literature has an extensive body of methods to address the problem of ordering markers. Several works evaluated some of these methods [42, 67, 68] and others were proposed since then [47-49]. A fundamental lesson learned from these works is that, in complex linkage phase configurations with partially informative markers, methods based on multipoint likelihood provide better results when compared with two-point based methods. However, the multipoint procedures are highly compute-intensive. In the case of high-level autopolyploids, while it is important to rely on the multipoint estimates to recover the lack of information in the biallelic markers, it is also fundamental that the method is fast enough to cope with hundreds of markers per linkage group. One possible solution to these problems is to use two-point information to build marker blocks with a small number of SNPs in high linkage disequilibrium using some clustering process. The linkage phase within these blocks can be estimated using a combination of two-point and HMM procedures. Then, these marker blocks can be used as multiallelic markers to reduce the number of states that need to be visited in the HMM. The more informative the assembled marker blocks are, the faster the reconstruction of the map using the HMM. Moreover, in several situations, genomic and transcriptomic references are available and often provide, at least, the local physical order of SNPs. Thus, instead of using two-point information to cluster the SNPs into marker blocks, they can be assembled using genomic or transcriptomic references.
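For the fully informative case described above, the counting estimate can be sketched as follows. The sketch assumes that, for every gamete, the parental homolog of origin is known at two adjacent loci for each of its m/2 chromosomes, and that chromosomes are listed in the same order at both loci. The function name and the data are hypothetical.

```python
def recombination_fraction_by_counting(origins_k, origins_k1):
    """Proportion of chromosomes whose homolog of origin changes.

    origins_k and origins_k1 hold one tuple per gamete; each tuple lists
    the homolog labels of the gamete's chromosomes at locus k and at
    locus k + 1, in the same chromosome order.
    """
    total = 0
    recombinant = 0
    for at_k, at_k1 in zip(origins_k, origins_k1):
        for h_k, h_k1 in zip(at_k, at_k1):
            total += 1
            if h_k != h_k1:
                recombinant += 1
    return recombinant / total


# Four autotetraploid gametes (two chromosomes each), homologs "a".."d".
locus_k = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
locus_k1 = [("a", "b"), ("a", "d"), ("b", "d"), ("c", "d")]
print(recombination_fraction_by_counting(locus_k, locus_k1))  # 0.125
```

One of the eight observed chromosomes changes its homolog of origin between the two loci, so the estimate is 1/8.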
While this paper provides fundamental steps towards the construction of complete genetic maps in high-level autopolyploids using both multipoint and two-point procedures, the practical aspects and implications will be addressed in future studies.
Once the map is assembled, it is a trivial exercise to obtain the probability of a specific genotype at any map position, conditioned on the whole linkage group. Using this information, it is possible to compute the probability of any unobserved genotype given the genetic map. These conditional probabilities are the basis for answering a series of fundamental questions about quantitative trait loci analysis in high-level autopolyploids, such as the effect of the dosage level on the variation of quantitative traits, the interaction of the alleles within (dominance effects) and between loci (epistatic effects). Therefore, the present study will provide a sound basis for the next step of genetic studies in high-level autopolyploids, trying to unveil the complex structure of autopolyploid genomes through genetic mapping and genome assembling, and even for studying the genetic architecture of quantitative traits based on QTL mapping.
Supporting information
S1 Appendix. Algebraic simplifications for transition probabilities.
S2 Appendix. Algorithm for obtaining lP and lQ given two genotypic indices.
S3 Appendix. Example of usage of the two-point and multipoint procedures. In order to show the mechanics of the map reconstruction using the combination of two-point and multipoint strategies, we present a simple full-sib autotetraploid mapping population example. This example is easily extendable to higher ploidy levels, since it does not involve matrix forms whose high dimensions would preclude the operations.
S4 Figure. Haplotypes for simulation study 1. Simulated haplotypes with 10 markers and three ploidy levels, namely autotetraploid (m = 4), autohexaploid (m = 6) and autooctaploid (m = 8).
S5 Figure. Haplotypes for simulation study 2. Simulated haplotypes with 200 markers and two ploidy levels, namely autotetraploid (m = 4) and autohexaploid (m = 6).
S6 Figure. Boxplots of the average Euclidean distances between the estimated and simulated distance vectors for simulation study 2
S7 Figure. Examples of autotetraploid and autohexaploid maps estimated from datasets with three quadrivalent formation rates: 0.00, 0.25 and 0.50
Acknowledgments
The authors wish to thank Dr. Guilherme da Silva Pereira and Dr. Zhao-Bang Zeng for their invaluable suggestions for elaboration of the manuscript.
Footnotes
* mmollin{at}ncsu.edu (MM), augusto.garcia{at}usp.br (AAFG)