Abstract
Studies of human mitochondrial (mt) genome variation can be undertaken to investigate human history and natural selection. By analyzing nucleotide co-occurrence over the entire human mt-genome, we developed a network model to describe human evolutionary patterns. Using 7,424 unique polymorphic sites, we found evidence that mutation biases at the second codon position and in RNA genes were critical to producing continental-level heterogeneity among human subpopulations. Further, the analysis highlighted regions of the mt-genome that were richer in co-mutations and thus provided evidence of epistasis. Specifically, a large portion of COX genes co-mutate in Asian and American populations whereas, in African, European and Oceanic populations, there was greater epistasis in hypervariable regions. Notably, this study demonstrated hierarchical modularity as a crucial agent in the make-up of nucleotide co-occurrence networks. Moreover, our ancestry-based nucleotide module analyses showed that nucleotide co-changes cluster preferentially in known mitochondrial haplogroups. We also observed that contemporary human mt-genome nucleotides most closely resembled the ancestral state and that very few of them were ancestral variants. Overall, these results demonstrated that subpopulation-based factors such as intra-species evolution do exert selection on mitochondrial genes by favoring specific epistatic genetic variants.
Background
Genetic polymorphism varies among species and within genomes and has important implications for the evolution and conservation of species. Polymorphism in the mitochondrial (mt) genome is routinely used to trace ancient human migration routes and to obtain absolute dates for genetic prehistory [1]. The human mt-genome is very small (16.6 kb), is maternally inherited, evolves in both neutral and adaptive fashion, and shows a great deal of variation as a result of divergent evolution. The absence of recombination within the mt-genome provides distinct polymorphic loci which have been used to define human genealogy by defining mt-genome haplogroups [1]. These haplogroups are formed as a result of the sequential accumulation of mutations through maternal lineages. Since mitochondria are essential to cellular metabolism, mt-genome variation has been associated with multiple complex diseases including Alzheimer’s disease in haplogroup U [2], idiopathic Parkinson's disease within the JT haplogroup [3] and age-related macular degeneration with the JTU haplogroup cluster [4]. Due to population migration, distinct lineages of the mt-genome are associated with major global groups (African, American, European, Asian and Oceanic), raising the possibility that mt-genome variation could contribute to the differences in disease prevalence observed among both ethnic and racial groups [5].
Conventionally, analyses of mt-genome evolution have focused on individual mutations, particularly in describing haplogroups, to understand and predict ancestral behavior. These approaches effectively identify single mutations. However, the evolutionary behavior of the mt-genome often involves cooperative changes within and between genes, which are difficult to detect using haplogroup analysis. For example, correlated mt-genome mutations were reported among different oxidative phosphorylation subunits and were found to affect population-specific human longevity [6]. In addition, cooperative activities of both mitochondrial proteins and tRNA genes are critical for mt-genome evolution. The importance of co-mutational interactions has been well documented in the genomics field [7, 8]. Increasing evidence suggests that interactions among polymorphic sites may confer a cumulative association of multiple mutations with many diseases [8]. Interactions among polymorphic sites have also been used effectively to infer ancestry and functional convergence in human populations using mt-genome co-mutations [9]. Commonly used methods include tree ensembles, functional nodal mutations and single nucleotide polymorphism (SNP) based enrichment [10]. Important information about mt-genome evolutionary behavior, which is contained in the correlated changes between nucleotide positions both within genes and between genes, is not captured by these techniques. Despite strong evidence that mt-genome variation plays a role in the development and progression of complex human diseases, mitochondrial genetic variation has been largely ignored in the context of co-mutations, and particularly the mechanisms by which these co-mutations occur [11, 12]. Investigation of joint polymorphism effects can, therefore, improve the explanatory ability of genetics in two ways. Firstly, the interaction between two informative genomic positions can explain a part of the trait heritability. Secondly, finding significant statistical links between mutations could provide strong indications of molecular-level interactions that differ between populations [13].
Complex network science revolves around the hypothesis that the behavior of complex systems can be elucidated in terms of structural and functional relationships between their constituents by employing a graph representation [14, 15, 16, 17]. The basis of the current study is that genome positions can impact each other and co-occur within genomes [18, 19]. The interaction between two or more genetic loci is referred to here as the co-occurrence of nucleotide positions. Herein, nucleotide positions are network nodes, and if they co-occur with a given co-occurrence frequency, they form an edge. The co-occurrence of a pair of nucleotide positions is the ratio between the frequency of the nucleotide pair co-occurring and the frequencies of the two nucleotides occurring separately in a set of genomes. Previous studies have used genomic co-occurrences as a basis for studying the evolution of human H3N2 and Ebola viruses [19, 20]. These viral genome models identified co-occurring nucleotide clusters that apparently underpin the dynamics of virus evolution, since these clusters were antigenic regions of the viral capsid proteins [19, 20]. In another study, Shinde et al. [18] demonstrated the impact of codon position bias on the formation of nucleotide co-occurrences using the human mt-genome. These studies considered perfect nucleotide co-occurrence as the causal factor for co-mutations. However, the role of the co-occurrence frequency in these studies remains unclear. Here, we thoroughly examined a set of networks associated with a range of co-occurrence frequencies and chose a particular co-occurrence frequency for further network construction. Whilst pair-wise nucleotide co-occurrences can be perceived straightforwardly, the identification of larger functional units is not straightforward. Here, we used community detection algorithms to enumerate lists of modules formed within networks and described the functional relationships among nucleotide positions forming these modules.
We set out to develop a comprehensive approach to understand mitochondrial diversity using mitochondrial co-mutations. To this end, we conducted a comparative analysis of 24,167 sequenced mitochondrial genomes. The paper is organized as follows. In the first two sections, we describe the level of diversity observed among the underlying subpopulations with respect to polymorphic site variations in human mitochondrial genomes. In the third section, we provide a simple framework to investigate co-mutations, which are critical in underlying complex mitochondrial evolution. To this end, we constructed nucleotide co-occurrence networks, used them to identify modules of co-mutations and compared them with corresponding random networks. We found both similarities and crucial differences among the networks under consideration. In the fourth section, we identify local topological phenomena which were crucial agents in the make-up of genomic networks. We listed modules comprising co-mutations and demonstrated that the identified modules indeed correspond to ancestry-based associations. Overall, by revealing the importance of co-mutational biases among different human subpopulations, our analysis identified local preferences which were key agents in forming mt-genome epistatic interactions.
1. Methods and Material
1.1. Acquisition of genomic data
We prepared an extensive collection of mitochondrial genomes of geographically diverse Homo sapiens populations (Fig. 1) from the Human Mitochondrial Database (Hmtdb) [21]. All downloaded genome sequences were in FASTA format. In total, the dataset comprised 24,167 mitochondrial genome sequences from five world continents, including 3,426 African (AF), 2,650 American (AM), 8,483 Asian (AS), 8,060 European (EU) and 1,548 Oceanic (OC) genomes. Antarctica was excluded from the present analysis since no data were available. These continental sets were termed genome groups. It should be noted that these genome groups are multiethnic cohorts representing the range of admixed populations across each continent. A brief description of all the genomes and their origin is provided in S1 File.
1.2. Construction of nucleotide co-occurrence networks
Nucleotide co-occurrence calculations were carried out on each genome group independently. For each of the five genome datasets, we constructed primary nucleotide co-occurrence networks in which nodes represent genome positions and edges between nodes represent co-occurring genomic mutations. A total of M primary nucleotide co-occurrence networks were constructed for each genome group, where M is the number of genomes in the group. Subsequently, we used these primary networks to construct five final nucleotide co-occurrence networks, one for each genome group. The methodology for constructing primary and final nucleotide co-occurrence networks is schematically represented in Fig 1B and is given as follows:
1.2.1. Primary nucleotide co-occurrence network
The steps for constructing primary nucleotide co-occurrence networks were as follows. (1) Since the current study was based on the analysis of specific nucleotide positions in the mt-genome, we considered genome sequence data that were already aligned end to end. (2) All non-variable genome positions within the samples of a genome group were removed. We were thus left with only the polymorphic genome positions. The count of polymorphic sites (NP) is given in Table 1. (3) Using only polymorphic nucleotide positions, we calculated the frequency of occurrence of all the nucleotide pairs, $f(x_i y_j) = N(x_i y_j)/M$, where $N(x_i y_j)$ denotes the number of co-occurring pairs $(x_i y_j)$ at positions (i, j) and M denotes the total number of samples in a genome group. Likewise, we calculated the frequency of individual occurrence of single nucleotides, $f(x_i) = N(x_i)/M$ and $f(y_j) = N(y_j)/M$, where $N(x_i)$ and $N(y_j)$ denote the number of single nucleotides at their respective positions i and j [19]. (4) The co-occurrence of two nucleotides (CF) at positions (i, j) was then defined as the ratio between the frequency of the pair co-occurring, $f(x_i y_j)$, and the frequencies of the two nucleotides occurring separately, $f(x_i)$ and $f(y_j)$.
For a particular co-occurrence frequency threshold, termed here the network efficiency score (α; described in Section 1.3), we constructed primary nucleotide co-occurrence networks. A network can be represented mathematically by an adjacency matrix (A) with binary entries, where $A_{ij} = 1$ if nodes i and j are connected and $A_{ij} = 0$ otherwise. As each genome sequence carries its own information on co-occurring genome positions, a total of M primary networks were generated for each genome group.
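The construction of one primary network can be illustrated with a minimal sketch. The exact CF expression is not reproduced in the text above, so the sketch assumes one form consistent with the verbal description and bounded between 0 and 1, namely $f(x_i y_j)^2 / (f(x_i) f(y_j))$; this form, the function names and the toy sequences are all illustrative assumptions rather than the study's implementation.

```python
from itertools import combinations


def cooccurrence_score(seqs, i, j, x, y):
    """Assumed CF form: f(xy)^2 / (f(x) * f(y)); equals 1 for perfect co-occurrence."""
    m = len(seqs)
    f_xy = sum(1 for s in seqs if s[i] == x and s[j] == y) / m
    f_x = sum(1 for s in seqs if s[i] == x) / m
    f_y = sum(1 for s in seqs if s[j] == y) / m
    if f_x == 0 or f_y == 0:
        return 0.0
    return (f_xy * f_xy) / (f_x * f_y)


def polymorphic_positions(seqs):
    """Positions at which the aligned sequences show more than one nucleotide."""
    return [p for p in range(len(seqs[0])) if len({s[p] for s in seqs}) > 1]


def primary_network(seqs, sample, alpha):
    """Edges (i, j) for one genome: position pairs whose nucleotides
    in that genome reach the threshold alpha."""
    sites = polymorphic_positions(seqs)
    genome = seqs[sample]
    return {
        (i, j)
        for i, j in combinations(sites, 2)
        if cooccurrence_score(seqs, i, j, genome[i], genome[j]) >= alpha
    }


# Hypothetical aligned toy sequences (not real mt-genome data).
toy = ["ACGT", "ACGA", "TTGA", "TTGT"]
edges = primary_network(toy, sample=0, alpha=1.0)
```

In this toy alignment, positions 0 and 1 always change together, so they form the only edge at α = 1, while position 2 is invariant and is dropped before any pair is scored.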
1.2.2. Final nucleotide co-occurrence network
Each of the M primary nucleotide co-occurrence networks contains information on statistically significant edges. This simplest form of information was used to construct the final nucleotide co-occurrence networks. Specifically, we listed the unique edges from all M networks of a genome group, which together make up a final nucleotide co-occurrence network. The five final nucleotide co-occurrence networks were then used for network analysis and community detection.
1.3. Selection of network efficiency score (α)
The network efficiency score (α) was used to filter the group of edges required for network construction. Selecting an α value for each network requires scanning the CF values among all pairs of polymorphic positions. Considering CF values at a precision of at least $10^{-4}$ would require the construction of $24{,}167 \times 10^{4}$ networks in total, which would be computationally very intensive. Therefore, we performed statistical sampling on each genome group independently by analyzing m samples drawn from each population of M genomes. The sample size was determined by Cochran’s sample size formula [22] with critical value z = 1.96. As the population was finite, the sample size was corrected by Cochran’s adjustment [22].
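As an illustration of the sampling step, Cochran's formula with the finite-population correction can be sketched as follows. The critical value z = 1.96 and the genome group sizes are taken from the text; the assumed proportion p = 0.5 and margin of error e = 0.05 are conventional defaults chosen here for illustration only, since the values used in the study are not stated.

```python
import math


def cochran_sample_size(population, z=1.96, p=0.5, e=0.05):
    """Cochran's sample size with the finite-population adjustment.

    p (assumed proportion) and e (margin of error) are illustrative
    defaults, not values reported in the study.
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)     # finite-population correction
    return math.ceil(n)


# Genome group sizes M as reported in Section 1.1.
group_sizes = {"AF": 3426, "AM": 2650, "AS": 8483, "EU": 8060, "OC": 1548}
sample_sizes = {g: cochran_sample_size(m) for g, m in group_sizes.items()}
```

Under these assumptions the corrected sample size m is always smaller than the uncorrected value and grows slowly with M, which is what makes scanning many α values tractable.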
1.4. Analysis of nucleotide co-occurrence networks
The degree of a node ($k_i$) is defined as the number of edges connected to the node, $k_i = \sum_j A_{ij}$. The clustering coefficient (C) is a measure of the extent to which nodes in a network tend to cluster together. The average clustering coefficient of a network can be written as $\langle C \rangle = \frac{1}{N} \sum_{i=1}^{N} C_i$, where $C_i$ is the clustering coefficient of node i. Another property of the network which turned out to be crucial in distinguishing the individual networks was the assortativity coefficient (r), which measures the tendency of nodes with similar numbers of edges to connect. The assortativity coefficient, r, was defined as the Pearson correlation coefficient of degree between pairs of linked nodes [23]. A value of r equal to zero corresponds to a random network, whereas negative (positive) values correspond to disassortative (assortative) networks. For a stringent comparison, the exact numbers of nodes and edges of the real-world nucleotide co-occurrence networks were used to construct randomized networks. This comparison allowed us to estimate the probability that a randomized network with certain constraints belongs to a particular architecture, and thus to assess the relative importance of different architectures. The randomized networks of size N and average degree ⟨k⟩ were constructed using the Erdős-Rényi random network model [24] by connecting each pair of nodes with a probability (p) chosen so that the expected average degree matched ⟨k⟩.
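The quantities defined above can be computed with a short standard-library sketch. The function names and the toy star graph are illustrative assumptions; the connection probability p = ⟨k⟩/(N − 1) is one common choice that matches the expected average degree and is assumed here, not taken from the text.

```python
import random
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets for the given nodes and edge list."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean of local clustering coefficients; nodes with degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def degree_assortativity(adj):
    """Pearson correlation of the degrees at the two ends of each edge."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:  # each edge is visited in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return float("nan")  # undefined when all degrees are equal
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    return cov / var


def erdos_renyi(n, p, seed=0):
    """Random graph in which each node pair is linked with probability p."""
    rng = random.Random(seed)
    edges = [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]
    return build_adjacency(range(n), edges)


# Toy star graph: one hub linked to four leaves (maximally disassortative).
star = build_adjacency(range(5), [(0, 1), (0, 2), (0, 3), (0, 4)])
mean_k = sum(len(nb) for nb in star.values()) / len(star)
random_net = erdos_renyi(len(star), mean_k / (len(star) - 1))
```

The star graph is a convenient check because every edge joins the highest-degree node to a degree-one node, so r reaches its minimum while the clustering coefficient is zero.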
1.5. Detection of module structures in the network
In recent years, numerous approaches have been proposed to determine the modular organization of complex networks, among which Girvan and Newman simplified the graph-partitioning problem by introducing the concept of modularity. Modularity is the most widely used quantity for measuring the quality of a network partition, P. In its original definition [25], an unweighted and undirected network that has been partitioned into communities has modularity $Q = \frac{1}{2m} \sum_{C \in P} \sum_{i,j \in C} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]$, where m is the total number of edges and $k_i$ is the degree of node i. The indices i and j run over the N nodes of the network whereas C runs over the modules of the partition, P. Modularity counts the number of edges between all combinations of nodes belonging to the same module and relates it to the expected number of such edges for an equivalent random graph. Therefore, modularity assesses how well a given partition incorporates the edges within the modules. We used the Louvain algorithm for community detection in our networks [26]. The Louvain method is a simple, efficient and easy-to-implement method for identifying communities in large networks. A Python implementation of the Louvain algorithm was used to enumerate module structures.
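To make the modularity definition concrete, the following minimal sketch evaluates Q for a given partition using the equivalent per-module form, $Q = \sum_C [L_C/m - (d_C/2m)^2]$, where $L_C$ is the number of edges inside module C and $d_C$ is the total degree of its nodes. It only scores a partition and does not implement the Louvain optimization itself; the function name and toy graph are illustrative.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity of a partition of an unweighted,
    undirected graph given as an edge list."""
    m = len(edges)
    internal = {}    # edges with both ends in the same module
    total_deg = {}   # sum of node degrees per module
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        total_deg[cu] = total_deg.get(cu, 0) + 1
        total_deg[cv] = total_deg.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in total_deg.items()
    )


# Toy graph: two triangles joined by a single bridging edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {n: "a" for n in range(6)}
q_split = modularity(toy_edges, two_modules)
q_single = modularity(toy_edges, one_module)
```

Placing each triangle in its own module gives a positive Q, whereas assigning all nodes to a single module gives Q = 0, which is the contrast a community detection algorithm exploits.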
2. Results
2.1. Characterization of polymorphic sites
The starting point of our analysis was 24,167 sequenced mitochondrial genomes. We screened common and independent polymorphic sites present among the five genome groups. A total of 18,824 polymorphic sites were screened from all the genome groups, resulting in 7,424 unique polymorphic nucleotide positions out of a 16,929 bp genome size (43%, as also reported by [18]). The number of polymorphic sites (NP) for each genome group is summarised in Table 1 and Fig 2A. The proportion of independent polymorphic sites found in each population was 13.9% (AS), 7.3% (EU), 5.4% (AF), 4.1% (AM) and 1.4% (OC). Alignment of each genome against the Reconstructed Sapiens Reference Sequence (RSRS) showed that the most and the least diverged sequences had 73 and 22 polymorphic sites, respectively (Fig 2B). The mean number of polymorphic sites for each group was approximately 50%; thus the extent of genomic diversity in each genome group was similar (Fig 2B). We also assessed the contribution of each genome to the count of polymorphic sites in its genome group by removing a single genome at a time from the group and calculating the number of polymorphic sites given by the remaining genomes (Fig 2C). The minimum and maximum contributions observed were one and ten unique polymorphic sites per genome, respectively. More than 99% of polymorphic sites in each genome group were contributed by individual genomes having 1-4 unique polymorphic sites (Fig 2C), suggesting that a genome group with a higher number of genomes would yield more polymorphic sites. However, the trend was not straightforward (Table 1 and Fig 2C). When we calculated the ratio of the number of genomes in a genome group to the number of polymorphic sites yielded, AM had the largest ratio (1:1.35), followed by AF (1:1.08), OC (1:1.01), AS (1:0.64) and EU (1:0.57). Overall, sequence-level divergence at polymorphic sites was well conserved across genome groups as well as independently maintained within genome groups.
2.2. Classifying mt-genome Diversity
We evaluated the diversity in the genetic regions of the mt-genome: 13 protein-coding genes, 22 tRNA and two rRNA genes, four loci in the non-coding region and a few other non-coding positions dispersed throughout the protein-coding and non-coding regions. To quantitatively assess this broad diversity, we examined the observed polymorphisms among genes against gene size and the possible substitutions that can arise at each codon position (Table 2).
2.2.1. Variations in Protein-Coding Genes
Considering only substitutions, 5,465 of 11,387 nucleotide positions (48%) located in protein-coding regions (double counting a few polymorphic sites detected in overlapping regions of two genes) were variable. To understand the role of each codon position (CP), we broke the list of polymorphic sites down into the individual CPs and compared these to the total number of possible changes at each position. We found that 60% of polymorphic sites were located at CP 3, 26% at CP 1 and 14% at CP 2. There was a good correlation between the observed and the maximum number of possible changes at CP 1 in each of the 13 protein-coding genes (r² = 0.7063; Fig 4). COX1 and ND4 had the lowest proportion of observed polymorphic sites compared to all possible ones, and ATP6 had the largest proportion (Table 2). At CP 2 there was not only a smaller number of polymorphic sites but also a weak correlation (r² = 0.3692) between the number of observed and possible changes. At this position, COX1, ND4 and ND5 had the smallest proportion of polymorphic sites, and both ATP synthase genes, ATP6 and ATP8, had the largest proportion (Table 2). Interestingly, there was a very strong correlation (r² = 0.9986) between the observed and the maximum number of polymorphic sites at CP 3, and all genes showed high diversity at CP 3 (Table 2). Overall, both ATP synthase genes, ATP6 and ATP8, demonstrated high mutability at all three CPs (Table 2), which corresponds to the highest gene diversity among all protein-coding genes.
Out of 5,465 polymorphic sites, 4,670 (85%) had two alleles, 741 (14%) had three alleles, and 54 (1%) had four alleles (Fig 2D). Moreover, when we screened polymorphic sites with two alleles, it turned out that 95% of mutations were transitions, giving a transversion:transition ratio of 1:17.2 and indicating that transition mutations are far more likely to occur than transversions. The dominance of transition substitutions in the evolution of the animal mt-genome (not just the human mt-genome) has long been appreciated [27].
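The distinction between transitions and transversions at biallelic sites can be made explicit with a small sketch. A transition exchanges two purines (A, G) or two pyrimidines (C, T); every other substitution is a transversion. The function names and the toy allele pairs are illustrative and are not data from this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a substitution between two different nucleotides."""
    if a == b:
        raise ValueError("alleles must differ")
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def count_substitutions(allele_pairs):
    """Count transitions and transversions over biallelic sites."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in allele_pairs:
        counts[substitution_type(a, b)] += 1
    return counts


# Hypothetical biallelic sites given as (allele 1, allele 2).
toy_sites = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C")]
counts = count_substitutions(toy_sites)
```

The ratio of the two counts gives the transversion:transition ratio of the kind quoted above.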
2.2.2. Variations in the control region, RNA genes and noncoding sites inside the coding region
In total, 26.8% of polymorphic sites were observed in the non-coding region (including tRNA), dispersed throughout the 5,539 bp region. Out of 1,502 polymorphic sites in non-coding DNA, 1,161 positions had two alleles, 280 positions had three alleles, and 61 positions had four alleles (Fig 2D). The ratio of observed transversions to transitions was 1:8.2, a much smaller excess of transitions than in protein-coding genes. When comparing observed and possible changes at non-coding positions in each of the seven non-coding regions, a good correlation (r² = 0.8296; Fig 4F) was obtained, with HVS-I having the highest proportion of observed polymorphic sites. For each non-coding region, the proportion of polymorphic sites was 26% in 16S rRNA; 27% in 12S rRNA; 44% in the D-loop; 47% in non-coding sites inside the coding region; 59% in HVS-II; 67% in HVS-III; and 68% in HVS-I. Furthermore, out of 490 polymorphic positions among all tRNAs, 434 positions had two alleles, 48 positions had three alleles, and only eight positions had four alleles (Fig 2D). Similar to the non-coding region, the ratio of transversions to transitions was 1:8.64. Given that the sizes of the various tRNAs are quite similar, varying only from 61 to 84 bp, there were some interesting differences in variability between the tRNA genes. The genes for the tRNAs Met, Tyr, Glu, Leu and Asn showed the fewest polymorphisms (fewer than 15), whereas all the other tRNAs showed between 16 and 33 polymorphisms, except tRNA-Thr (threonine), which had 49 polymorphisms (Fig 4E). As a result, there was essentially no correlation between the number of observed polymorphisms and tRNA gene size (r² = 0.05).
2.3. Evolution of mitochondrial co-mutations
Beyond simply measuring the co-occurrence of these polymorphic sites, we evaluated the degree of correlation among pairs of polymorphic sites within the context of genomic associations. The selection of a co-occurrence frequency threshold (α) was a critical task. An α value of zero would connect each mutation with all others, whereas α equal to one would retain only those pairs of mutations which co-occurred perfectly in a genome group. In other words, α = 0 would result in a globally connected network (Fig 3B), and α = 1 would result in networks with many globally connected small sub-graphs (Fig 3E). Even when α was as high as 0.999, networks remained very densely connected (Fig 3C). Therefore, it was reasonable to propose a criterion for selecting an α value; otherwise the generated networks would be saturated structures holding no information about nucleotide co-occurrences. To tackle this, we plotted ⟨k⟩ and NLCC against all the α values. We observed a surprising network phenomenon where, at a particular α value, ⟨k⟩ was small whilst NLCC was large. At this point, networks were sparser than at previous α values (Fig 3D). By a sparse network, we mean that the majority of the elements of the adjacency matrix are zeroes. Beyond this α value, the network broke into several disconnected components.
With this notion, we chose a particular α value for each genome group and constructed primary nucleotide co-occurrence networks. Although the α value applied to each genome group was very high (close to 1), this value was sufficient to capture more than 50% of the polymorphic sites in each genome group (except in AS; S4 Table and S2 Fig). We consider that this α shortlisted evolutionarily important interactions among polymorphic sites, as it preserved statistically important network edges. Further, it should be noted that the α values differed between genome groups (Table 1). This observation is intuitive, since the number of polymorphic sites was different and therefore the preference for nucleotide co-occurrences should be different in each genome group.
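The α scan described above can be sketched as follows. The sketch assumes that NLCC denotes the number of nodes in the largest connected component and that ⟨k⟩ is averaged over nodes retaining at least one edge; both readings, along with the function name and the toy CF values, are assumptions made for illustration.

```python
from collections import deque


def threshold_stats(cf, alpha):
    """Return (mean degree, largest-component size) for the network that
    keeps only pairs with CF >= alpha. cf maps (i, j) -> CF value."""
    adj = {}
    for (i, j), score in cf.items():
        if score >= alpha:
            adj.setdefault(i, set()).add(j)
            adj.setdefault(j, set()).add(i)
    if not adj:
        return 0.0, 0
    mean_k = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
    seen, largest = set(), 0
    for start in adj:  # breadth-first search over each component
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)
    return mean_k, largest


# Hypothetical CF values for four polymorphic positions.
toy_cf = {(1, 2): 1.0, (2, 3): 0.9995, (3, 4): 0.9, (1, 3): 0.5}
scan = {a: threshold_stats(toy_cf, a) for a in (0.0, 0.999, 1.0)}
```

Plotting the two returned quantities against α is one way to locate the value at which the network is sparse but still largely connected, just before it fragments.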
2.3.1. Co-mutations displayed intra- and inter-genomic loci adaptation
Analysis of pairs of co-mutations provided an essential understanding of the relationship between two independent genome locations. Co-mutations can form within a particular mitochondrial functional region (intra-loci) or between two functional regions (inter-loci). We compared the co-occurrence configuration among nine mt-genome functional regions. The number of polymorphic sites was normalized by the total count of co-occurring polymorphic sites in a genome group, and this information was stored in the co-occurrence configuration matrix and used to construct Circos plots (Fig 5). The nine mt-genome functional regions, comprising four oxidative phosphorylation (OXPHOS) complexes, two RNA regions and three non-coding regions, displayed different preferences to co-mutate with other functional regions. In particular, OXPHOS complexes I and IV and the HVS functional regions made a large contribution to the overall co-mutation configuration in each network. This was not surprising, as these functional regions have large genomic lengths. Further, to examine how each functional region contributed to forming co-mutations, we plotted the count of co-mutations in each functional region against the corresponding functional region size for intra- and inter-loci (Fig 5). Co-mutations among functional regions were evenly distributed for both intra- and inter-loci in AM and AS. However, intra-loci co-mutations were more evenly distributed than inter-loci co-mutations. Interestingly, a few functional regions fell outside the 95% confidence intervals for both intra- and inter-loci (Fig 5). For intra-loci, rRNA was an outlier in all populations, HVS in AF and OC, and COX in AM and EU. For inter-loci, HVS was an outlier in AF, EU and OC, whereas COX was an outlier in AM and AS. Apart from these, ATP and miscellaneous regions were outliers in AM, tRNA in AS and rRNA in OC. These statistical outlier regions may have a pronounced evolutionary role in the respective populations.
To explore this further, we studied how these groups were separated from each other. We first calculated Frobenius distances among each pair of five co-occurrence configuration matrices and then performed hierarchical clustering of the calculated pairwise Frobenius distances. A dendrogram clearly showed the separation of five genome groups into two main branches i.e. {AM, AS} and {AF, EU, and OC} (Fig 6).
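The distance underlying this clustering can be sketched in a few lines. The Frobenius distance between two configuration matrices is the square root of the sum of squared element-wise differences. The 2 × 2 matrices and group labels below are hypothetical placeholders (the study's matrices span nine functional regions), and the linkage method used for the dendrogram is not stated in the text, so only the pairwise distances are computed.

```python
import math
from itertools import combinations


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def pairwise_distances(matrices):
    """Distances for every unordered pair of labelled matrices."""
    return {
        (g, h): frobenius_distance(matrices[g], matrices[h])
        for g, h in combinations(sorted(matrices), 2)
    }


# Hypothetical configuration matrices for three placeholder groups.
toy_matrices = {
    "X": [[0.5, 0.5], [0.0, 0.0]],
    "Y": [[0.5, 0.4], [0.1, 0.0]],
    "Z": [[0.0, 0.0], [0.5, 0.5]],
}
distances = pairwise_distances(toy_matrices)
closest_pair = min(distances, key=distances.get)
```

The resulting distance table is the input that a hierarchical clustering routine would merge step by step, starting from the closest pair.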
To investigate global-level co-occurrence preferences between functional regions, we analyzed unique co-mutations from all the genome groups. Fewer co-mutation pairs were formed within loci (intra-loci) than between loci (inter-loci). This relationship between co-mutations and spatial proximity would appear natural in the mt-genome, since all 13 protein-coding genes form heavily interacting OXPHOS complexes [28, 29]. However, OXPHOS complex I (ND) genes, which make up 38% of the total mt-genome, participated in 31% of inter-loci co-mutations but only 13% of intra-loci co-mutations. Both the D-loop and all three hypervariable regions displayed a tendency to co-mutate with almost all other mt-genome loci (Fig 5). The rRNA genes make up 15% of the total mt-genome but participated in only 9% of co-mutating sites. All 22 tRNA genes, which make up 9% of the total mt-genome, participated in 10% of co-mutating sites. Overall, the distribution of co-mutations among mt-genome functional regions showed that the formation of co-mutations was driven mainly by local adaptive forces within each group. There were also preferences among functional regions to form inter- or intra-loci co-mutations.
2.3.2. Co-occurrence networks exhibited similar network properties
Pair-wise nucleotide co-occurrences were not sufficient to fully reveal the underlying structure of functionally related nucleotide positions. As described in Fig 1, a nucleotide co-occurrence network was constructed for each genome group, where polymorphic sites forming co-mutations constituted the nodes and the edges represented co-occurring nucleotide positions. The numbers of nodes and edges forming the final nucleotide co-occurrence networks were different for each genome group (Table 1). This observation was intuitive, since the formation of co-mutations is determined by the number of polymorphic sites and the significance of their relationships within each group. To get an overview of the network-level organization of the genomic interactions, the topological properties of the nucleotide co-occurrence networks were analyzed. It should be noted that the selected networks were sparse, which means they had a very small average degree (Table 1). Furthermore, we investigated two essential network topological properties to understand local interactions in individual networks. All the networks exhibited high average clustering coefficient, ⟨C⟩, values (Table 1), suggesting that the neighborhoods of nodes in these networks are densely connected. All five networks also displayed a highly negative value of the degree-degree correlation coefficient (r) (Table 1). The negative r value suggested that the networks were disassortative, where high-degree nodes, on average, tend to attach to low-degree nodes [23]. Many paths between nodes in these networks depended on high-degree node(s). Many biological and social networks display negative r values, suggesting that the failure of a high-degree node in a disassortative network has more impact on the connectedness of the network [23, 15, 30].
Overall, nucleotide co-occurrence networks showed both high clustering and a disassortative nature. This observation suggests the presence of dense subgraphs within the networks and the presence of hierarchical structures. To explore the local interaction patterns in nucleotide co-occurrence networks further, we investigated module structures within these networks.
2.4. High cohesiveness of nucleotide co-occurrence communities
This work is a first attempt to uncover the hierarchical organization of nucleotide co-occurrence networks. The major challenge in identifying modules in a hierarchical organization is deciding the depth to which the network should be decomposed, as the Louvain algorithm can fragment networks, and subsequently modules, until it finds the best partition. In order to avoid large numbers of very small modules (size 2), the size of the second largest connected component was used to decipher submodules within each hierarchy of parent modules. The size of the second largest connected component was 11, 8, 9, 6 and 12 for the AF, AM, AS, EU and OC genome groups, respectively. We calculated the modularity coefficient (Q) for the five final nucleotide co-occurrence networks and also for the corresponding random networks (Table 1). The Q value was clearly reduced in the randomized networks relative to the original data, indicating that our results on real nucleotide co-occurrence networks were not trivially reproduced in random networks. A high Q value arises if networks are modular in nature. There were 557, 571, 552, 622 and 227 modules obtained for the AF, AM, AS, EU and OC genome groups, respectively. The full list of modules is provided in S2 File. In these networks, small modules (size less than 20) were predominant alongside one or two large modules, i.e. AF (size of 119), AM (270), AS (217 and 216), EU (294) and OC (104) (S4 Fig). Interestingly, the large modules comprised only polymorphic sites from non-coding regions (except in OC). As with co-mutations, we also noted that the polymorphic sites in each module could be from any of the mt-genome loci. For example, in the OC population, module 59 had polymorphic sites only from the COX1 gene, whereas module 3 had polymorphic sites that were all from different genes (S2 File). We noted that protein-coding functional regions had a predominant role in the formation of modules (S5 Table and S6 Table). In particular, ND and COX participated in >65% and >40% of modules in each of the five networks, respectively. Additionally, we observed 391 modules, out of a total of 2,529 modules, in which all polymorphic sites were from a single functional group. Such mono-functional region modules were also dominated by the ND and COX functional regions, which accounted for 70% and 14% of all mono-functional region modules, respectively (S6 Table).
2.5. Modules of co-occurring polymorphic sites indicated ancestral relationships
To investigate whether the modules identified from the analysis of the network structure were evolutionarily related, we compared the polymorphic sites in the individual modules with the ancestral alleles from RSRS. If a non-RSRS allele was present in more than 1% of samples in a genome group, we termed it an ancestral-variant allele. Here, we followed the conventional definition of a SNP to define an ancestral-variant allele. We then assigned ancestral-variant information to all of the network modules and noted three distinct types of modules (shown schematically in S5 Fig). In the first and most common type (more than 90% of total modules), all polymorphic sites were closely related to ancestral alleles (Table 3), and we termed these ancestral allele modules. All the polymorphic sites in these ancestral allele modules had ancestral alleles (or non-RSRS alleles present in < 1% of samples). Ancestral alleles have been reported to be common throughout the human mt-genome tree [31] and were also observed in large numbers in our genome group data (Table 3). In the second type of module, all the polymorphic sites were ancestral-variant alleles. We termed these ancestral-variant modules, and they were of particular interest to us. Ancestral-variant modules were the least frequent of the three types, both in module count and in the number of polymorphic sites they contained (Table 3). In the third type of module, the polymorphic sites were a mixture of ancestral and ancestral-variant alleles, and we termed these mixed modules. The polymorphic sites in these modules were hypothesised to have diverged recently. Mixed modules included the large modules; therefore, even though their count was lower, the mixed modules still contained a higher number of nodes (Table 3). The full lists of these modules were mapped to all known haplogroups, which showed that each polymorphic site contributed to one or more haplogroups.
Although this observation was intuitive, since every new polymorphism in the mt-genome has been successfully characterized in defining haplogroups [31], the mapping provided the valuable information that an entire module structure can be related to a single mt-genome haplogroup. For a complete list of modules and the corresponding haplogroups for each polymorphic site in a module, see S2 File. Further, we examined the relationship among modules corresponding to ancestral haplogroup lineage markers (or top-level haplogroups). To this end, we characterized the entire list of modules according to whether their polymorphic sites were related to ancestral lineage markers. Information on ancestral lineage markers was taken from the Mitomap database, and the polymorphic sites in each module were mapped to these markers. Of the 350 ancestral lineage markers in total, most were present in the American population, followed by the European, African, Asian and Oceanic populations (Table 3). These ancestral lineage markers were also observed to form entire module structures, and a total of 38 such module structures were obtained (Table 3; File S1). Of these 38 modules, in which all nodes were ancestral lineage polymorphic sites, 23 were ancestral-variant modules, 13 were ancestral modules, and two were mixed modules. Since all polymorphic sites in these 38 modules were ancestral lineage markers, it is reasonable to say that not only sub-level haplogroup markers but also top-level haplogroup markers tend to be associated with each other.
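The three module types described above follow from a simple rule, sketched below under stated assumptions. Each polymorphic site is assumed to carry the fraction of samples in a genome group that have a non-RSRS allele, and the 1% cutoff is taken from the text. The function name `classify_module`, the module names, the site positions and the frequencies are all hypothetical and serve only as an illustration.

```python
def classify_module(sites, non_rsrs_freq, threshold=0.01):
    """Label a module as 'ancestral', 'ancestral-variant' or 'mixed'.

    sites          -- polymorphic sites (nodes) belonging to the module
    non_rsrs_freq  -- dict: site -> fraction of samples in the genome
                      group that carry a non-RSRS allele at that site
    threshold      -- a site counts as ancestral-variant when its
                      non-RSRS frequency exceeds this value (1% in the text)
    """
    if not sites:
        raise ValueError("a module must contain at least one site")
    is_variant = [non_rsrs_freq[s] > threshold for s in sites]
    if all(is_variant):
        return "ancestral-variant"
    if not any(is_variant):
        return "ancestral"
    return "mixed"


# Hypothetical positions and frequencies, for illustration only.
example_freq = {101: 0.002, 202: 0.000, 303: 0.150, 404: 0.070, 505: 0.009}
example_modules = {
    "module_1": [101, 202, 505],  # every site at or below 1%
    "module_2": [303, 404],       # every site above 1%
    "module_3": [101, 303],       # one of each
}
example_labels = {name: classify_module(sites, example_freq)
                  for name, sites in example_modules.items()}
print(example_labels)
```

Counting the labels per genome group gives a module-type tally of the kind summarized in Table 3. How sites with a frequency of exactly 1% are treated is an assumption of this sketch.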
3. Discussion
We used comparative genome analysis to investigate 24,167 mt-genomes and devised a network model comprising pairs of co-occurring nucleotides over the length of the human mt-genome. The method presented here can provide a new perspective on epistatic mutation as well as serving as a comparative tool for intraspecies evolution. Our study showed the presence of heterogeneity in both epistatic mutations and functional modules across investigated genome groups.
3.1. Polymorphism among mt-genome loci
Mitochondrial DNA is one of the most conserved genomes, as highlighted by the observation that the maximum divergence from RSRS was only 73 bp (∼0.44% of the 16.6 kb mt-genome) and that a large proportion of mt-genome polymorphic sites possessed as few as two alleles. However, our study also found that as many as 43% of mt-genome positions had mutated at least once. These observations strongly suggest that although individual mt-genomes diverge very little, a large number of genome positions have been used to provide intra-species separation. Furthermore, comparing the observed polymorphisms with gene size clearly showed two features that are essential for providing maximum functional diversity with a minimum of genomic change. First, genetic conservation at CP 2, but not at CP 3, was key to providing structural diversity of mt-genome complexes. Second, mutations were restricted in the structurally important tRNA and rRNA genes. These two biases against mutations at CP 2 and in RNA genes were reported earlier by Pereira et al. in 5,140 human mt-genomes [32]. Similar positive selection of CP 2 and tRNA genes was also reported among the mitochondrial genomes of other primates, including Macaca, Papio, Hylobates, Pongo, Gorilla, and Pan, whereby the constraint of selection was determined in each lineage by the ancestral state of each codon position [33]. Among non-protein-coding loci, all three HVS regions displayed a higher level of polymorphism, whereas the rRNA and tRNA genes showed lower levels of polymorphism. Our study, apart from providing a detailed listing of the diversity present among the five genome groups, showed that both the codon-level mutation bias and the restriction of mutations among RNA genes, previously reported at the global level, were also evident at the subpopulation level. Furthermore, given the ubiquitous variation in the mt-genome, genetic flexibility may have evolved as a mechanism to maintain OXPHOS under a range of environments.
3.2. Evolution of co-mutations
Beyond the clear and reasonable biases against polymorphisms at CP 2 in the protein-coding genes, our analysis indicated that other, similar biases were also evident. First, our results on intra- and inter-loci adaptations clearly suggested the dominance of polygenic mutations in the human mt-genome. At the protein level, the richness of co-ordinated mutations between mitochondrial complexes will affect protein-protein interactions within individual OXPHOS complexes as well as supercomplex interactions between electron transport chain complexes I, III, and IV in the respirasome. The communication among OXPHOS complexes is driven by highly constrained selection [34] and also by protein-protein interactions of the Mito-interactome [35]. Second, our analysis highlighted regions of the mt-genome rich in co-mutations and thus provided evidence of epistasis. In particular, a large portion of COX genes co-mutated in the AS and AM populations, whereas in the AF, EU and OC populations there was greater epistasis in functional regions of HVS. OXPHOS complexes consist of both nuclear and mitochondrial proteins, and nuclear-mitochondrial interactions are known to contribute epistatically to shaping mitochondrial evolution. However, epistasis has also been observed among mitochondrial genes; for example, the joint effect of two point mutations has been reported in a Han Chinese family [36]. These mutations are known to act synergistically, causing migration of gonadotropin-releasing hormone neurons [36]. Epistatic phenomena guided by homoplasy have also been widely observed among mitochondrial tRNA genes [37]. Other, similar mitochondrial co-mutations have been reported to play a role in mitochondrial diseases [38]. We broadly extended the information provided by the two-loci genome model by constructing nucleotide co-occurrence networks.
Third, like polymorphic sites, co-mutations also showed biases at the subpopulation level. Consistent with the proposed importance of mt-genome variation in human adaptation, regional haplogroups are generally founded by one or more functionally significant polypeptide, tRNA, rRNA, and control region variants. These variants are retained in the descendant mt-genomes, creating the haplogroups [39]. It was therefore intuitive to observe patterns of variants at the subpopulation level. Genome group-wise comparison of the co-mutations associated among mt-genome functional regions helped to classify the five human subpopulations into two prominent groups, i.e. {AF, EU, OC} and {AS, AM}. This result was supported by a global mt-genome mutational phylogeny which implicated a few distinct routes of human migration [31]. Asian haplogroup M and European haplogroup N arose from the African haplogroup L3 [40]. Haplogroup M gave rise to haplogroups A, B, C, D, G, and F [40], of which haplogroups A, B, C, and D populated East Asia and the Americas. In Europe, haplogroup N led to the European haplogroups H, J, T, U, and V [41], whereas haplogroups S, P, and Q are found in Oceania [31]. Overall, the variations probed by epistatic interactions revealed local preferences among different mt-genome loci. These local preferences might have helped not only in forming the closed assembly of OXPHOS complexes but also in differentiating subpopulations.
3.3. Discontinuous transition in nucleotide co-occurrence network
In the nucleotide co-occurrence networks, we found that α was the main edge filter when intersecting networks and that it also produced the optimum network in terms of size and architecture. In our network models, the emergence of sparse networks was not a smooth, gradual process: the very dense largest connected component collapsed into a sparse largest connected component through a discontinuous transition (Fig 3). We encountered this distinct phenomenon in all five genome groups. A similar critical phenomenon was first observed by Erdős and Rényi in their random network model, where the isolated nodes and tiny components observed for small ⟨k⟩ coalesce into a single largest connected component [24]. Such phenomena have also been observed in computer traffic networks and in biomolecular interaction networks in acute lung injury, among others [43, 44]. In those networks, as in ours, several smaller disconnected components were observed alongside the largest connected component. Interestingly, the nature of such transformations was earlier related to neutral genetic drift, and it was hypothesized that biological processes proceed through discontinuous transitions [42].
We selected edges for inclusion in the co-occurrence networks based on their best fit to network sparseness. Sparseness is an essential property of real-world networks, particularly of biological networks: these networks are usually large, and forming additional links carries an evolutionary cost, so links are more difficult to create. It is well known that co-mutational events are very selective events which require a group of supporting mechanisms to perform co-activity [19]. Previously, a similar approximation for the inclusion of edges was used in microbial interaction networks with methods such as WGCNA [45]. Using network sparseness or a similar data-driven approach avoids entirely arbitrary selection of network edges and provides a uniform rationale that can be applied to generate co-occurrence network structures across different genome datasets. Therefore, it was reasonable to choose the α value at which a network has both the lowest ⟨k⟩ and a largest component with a high count of nodes.
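The sparseness-based choice of α can be pictured as a scan over candidate thresholds that records the average degree ⟨k⟩ and the size of the largest connected component at each value. The sketch below is a hedged illustration, not the procedure used in this study. It assumes that every candidate edge carries a numeric score and is kept when that score is at least α, which may differ from how α is defined here. The names `largest_component_size` and `scan_thresholds` and the scored edge list are hypothetical.

```python
from collections import Counter


def largest_component_size(edges):
    """Number of nodes in the largest connected component (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    if not parent:
        return 0
    return max(Counter(find(x) for x in list(parent)).values())


def scan_thresholds(scored_edges, alphas):
    """For each alpha, keep edges with score >= alpha and report
    (alpha, node count, average degree <k> = 2E/N, largest component size).
    Only nodes that still have at least one edge are counted."""
    rows = []
    for alpha in alphas:
        kept = [(u, v) for u, v, score in scored_edges if score >= alpha]
        nodes = {n for edge in kept for n in edge}
        avg_k = 2 * len(kept) / len(nodes) if nodes else 0.0
        rows.append((alpha, len(nodes), avg_k, largest_component_size(kept)))
    return rows


# Hypothetical scored edges: two triangles joined by one weak link (3-4).
toy_scored = [(1, 2, 0.9), (2, 3, 0.9), (1, 3, 0.8), (3, 4, 0.5),
              (4, 5, 0.9), (5, 6, 0.9), (4, 6, 0.7)]
toy_rows = scan_thresholds(toy_scored, alphas=[0.4, 0.6, 0.85])
for alpha, n_nodes, avg_k, lcc in toy_rows:
    print(f"alpha={alpha:.2f}  N={n_nodes}  <k>={avg_k:.2f}  LCC={lcc}")
```

In the toy data, removing the single weak link halves the largest component in one step while ⟨k⟩ falls only slightly, a small-scale analogue of an abrupt change in the largest connected component. Plotting both quantities against α is one way to look for a threshold that keeps ⟨k⟩ low while retaining a large component.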
3.4. Hierarchical modularity
Following network construction, we used a modularity maximization algorithm to detect communities in the network. There was clear evidence for hierarchical modularity in our genome datasets, and the modular structure of the networks at all levels of the hierarchy was reasonably similar across genome groups, suggesting that mt-genome functional modularity is likely to be a replicable phenomenon. This study provided a complete listing of the current knowledge of mt-genome variation in the human population, including higher-level associations with hierarchical modules. Overall, we demonstrated that molecular changes, such as mutations, were not randomly distributed across the genome but were instead concentrated within modules. Modularity was one of the main features of the nucleotide co-occurrence networks, and evolutionary processes may favor the emergence of modularity through a combination of molecular interactions and natural selection [46]. Similar hierarchical modularity in brain networks has been related to functional regions of the brain, and subsets of brain functions have been reported to be associated with each hierarchy [47]. Therefore, it is reasonable to say that selection may favor modularity, allowing both the specificity and the autonomy of functionally distinct subsets of genomic positions. In this sense, the concentration of genomic positions within modules provides a way to understand module integration, favoring distinct functional roles developed by genomic positions in distinct modules. In the human mt-genome, modules were associated with mitochondrial subcomplexes that act in distinct steps of electron transport assembly and function. Thus, the closed assembly of mitochondrial complexes may favor the emergence of highly integrated genomic subunits, in which the effects of pairwise interactions may also trigger indirect effects on non-interacting genomic positions associated with the same function [34].
Based on these results, we would expect that genome positions connecting modules were more conserved across evolution or, at least, less prone to failures that alter their function.
3.5. Co-occurrence patterns among network modules
Module-level associations would be expected to reflect evolutionary relationships among the underlying genomic positions, since each module was made up of ancestrally similar genomic polymorphisms. Our module results added that the distinction between ancestral and derived mitochondrial polymorphisms was clear in only a few cases, in which the entire module was made up of ancestral-variant polymorphic sites. By contrast, a large number of modules (more than 90% of total modules) were made up of ancestral polymorphic sites. In addition, a large number of the nodes in mixed modules were of ancestral origin (S2 File). Based on the list of modules from all five networks, it is reasonable to assert that contemporary human mt-genome nucleotide bases most closely resemble the ancestral state and that very few of them are ancestral variants. This observation agrees with previous studies which found co-occurrence among nucleotide positions to be higher between genetically similar taxa [48]. This pattern was widely observed in our data, as both sub-level and top-level haplogroup markers were associated with each other in closed groups of network modules. Haplogroup-level association of genetic markers, particularly haplogroup T markers, was recently shown to be involved in the risk of colorectal cancer [49]. Therefore, these evolutionarily closed associations suggest that interactions among nucleotide positions might evolve within genetically related genomic polymorphisms (more likely to have similar functionality) in response to intra-species adaptation. Such associations between mutations have been reported to be a major driver of co-occurrence patterns in intra-species evolution. For instance, modules of H3N2 viral genome co-mutations were found to be important agents in mediating protein binding sites [19].
Beyond identifying network modules, understanding their formation would require extending the described approaches to quantify each module using the evolutionary information it possesses. Here, we used simple information carried by each genome position, namely its underlying ancestral marker information. Previously, this ancestral marker information has been used to define taxa (precisely, haplogroups) in the mitochondrial phylotree, which has provided an exact mapping of mitochondrial signatures to infer the routes of past human intra-species diversification events [50, 51]. Similarly, our nucleotide co-occurrence modules provide a detailed listing of mitochondrial co-mutations that are ancestrally associated.
4. Conclusion
We constructed and investigated human mt-genome nucleotide co-occurrence networks across geographical regions using a genomics and network theory framework. Our principal result was that mitochondria undergo substantial levels of co-mutational bias. Codon-level mutation bias, particularly at CP 2, and the restriction of mutations in RNA genes, both reported earlier for the global human population, were evident even at the continental level. The analysis highlighted regions of the mt-genome that were rich in co-mutations and thus provided evidence of epistasis. In particular, a large portion of COX and ND genes were found to co-mutate in the AS and AM populations, whereas in the AF, EU and OC populations there was greater epistasis between regions of HVS and ND. Our networks identified differences in epistasis and codon bias between human populations. From the co-occurrence network analysis of regional co-mutations, it was of great interest to investigate and verify the different co-occurrence patterns among mutations of various geographical regions. The analysis presented here can be extended to study the complexity of mt-genome evolution by forming various geographical groups, as well as to understand alterations in personal traits that contribute to the complexity of mt-genome evolution. Where information on genomic alteration over time is available, this understanding can provide the envelope of network information that encodes the changes occurring during disease progression.
Competing interests
The authors declare that they have no competing interests.
Data Availability Statement
All data sources and related information are given in the manuscript and the associated supporting information files.
Supporting information
S1 File. Information of genome samples considered in the study.
S2 File. Information of modules identified in the study.
S1 Table. Network properties of Largest connected component.
S2 Table. Network properties of all disconnected components together except Lcc.
S3 Table. Level-wise community detection in five nucleotide co-occurrence networks.
S3 Table. Comparison of polymorphism α pre and α post.
S4 Table. Distribution of modules having at least one polymorphic site among mt-genome functional groups. ND and COX functional groups had the highest participation among modules.
S5 Table. Modules comprising all nodes as ancestral lineage polymorphic sites. The count was dominated by ND and COX.
S1 Fig. Degree distribution of five nucleotide co-occurrence networks.
S2 Fig. Gene-wise comparison of polymorphism before and after α.
S3 Fig. Transition and transversion.
S4 Fig. Distribution of module sizes.
S5 Fig. Identification and characterization of network modules.
Author Contributions
Conceptualization: PS, SJ
Data curation: PS
Investigation: PS, HW
Methodology: PS, SJ, RKV
Supervision: SJ
Writing original draft: PS, SJ, HW, AZ, RKV
Acknowledgments
PS acknowledges the Inspire fellowship (IF150200) from the Department of Science and Technology (DST), Government of India. SJ acknowledges support from a grant of the Ministry of Education and Science of the Russian Federation (Agreement No. 074-02-2018-330) and grants from DST (EMR/2016/001921) and the Council of Scientific and Industrial Research (CSIR, 25(0293)/18/EMR-II), Government of India. RKV acknowledges the CSIR-NET fellowship (roll no.: 305089) from CSIR, Government of India.